1. Introduction
Intelligence (defined as Spearman’s g) is a main predictor of academic performance across domains (e.g., Deary et al., 2007; Watkins et al., 2007). It is considered one of the most thoroughly researched psychological constructs and thus offers a strong foundation for identifying gifted students (Rost & Sparfeldt, 2017). Although a modern understanding of giftedness goes beyond above-average intelligence, many definitions see it as a necessary condition or potential that enables individuals to contribute to societal development through their actions (e.g., Subotnik et al., 2011; Worrell et al., 2019). Furthermore, it is essential to consider domain-specific abilities, psychosocial skills, motivation, and environmental influences, along with strategies to address these factors, as they enable the individual to translate potential into transformative giftedness (Sternberg, 2024).
Subotnik et al. (2011) provided the following definition:
Giftedness is the manifestation of performance or production that is clearly at the upper end of the distribution in a talent domain even relative to that of other high-functioning individuals in that domain. Further, giftedness can be viewed as developmental, in that in the beginning stages, potential is the key variable; in later stages, achievement is the measure of giftedness; and in fully developed talents, eminence is the basis on which this label is granted. Psychosocial variables play an essential role in the manifestation of giftedness at every developmental stage. Both cognitive and psychosocial variables are malleable and need to be deliberately cultivated.
(p. 7)
Therefore, identification processes must consider both potential and achievement, must be designed in accordance with educational goals, and must employ multiple stages to shape programmes that foster learning strategies, motivation, and interest in specific domains (VanTassel-Baska, 2005, 2021). At the potential stage, broad cognitive abilities, such as fluid reasoning, should be measured to identify gifted individuals. However, current instruments mainly consist of domain-general tasks or abstract forms, which do not fully align with the domain-specific nature of giftedness. In the current study, contextualised matrix reasoning tasks for biology were developed and tested with students in grades 3–6 to assess their psychometric properties and ability to identify the potential of gifted students in biology.
Intelligence should be viewed not merely as a psychological attribute but as a network of diverse individual strengths and weaknesses across different broad cognitive abilities (McGrew et al., 2023). Fluid intelligence (Gf) is the reasoning ability that enables individuals to handle unfamiliar situations or problems, to solve problems, and to recognise patterns or relations (Cattell, 1963; Schneider & McGrew, 2022). It is usually measured using matrix reasoning tests. These tests involve tasks in which a matrix, typically 3 × 3 or 2 × 2, or a row of fields (e.g., 1 × 5) is presented, with one field left empty for participants to fill from a set of options. The completion of these tasks requires inductive, deductive, and relational reasoning, along with spatial visualisation and working memory (Carpenter et al., 1990; Prabhakaran et al., 1997). The most well-known examples are Raven’s Progressive Matrices (RPM; Raven et al., 1998), the Culture Fair Test (CFT 20-R; Weiß, 2006), the Bochumer Matrizentest (BOMAT; Hossiep & Hasella, 2010), and the Wechsler scales (WISC-V, WAIS-5; Wechsler, 2014, 2024).
Research at a domain-specific level has shown that Gf, assessed by matrix reasoning tests, plays a significant role in both mathematical and verbal skills (Peng et al., 2019). Ren et al. (2015) found that measuring fluid intelligence can provide insights into learning abilities, making matrix reasoning tests a valuable addition to educational assessments. Based on the results, educational practitioners can draw direct conclusions about important aspects of their students’ learning (
Klauer & Phye, 2008). In addition, there is an opportunity to gain deeper insights into the potential of their students beyond measures of academic achievement or performance (
Freund & Holling, 2011a,
2011b). Given this strong empirical foundation and the large number of available tests, educators rely heavily on general intelligence tests to identify gifted students. Nevertheless, researchers emphasise the importance of considering domain-specific skills, creativity, and non-intellectual factors such as special interests and motivation when identifying gifted students (
Subotnik et al., 2011). The main goal of gifted education is to foster all students according to their individual potential, considering real later-life outcomes (
Sternberg, 2002). This understanding aligns with the non-
g psychometric network analysis (PNA) approach in contemporary intelligence research, which posits that intelligence is formed through the interaction of multiple broad abilities (e.g.,
McGrew et al., 2023). Therefore, identifying giftedness should not rely on measuring a single, unidimensional construct to differentiate between gifted and non-gifted individuals. This also means that gifted students do not have to perform above average on every metric used (as a high IQ would require;
Peters et al., 2020). Instead, the individual expression of giftedness should be identified and fostered. To achieve this, broad cognitive abilities, such as Gf, must be assessed in an inclusive, culturally sensitive, and fair manner across all students (Holden & Tanenbaum, 2023).
Gifted individuals interested in a specific area are more likely to use their potential to develop domain-specific skills. Therefore, these are essential for giftedness identification because they indicate the development and expression of potential in a specific area of interest (
Subotnik et al., 2011;
VanTassel-Baska, 2005). The ability to perform scientific inquiry processes, epistemological views (Nature of Science), and practical skills (scientific working techniques) are core competencies in biology (
Arnold et al., 2014;
Wellnitz & Mayer, 2011). These include scientific reasoning skills like formulating questions and hypotheses, planning experiments, and drawing conclusions, as well as inquiry methods such as skills in observing, collecting data, or systematically controlling and varying variables in experiments or models (e.g.,
Bell et al., 2010;
Bybee et al., 2009;
OECD, 2023;
Pedaste et al., 2015). Research and debates on direct relationships between (fluid) intelligence and domain-specific abilities also extend to STEM subjects (science, technology, engineering, and mathematics;
Brookman-Byrne et al., 2019;
Richland et al., 2007;
Yuan et al., 2006).
Greiff and Neubert (
2014), in a study with 490 high school students, found evidence for the relationships between fluid intelligence, measured with the CFT 20-R (
Weiß, 2006), and complex problem solving (CPS) in scientific inquiry processes, as a domain-specific ability, assessed using MicroDYN tasks (
Greiff et al., 2012).
Scherer and Tiemann (
2014) confirmed the relationship between fluid intelligence and CPS in science among students in grades 8 to 10. Other studies suggested that intelligence has only weak to moderate correlations with ability domains such as scientific reasoning (
Sternberg et al., 2019). It is noted that the strength of the correlation is highly dependent on the measurement method. The results indicate that, for matrix reasoning tests to be used meaningfully in domain-specific research and identification processes, they must be valid and reliable (
Van Hoogdalem & Bosman, 2024).
At the potential level, mainly domain-general measures are currently used in giftedness identification processes, which are mostly supplemented with domain-specific achievement measures to serve the educational purpose. One emerging option is to contextualise intelligence measures within a specific field, thereby integrating domain-general and domain-specific cognitive processes (Roberts, 2007). In their study, Benit and Soellner (2012) adapted a matrix reasoning test for mechanical engineering and, with a sample of 360 university students, demonstrated that participants’ willingness to complete the test could be significantly increased. The measurement of domain-specific cognitive processes at the potential level currently receives little attention in giftedness identification, and there is a lack of validated, school-age, domain-specific matrix reasoning tests that accurately measure fluid reasoning.
In the current study, a domain-specific matrix reasoning test in biology was developed within a design-based research (DBR) framework to improve identification processes of gifted students in schools (
Peperkorn & Wegner, 2024). The DBR framework involves identifying a problem in educational practice, developing a prototype to address it based on a preliminary examination, and then evaluating and refining it through multiple research cycles. This process yields both practical outputs and contributes to existing theories (
Euler, 2014;
Shavelson et al., 2003). Therefore, matrix reasoning tasks were designed with biological themes to assess students’ fluid reasoning abilities in grades 3–6 using a domain-specific approach. The age group was selected because, during the transition from primary to secondary school, a decline in interest in science is observed (
Gebhard et al., 2017;
Potvin & Hasni, 2014). In Germany, students typically transition between 4th and 5th grade, although in two states, it occurs after sixth grade. Identifying and fostering gifted students is particularly important during this phase. The assessment is intended for classroom group administration in low-stakes environments. The primary objective of the study was to pilot the test, investigate its psychometric properties, and assess the quality of the developed items. Different forms of the newly developed matrix test were used in two studies. The matrix reasoning tasks were administered across different cohorts, and accompanying instruments were used to gather preliminary evidence of validity. Relationships with IQ and with skills in scientific inquiry processes in biology were examined to analyse the subject-specific contextualisation and the associations with subject-specific achievement. In addition, group comparisons were conducted to examine the suitability for subject-specific giftedness identification. The following research questions were posed:
RQ1. What is the psychometric quality of the developed domain-specific matrix reasoning tasks across both test versions?
RQ2. How are the results of the domain-specific matrix reasoning tasks related to IQ and abilities in scientific inquiry processes in biology?
RQ3. Do gifted students in an enrichment program show different ability levels in completing domain-specific matrix reasoning tasks compared to a control group?
2. Materials and Methods
The present study is a quantitative cross-sectional study aimed at examining the psychometric quality of domain-specific matrix reasoning tasks. The study was conducted in Germany as part of an enrichment program to foster gifted students in biology (Wegner et al., 2013) and in cooperating schools of the project. Two research cycles were conducted following the DBR methodology (Peperkorn & Wegner, 2024; Euler, 2014; Shavelson et al., 2003). In the first study, a 24-item test version was administered. In the second study, an expanded version comprising 60 items was used.
2.1. Participants
The total sample consisted of N = 895 students (41.7% female, 54.8% male, 3.5% N/A, mean age = 10.1 years) from the third to sixth grade. The overall sample was divided into two studies. After piloting the initial version, a second expanded test version was used. In both samples, the participants stemmed from two separate cohorts. The first cohort consisted of participants in an enrichment program (
Wegner et al., 2013) who were identified by their biology teachers. The second cohort was assembled from students attending partner schools. We used a non-probability convenience sample recruited through collaborating teachers. Students were recruited from primary schools (grades 1–4) and from secondary schools, namely academic-track secondary schools (Gymnasium; grades 5–13), from urban and rural areas. No participants of the enrichment program were part of the second cohort.
The sample of the first study included n1 = 470 students (42.3% female, 55.7% male, 2.0% N/A, mean age = 10.08 years). Of these, 373 were participants from the enrichment program, and 97 were students from participating schools. In the first study, we contacted six teachers from three different schools within our collaboration network. Three teachers agreed to participate, yielding a 50% participation rate. This resulted in four classes (grades 5–6) participating. All students present in the participating classes were invited. A total of 104 students were eligible, 99 provided parental consent, and 97 completed the assessment, resulting in a student participation rate of 93.3%. Two students who provided consent were absent during the study.
The sample of the second study included n2 = 425 students (40.9% female, 53.9% male, 5.2% N/A, mean age = 10.12 years). Of these, 341 were participants from the enrichment program, and 84 were students from participating schools. In the second study, we contacted three teachers from two different schools. All teachers agreed to participate, yielding a 100% participation rate. This resulted in three classes (grades 5–6) in which all students present were invited. A total of 87 students were eligible, 84 provided parental consent, and all completed the assessment, resulting in a student participation rate of 96.6%.
2.2. Data Collection Tools
2.2.1. Domain-Specific Matrix Reasoning Test for Biology
Domain-specific matrix reasoning tasks with biological references were developed. Biological forms, such as animals, plants, natural phenomena, or laboratory materials, replaced common abstract shapes. Items were created in four formats (see
Figure 1), including 2 × 2, 3 × 3, and 1 × 5 matrices, as well as patterns with cut-out fields to be completed (
Matzen et al., 2010). Five answer options were created for each item, including one correct answer and four distractors. This choice aligns with standard practices in school-age matrix reasoning tests (e.g., WISC-V; CFT 20-R; RPM). To ensure fairness across diverse student groups, the nonverbal items were designed so that their solutions do not require prior knowledge. The items can be enlarged at will or displayed in grayscale. For the first version, 24 items were developed. These were presented in a fixed order in the first study. In the second study, the initial item pool of 24 items was expanded to 60 items, and the items were presented in randomised order. The students were introduced to the items by answering trial items of each type and receiving automatic feedback on whether their answers were correct. Item 1 was used as an example during test administration and was therefore excluded from all analyses. The items were created and administered digitally using LimeSurvey, with processing on tablets (2732 × 2048 pixels). During administration, participants received no feedback on the correctness of their answers and were unable to skip items or revise submitted answers. Participants were given 20 min to complete the test in both versions.
2.2.2. Raven’s Progressive Matrices 2, Clinical Edition—German Short Form (NCS Pearson, 2019)
Participants’ IQ was assessed using the German digital short form of Raven’s Progressive Matrices (
NCS Pearson, 2019). The matrix reasoning test is used to evaluate overall cognitive abilities, with a primary focus on fluid intelligence. It has been validated with a European norm sample and is appropriate for assessing individuals aged between 4:0 and 69:11 years. The digital short form demonstrated a test-retest reliability between r = 0.79 and r = 0.81 (NCS Pearson, 2019). The test duration is limited to 20 min.
2.2.3. Abilities in Scientific Inquiry Processes in Biological Research Contexts
The assessment was used to evaluate the students’ abilities in scientific inquiry processes as domain-specific abilities. Scientific inquiry was assessed using the VerE model (
Nowak et al., 2013). This theoretical framework encompasses scientific reasoning, including the ability to formulate hypotheses and research questions, plan and perform investigations, and draw conclusions; as well as inquiry methods such as observing, comparing, and arranging, experimenting, and modelling, as overlapping dimensions. The items consisted of a brief description, a visualisation, and a question about the biological phenomenon, model, or experiment. The instrument comprised 54 multiple-choice items, of which 18 were administered in a digital test version. The test instrument was developed as part of the research project. Item analyses were conducted and psychometric properties were assessed for the current sample (grades 3–6; KR-20 = 0.632–0.742). DIF analyses for gender and grade level, divided by student group, showed no significant DIF for any of the items. A translated example was as follows: “A chameleon is observed in its terrarium. You can see that the chameleon turns darker as soon as a conspecific approaches, it is fed, or touched. Which assumption can be verified through the described observation? The chameleon (a) changes its colour in different situations to communicate (correct answer); (b) only changes colour when threatened; (c) only changes colour when it is hungry; (d) becomes brighter when it is touched.” To prevent excessive demands through the description texts, a read-aloud function has been implemented. The assessment was also administered via LimeSurvey and took approximately 25 min to complete.
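For readers less familiar with the KR-20 coefficient reported above, the following minimal sketch shows how it can be computed from dichotomously scored responses. The function name `kr20` and the toy response matrix are illustrative assumptions and are not taken from the study's data or analysis scripts.

```python
def kr20(responses):
    """Kuder-Richardson 20 for a persons x items matrix of 0/1 scores.

    `responses` is a list of rows (one per person); every row holds one
    0/1 score per item. Complete data are assumed in this sketch.
    """
    n_persons = len(responses)
    n_items = len(responses[0])

    # Proportion of correct answers per item (p_j); q_j = 1 - p_j.
    p = [sum(row[j] for row in responses) / n_persons for j in range(n_items)]
    sum_pq = sum(pj * (1 - pj) for pj in p)

    # Variance of the total scores, using the same divisor (n) as p_j * q_j.
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_persons
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons

    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


# Hypothetical toy data: 5 persons x 4 items.
toy = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(toy), 3))
```

With real data, missing or not-reached responses would need explicit handling before the coefficient is computed.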
2.3. Procedure
The administration was conducted in two different settings. The cohort of participants in the enrichment program was surveyed during their project courses, with groups consisting of 15–20 students. The participants were equipped with tablets. The control cohort completed surveys in groups of 25–30 students during lessons at the participating schools. If school-owned tablets were available, they were used. Otherwise, the research team provided participants with tablets. The surveys were conducted following a standardised procedure. Each survey session took approximately 30 min, including greetings, introductions, process explanations with a trial item, execution, and farewell. Testing was conducted by trained research staff using a standardised protocol. To ensure comprehension, examiners read the general instructions aloud to all students, administered trial items with feedback, and explained that students could choose and, if necessary, change an answer before submitting. It was explained that submitted answers cannot be corrected. During the test, procedural support was limited to rereading or paraphrasing the general instructions from the protocol, clarifying the response format (e.g., “choose the answer that completes the pattern”), and reminding students that no subject-matter knowledge was required. Examiners were explicitly instructed not to provide hints to the correct answer and not to confirm whether an answer was correct. Assistance procedures were identical across groups to preserve comparability. Written consent has been obtained from a parent or legal guardian of all participants. All participants were briefed on the purpose of the research and were informed that their participation was voluntary. They could withdraw at any time. The study was reviewed and approved by the ethics committee at Bielefeld University (approval number: 2025-256; approval date: 27 August 2025).
2.4. Data Analysis
To address the first research question, an item analysis using item response theory (IRT) was conducted for both test versions. A Rasch model (1PL) was estimated to determine the reliability of the expected a posteriori (EAP) and mean weighted likelihood estimation (WLE; Warm, 1989) estimates, item difficulty (b), and the item-fit statistics, namely the weighted and unweighted mean-square (MNSQ) values and the z-standardised statistic (ZSTD). Additionally, the Kuder-Richardson 20 formula (KR-20) was used. The unidimensionality of each of the two versions was evaluated using Principal Component Analysis (PCA) of the residuals (
Linacre, 1998), and local dependencies were assessed using Yen’s Q3 method (
Yen, 1984). In both versions, the PCA showed that the eigenvalue of the first residual contrast was <2.0, indicating that the residuals do not form a meaningful secondary dimension and thus support unidimensionality. For the initial version, Yen’s Q3 method showed a raw mean Q3 of −0.04 and an adjusted mean Q3 of less than 0.001. Item-specific analysis showed that all mean absolute Q3 values were <0.20 (Max = 0.112) with a maximum of two violations observed across all item pairs. Similarly, for the second version, Yen’s Q3 method showed a raw mean Q3 of −0.018 and an adjusted mean Q3 of less than 0.001. Item-specific analysis showed that all mean absolute Q3 values were <0.20 (Max = 0.077), with a maximum of two violations observed across all item pairs as well. Given these generally low residual correlations, the assumption of local independence was considered to be sufficiently met in both versions, and unidimensionality was further verified. In addition, uniform differential item functioning (DIF) analyses for the second version of the matrix reasoning test were conducted across the different cohorts (enrichment/control), genders (male/female), and school levels (primary/secondary level). We have decided to use school levels rather than grade levels to align the analysis with the key curricular transition and to better capture children’s progress over time. Grade-by-grade DIF may misattribute expected growth to bias and is influenced by small cell sizes in our data, thereby reducing stability and statistical power (
Penfield, 2001). For this, the Mantel-Haenszel DIF procedure (MH;
Mantel & Haenszel, 1959) with iterative anchor purification was used. The robustness of the results was examined using Lord’s chi-square DIF test (
Lord, 1980) and Raju’s area-based DIF test (
Raju, 1988). The p-values were adjusted using the Benjamini-Hochberg method (α = 0.05; Benjamini & Hochberg, 1995). DIF effect size was evaluated according to Zwick et al. (1999): ΔMH units < 1 indicated a negligible effect, 1 < ΔMH units < 1.5 indicated a slight to moderate effect, and ΔMH units > 1.5 indicated a large effect.
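The Mantel-Haenszel effect size and the p-value adjustment described above can be sketched as follows. This is a simplified illustration, not the analysis code used in the study: it omits anchor purification and the significance test, applies the thresholds to the absolute value of ΔMH, and assigns the boundary values 1.0 and 1.5 to the higher category (both assumptions). All function names and the example counts are hypothetical.

```python
import math


def mh_delta(strata):
    """Mantel-Haenszel delta for one item.

    `strata` holds one tuple (a, b, c, d) per matched score level:
    a/b = reference group correct/incorrect, c/d = focal group
    correct/incorrect. Returns -2.35 * ln(common odds ratio).
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        num += a * d / n
        den += b * c / n
    return -2.35 * math.log(num / den)


def classify_delta(delta):
    """Label the effect size using the thresholds quoted in the text."""
    size = abs(delta)  # assumption: thresholds refer to the absolute value
    if size < 1.0:
        return "negligible"
    if size < 1.5:
        return "slight to moderate"
    return "large"


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts for one item at a single score level.
delta = mh_delta([(30, 10, 20, 20)])
print(round(delta, 2), classify_delta(delta))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

An item would then be flagged only if its adjusted p-value falls below α = 0.05, with the ΔMH category describing the magnitude of the effect.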
For the second research question, the results of the matrix reasoning test, the scientific inquiry assessment, and the IQ were correlated, using Spearman’s ρ (Shapiro-Wilk: p < .05). For this, the person’s ability parameters (θ) were used for the matrix reasoning test and the scientific inquiry assessment. Not-reached responses were recorded as missing and did not affect the likelihood. The correlation analysis was conducted separately for the third and fifth grades to account for age-standardisation of the IQ data.
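As an illustration of the rank-based correlation used here, the sketch below computes Spearman's ρ as the Pearson correlation of average ranks. The ability and IQ values are invented for demonstration, and the pairwise deletion of missing values (`None`) is an assumption of this sketch rather than a documented step of the study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    # Assumption: cases with a missing value on either variable are dropped.
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    rx = average_ranks([a for a, _ in pairs])
    ry = average_ranks([b for _, b in pairs])
    n = len(pairs)
    mean_x = sum(rx) / n
    mean_y = sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Hypothetical person abilities (theta) and IQ scores for five students.
theta_mr = [-0.8, 0.1, 0.4, 1.2, 1.9]
iq = [92, 101, 98, 115, 120]
print(round(spearman_rho(theta_mr, iq), 2))
```

Because only ranks enter the computation, the coefficient does not depend on the normality assumption that the Shapiro-Wilk test rejected.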
To answer the third research question, we estimated an exploratory latent regression Rasch model for the two different cohorts (enrichment/control). For further verification of the results, we estimated a multiple-group Rasch model using marginal maximum likelihood with EAP ability estimates. To account for potential measurement non-invariance, we employed Stocking-Lord linking (Stocking & Lord, 1983), assuming complete non-invariance as a robust approach (Baghaei & Robitzsch, 2025; Robitzsch & Lüdtke, 2020). This method enables item response modelling with multiple groups across different item sets because the item parameters for both sets are brought onto a common scale. In our case, this allows us to compare both groups (enrichment/control) across both test versions.
4. Discussion
The present study investigated the psychometric properties of contextualised matrix reasoning tasks, featuring biological forms and representations, to assess students’ fluid reasoning abilities in grades 3–6 as part of giftedness identification in biology. In the following, we first discuss the findings from the correlation analysis and the group comparison, before detailing the psychometric properties of the tasks.
For the primary grade level, the correlation analysis indicated a moderate association between θMR and IQ. The strength of the correlation indicated at least a meaningful overlap of the tested latent abilities. The developed matrix reasoning test showed moderate convergent validity with fluid intelligence at the primary grade level. The moderate correlation between θMR and θSI, and the weaker correlation between θSI and IQ, suggest that the domain-specific adaptation was effective and that individuals who score higher on SI also tend to score higher in the domain-specific matrix reasoning tasks. These correlations were also evident at the secondary grade level. Here, strong correlations were observed between θMR and IQ, as well as between θMR and
θSI, indicating substantial overlap and convergence. Consistent with developmental psychology research, MR correlated similarly with IQ and SI at secondary grade levels. For the MR–IQ association, adolescent development in domain-general executive functions and fluid reasoning enhances reliance on cognitive processes fundamental to both measures, such as pattern abstraction, rule induction, and working memory coordination. A shift toward more analytical, rule-based strategies also strengthens the alignment with IQ (e.g.,
Best & Miller, 2010). In the MR–SI correlation, secondary school science teaching develops inquiry skills such as identifying patterns, evaluating evidence, and managing variables, as well as representational fluency, such as interpreting biological diagrams and graphs. Since our MR included biology-based stimuli, the lower content novelty and common analytic demands might further connect MR to SI (e.g.,
Zimmerman, 2007). The higher correlations observed at the secondary level may be due to older students being more experienced in dealing with biological illustrations and common representations. Nevertheless, the newly developed test showed no redundancy with Raven’s Progressive Matrices 2 (
NCS Pearson, 2019). The strong correlation between θMR and θSI confirmed the primary-grade results and indicated that adapting the matrix reasoning tasks for biology was successful.
Group comparisons between students in the enrichment program and the control group, estimated using the latent regression Rasch model, indicated that students in the enrichment program outperformed control group students. Accordingly, the measurement direction of the test instrument could be confirmed. In the Rasch latent regression that included cohort (enrichment vs. control) as a binary predictor, the estimated group effect was β = −0.33 logits, corresponding to a standardised mean difference of d ≈ 0.33, which is generally considered small to moderate (
Cohen, 1988). Because the groups were unbalanced (about 75% enrichment and 25% control), the maximum explanation of between-person variance by this contrast was limited. With current proportions, an effect of this size is expected to explain about 2% of the variance (
Rosenthal & Rosnow, 2008). Consistent with this, the model produced an R² of 0.03 for θ. The small difference between the expected and the estimated value is likely due to sampling variability and to the latent variance being estimated rather than fixed. The estimated multigroup Rasch model confirmed the initial indications of known-groups validity. Nonetheless, to improve the discriminative ability of the test instrument, items should be prepared for adaptive testing in future studies (e.g.,
Weiss & Vale, 1987).
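The statement that an effect of d ≈ 0.33 explains about 2% of the variance under a 75/25 group split can be checked with a standard conversion from a standardised mean difference to a point-biserial correlation, r = d / sqrt(d² + 1/(pq)), where p and q are the group proportions. Whether the authors used exactly this formula is an assumption; the function name `d_to_r` is illustrative.

```python
import math


def d_to_r(d, p_group1):
    """Convert Cohen's d to a point-biserial r for group proportions p and 1 - p."""
    q_group2 = 1.0 - p_group1
    return d / math.sqrt(d ** 2 + 1.0 / (p_group1 * q_group2))


d = 0.33

# Unbalanced design as described in the text (about 75% enrichment, 25% control).
r_unbalanced = d_to_r(d, 0.75)
print(f"unbalanced: r = {r_unbalanced:.3f}, r^2 = {r_unbalanced ** 2:.3f}")

# For comparison, the same effect in a perfectly balanced design.
r_balanced = d_to_r(d, 0.50)
print(f"balanced:   r = {r_balanced:.3f}, r^2 = {r_balanced ** 2:.3f}")
```

The comparison shows why the unbalanced design caps the variance that the cohort contrast can explain: the same d yields a smaller r² when the groups differ in size.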
The item analysis showed that the second test version exhibited acceptable item parameters and fit indices. Reliability could be significantly improved by expanding the test instrument from the first to the second version. The reliability of the second version was within an acceptable range but should be improved through further adjustments. The targeting between item difficulty (b) and person ability (θ) was improved in the second version. Although the difficulty range was generally well covered, some gaps in item coverage at certain trait levels were still apparent (Lord, 1980; see Figure 2). Extreme items might be removed or adjusted to heighten their informational value. Considering their application in giftedness identification, high discriminative power and test information are essential for the valid use of matrix reasoning tasks (
Rost & Sparfeldt, 2017). In the second version, item–person targeting was enhanced by expanding coverage of item difficulties throughout the observed ability range (see Figure 2). However, coverage was less dense at the extremes and sparse in the lower mid-range. In practice, this entails a risk of reaching ceiling performance within the highest-ability subgroup and diminished discrimination for individuals with lower-to-mid abilities. To address these gaps, future revisions should expand the item pool to include very difficult items (b > 2.5), very easy items (b < −2), and a small set targeting the lower-mid range (−1.0 < b < 0) to ensure discrimination across all ability levels. Because of the identified targeting gaps, correlations and group comparisons should be interpreted with caution. These gaps increase the conditional standard error of measurement (SEM) and attenuate group differences and correlations (
Lord, 1980). The analysis of fit indices revealed that all items met the infit criterion for high-stakes tests (
Wright & Linacre, 1994). A total of six items showed standardised values outside the criterion for outfit (ZSTD < −1.96), which indicated overfitting. This could enable shortening the instrument and eliminating redundant items. However, these findings require verification through additional studies. A short instrument that does not overwhelm participants would be advantageous for identifying gifted students in schools, particularly among younger age groups. The two items that showed underfitting for the outfit (ZSTD > 1.96) should be monitored in future research, as their increased difficulty may have led participants to guess excessively. Since the infit values met the criterion and difficult items are essential for talent identification to distinguish high-ability individuals, there is no reason to revise these items. For item 3, all three methods identified significant DIF across different cohorts, favouring the control group in the first version. In the second version, no significant DIF was detected for this item. However, this item warrants special attention in future studies, as do items 12, 15, and 27, where non-significant but noteworthy Δ
MH values were observed (see Table 4). These differences could have methodological causes. Although the instrument was administered according to a set procedure, the cohort of enrichment students was surveyed as part of the project courses. In contrast, the control group students were surveyed during lessons at school. The survey settings differed in group size and atmosphere. The items in question showed extreme b-parameters, indicating they were at the lower or upper ends of the difficulty scale. The direction of DIF varied across items and did not follow a consistent pattern, suggesting that the observed effects are more likely caused by item-level properties than by a consistent group-related bias. Analysing the cognitive processes and solution strategies associated with these items may provide more detailed explanations and help guide targeted item revisions (e.g., Laurence & Macedo, 2023). Overall, the developed matrix reasoning tasks showed negligible or no significant DIF between the cohorts, strengthening the test instrument’s quality and practical applicability. In addition, no indications of DIF were found for gender or school level. The developed items demonstrated acceptable characteristics for further research.
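To make the infit and outfit statistics discussed above more concrete, the following sketch computes both mean-square values for a single item under the Rasch model, given person abilities and an item difficulty. It uses the textbook definitions (outfit as the mean of squared standardised residuals, infit as the information-weighted version) and does not compute the ZSTD transformation. The function names and example values are illustrative assumptions, not the study's code.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_fit(responses, thetas, b):
    """Return (infit MNSQ, outfit MNSQ) for one item.

    `responses` holds 0/1 scores (None = missing), `thetas` the matching
    person abilities, and `b` the item difficulty.
    """
    sum_sq_resid = 0.0   # sum of squared raw residuals
    sum_variance = 0.0   # sum of model variances p * (1 - p)
    sum_z_sq = 0.0       # sum of squared standardised residuals
    n = 0
    for x, theta in zip(responses, thetas):
        if x is None:
            continue  # missing responses do not contribute
        p = rasch_prob(theta, b)
        variance = p * (1.0 - p)
        sum_sq_resid += (x - p) ** 2
        sum_variance += variance
        sum_z_sq += (x - p) ** 2 / variance
        n += 1
    infit = sum_sq_resid / sum_variance
    outfit = sum_z_sq / n
    return infit, outfit


# Hypothetical example: one able student unexpectedly fails an easy item.
thetas = [-1.0, 0.0, 0.5, 1.0, 3.0]
responses = [0, 1, 1, 1, 0]
infit, outfit = item_fit(responses, thetas, b=0.0)
print(f"infit = {infit:.2f}, outfit = {outfit:.2f}")
```

Because outfit is unweighted, a single unexpected response from a person far from the item's difficulty (for example, a lucky guess on a very hard item) inflates it much more than infit, which matches the pattern described for the underfitting items.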
When interpreting the results of this study, several limitations should be considered. The results of the DIF analysis were obtained using the MH procedure (
Mantel & Haenszel, 1959). This method assesses only uniform DIF. Therefore, Lord’s chi-square test and Raju’s area method were used for robustness checks. Here, three items showed significant values, indicating possible non-uniform DIF. These items should be monitored in future studies. Although the items include contextualised forms, solving them does not require understanding the forms’ content; this design is intended to avoid bias against groups with different levels of knowledge. We lack information about our participants’ ethnic-racial identity and socio-economic background. The fairness of the developed test across ethnic and racial groups should therefore be examined in future research (
Holden & Tanenbaum, 2023). Furthermore, only academic-track schools (Gymnasium) participated. To develop an instrument suitable for all school types and examine potential bias, students from diverse educational backgrounds should be included in future validation efforts. Although both rural and urban schools participated in the study, potential DIF between groups from these areas warrants further investigation. This study depended on existing partnerships, which led to unequal group sizes and a non-probability, clustered sample. Although we standardised the administration, residual confounding may remain, and limited precision for smaller groups cannot be entirely excluded. Future research should incorporate stratified recruitment to improve balance and enhance generalisability. For further analyses addressing the second and third research questions, we decided to retain all items in the data pool to obtain a comprehensive picture of the developed items. This procedure may have introduced distortions: slightly different results might have been obtained with a more rigorous approach that eliminated the flagged items. For the correlation analysis, it should be noted that the calculated IQ was derived from the Raven’s Progressive Matrices 2—Short Form (
NCS Pearson, 2019) and did not include person-ability values. Furthermore, a large proportion of students in the correlation analysis sample came from the enrichment program, which may have affected the results. When calculating the person ability parameters (
θ) for the matrix reasoning test and the scientific inquiry assessment, we treated no-response answers as missing data, which were not included in the calculations. This approach reduced information and increased SE(
θ); it avoided penalising construct-irrelevant non-responses but might still have biased the results of the correlation analysis. Furthermore, the reliability of the instrument used to assess abilities in scientific inquiry in the present sample was questionable. When comparing the two cohorts, it is important to note that the students in the enrichment program were chosen by their biology teachers. This selection provides only a weak criterion for known-groups validity, as teachers have difficulty identifying students with domain-specific giftedness (
Bergold, 2014;
Machts et al., 2016;
Urhahne & Wijnia, 2021). Additionally, the cohorts being compared differ greatly in size, so considerably more information was available for estimating students’ abilities in the enrichment project. The small number of control students limits the informative value of the comparison; the control sample should be enlarged in future studies.
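The effect of treating non-responses as missing on SE(θ) can be illustrated with a short sketch. Assuming a dichotomous Rasch model (an assumption made here for illustration), the test information at θ is the sum of p(1 − p) over the items a person actually answered, and SE(θ) is the inverse square root of that sum. The function names and the toy item difficulties below are hypothetical and do not reproduce the estimation routine used in this study.

```python
import math


def rasch_probability(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def se_theta(theta, difficulties, answered=None):
    """Asymptotic standard error of theta based on the answered items only.

    answered: optional list of booleans; False marks a non-response that is
              treated as missing and therefore contributes no information.
    """
    if answered is None:
        answered = [True] * len(difficulties)
    information = 0.0
    for b, is_answered in zip(difficulties, answered):
        if is_answered:
            p = rasch_probability(theta, b)
            information += p * (1.0 - p)  # Rasch item information
    if information == 0.0:
        raise ValueError("no answered items, SE(theta) is undefined")
    return 1.0 / math.sqrt(information)


# Toy example: four items of equal difficulty, person located at theta = 0.
difficulties = [0.0, 0.0, 0.0, 0.0]
se_all_answered = se_theta(0.0, difficulties)
se_two_missing = se_theta(0.0, difficulties, answered=[True, True, False, False])
```

In this toy case, omitting half of the responses halves the test information, so the standard error grows by a factor of √2. The ability estimate itself is not pulled downwards by the omitted items.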
The results of the present study provide initial evidence that the domain-specific adaptation of matrix reasoning tasks was successful in the biological domain. Nevertheless, the mechanisms underlying this effect must be examined more thoroughly in subsequent studies. Experimental contrasts (e.g., contextual versus abstract matrices) should be used to assess the effects of contextualisation on solving strategies, cognitive demands, and student engagement. Additionally, automatically measured process indicators (e.g., response time;
Wise & Kong, 2005) should be examined, and speededness effects should be controlled. To determine whether the stronger correlation with the domain-specific comparison instrument (SI) is attributable to domain-specific skills, knowledge, interests, or increased test-taking motivation, additional test instruments should be used. To enhance the generalisability of the results, it would be appropriate to compare them with matrix reasoning tasks adapted for other STEM subjects or with existing findings in other domains (e.g.,
Benit & Soellner, 2012). To further explore the initial results regarding convergent validity, additional intelligence tests (e.g.,
Wechsler, 2014), measures of fluid intelligence (e.g.,
Weiß, 2006), and measures of domain-specific abilities (e.g.,
Darman et al., 2024;
Greiff & Neubert, 2014) should also be utilised.
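As an illustration of how a response-time-based process indicator could be operationalised, the sketch below computes a response time effort index in the spirit of Wise and Kong (2005): the proportion of items on which a test taker’s response time reaches an item-specific threshold. The function name, the three-second threshold, and the toy response times are assumptions chosen for illustration; suitable thresholds would have to be determined empirically for the present items.

```python
def response_time_effort(response_times, thresholds):
    """Proportion of items answered with solution behaviour.

    response_times: seconds spent on each item by one test taker
    thresholds:     item-specific time thresholds in seconds (assumed values);
                    responses faster than the threshold count as rapid guesses
    """
    if not response_times or len(response_times) != len(thresholds):
        raise ValueError("need one threshold per response time")
    solution_behaviour = [
        1 if rt >= th else 0 for rt, th in zip(response_times, thresholds)
    ]
    return sum(solution_behaviour) / len(solution_behaviour)


# Toy example: two of four responses are faster than the assumed threshold.
times = [12.0, 1.5, 8.0, 0.9]
thresholds = [3.0, 3.0, 3.0, 3.0]
rte = response_time_effort(times, thresholds)
```

Values close to 1 indicate that the test taker engaged with nearly all items, whereas low values point to frequent rapid guessing. Such an index could be reported alongside the ability estimates.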
In summary, the results of the present study show that the developed item pool exhibited acceptable psychometric properties. Nevertheless, further validation should be conducted in future research to examine the test instrument’s measurement quality more thoroughly and to make additional adjustments. Particular attention should be given to the appropriateness for identifying domain-specific giftedness in school. Self-report, think-aloud, or eye-tracking studies used to explore cognitive strategies could offer deeper insights (
Laurence & Macedo, 2023). Furthermore, future endeavours should examine methodological decisions, such as the inability to revise answers, the lack of response feedback, and the available time (
Frey et al., 2024), to continually optimise the instrument’s design (
Peperkorn & Wegner, 2024;
Euler, 2014;
Shavelson et al., 2003). The development of an item bank that enables adaptive testing and the creation of different test versions can further improve the test’s discriminatory ability and enhance its practical usefulness for giftedness identification procedures with regard to DBR goals (e.g.,
Chierchia et al., 2019;
Pallentin et al., 2023). In identifying gifted students, it is important to consider multiple broad ability domains (e.g.,
McGrew et al., 2023), specific subject skills, creativity, and non-cognitive personality traits (e.g.,
Sternberg, 2024). The findings of the present study support the idea that measuring the broad ability area
Gf with contextualised matrix reasoning tasks can potentially improve identification processes. Future work should broaden the theoretical scope by exploring transfer to other STEM contexts and conducting cross-site studies.