Changing the Order of Factors Does Not Change the Product but Does Affect Students’ Answers, Especially Girls’ Answers

Abstract: This study explores how different formulations of the same mathematical item may influence students’ answers, and whether boys and girls are equally affected by differences in presentation. An experimental design was employed: the same stem-items (i.e., items with the same mathematical content and question intent) were formulated differently and administered to a probability sample of 1647 students (grade 8). All the achievement tests were anchored via a set of common items. Students’ answers, equated and then analysed using the Rasch model, confirmed that different formulations affect students’ performances and thus the psychometric functionality of items, with discernible differences according to gender. In particular, we explored students’ sensitivity to the effect of a typical misconception about multiplication with decimal numbers (often called “multiplication makes bigger”) and tested the hypothesis that girls are more prone than boys to be negatively affected by this misconception.


Introduction
Differences in mathematical performance between boys and girls have received increasing attention over the years. Although the gap has narrowed over time, the issue is still topical since the differences continue to persist in many countries, as was reported by OECD-PISA (Organisation for Economic Co-operation and Development Programme for International Student Assessment) in 2015: "On average across OECD countries, boys outperform girls in mathematics by eight score points. Boys' advantage at the mean is statistically significant in 28 countries and economies" [1].
Most of the research studies carried out on this topic have used national or international large-scale assessment results and have operationalised gender differences as a reason behind the gap in mathematics test scores observed in relation to the entire test (e.g., [2][3][4][5]). Nevertheless, this perspective merely glances at gender differences, providing a snapshot of the gap between genders at some point or relating gender differences to other factors such as background and metacognitive aspects but failing to provide didactic information about the nature of these differences (differences that usually disadvantage girls more than boys), or to explain whether these differences are typically related to just some items or may concern all the test items. In this direction, part of the literature explores gender differences in relation to specific sub-domains of mathematical ability (for example, arguing that boys outperform girls in spatial ability and, more generally, in geometry items; e.g., [6,7]), other works at item level find a correlation between item difficulty and gender differences (e.g., [8,9]), and, finally, some studies examine the influence of item type in relation to gender (for instance, showing that boys outperform girls in multiple-choice items rather than constructed-response items, in which girls display better results; e.g., [10][11][12][13]). Less research has been carried out on possible relationships between task formulation and gender differences in relation to specific items, especially from a didactic perspective, considering the didactic milieu either as involved in the causes, or as an actor participating in the resolution. Let us state explicitly that throughout this paper, we use the term "gender" to indicate the result of the boys/girls classification used in the reports of the entities which have performed the studies, and the official registration of pupils used in Italian schools; it is a registry classification.
To explore these possible relationships, starting from a mathematics achievement test developed by the Italian National Institute for the Evaluation of the Educational System (hereafter INVALSI, Istituto Nazionale per la Valutazione del Sistema di Istruzione e Formazione) to measure students' ability in math at grade 8, we implemented an experimental plan. We prepared four booklets sharing some items, which are identical in all the booklets and compose the Core Test, while the remaining items are the same stem-items (i.e., items with the same mathematics content and the same question intent) formulated differently in each booklet. These variations were constructed to test specific hypotheses from mathematics education, to explore if, and how, different formulations (mis)lead students' answers (and possibly problem-solving strategies), and subsequently to verify if and how this mechanism interplays with students' features (such as, for example, gender). In contrast to most of the current literature, which is based on gender differences displayed over the entire test, and as previously recommended, for example, by [14], we explored gender differences at item level, i.e., comparing the probability of encountering each item successfully by boys and girls matched on ability, via Rasch differential item functioning (DIF) analysis. When we use the term "ability", related to the INVALSI test or to our experiment, we mean the latent trait measured globally by the test.
Our methodological strategy is two-fold: it tests didactic hypotheses about students' strategies and gathers some information about the effect of formulation and the activation of certain cognitive strategies over others, thus providing information about the relationship between formulation and item functionality from a psychometric point of view.
In this paper, we present an example of the analysis we carried out. Specifically, we explore the effect of a typical misconception about the multiplication of decimal numbers [15]. This misconception, often called "multiplication makes bigger" [16] (p. 37), emerges during the transition between natural numbers and decimal numbers: when operating with natural numbers, students see that the result of a multiplication is bigger than its factors, and they then come to believe that this property of multiplication also holds when they multiply rational numbers. This misconception leads students to make mistakes in secondary school as well [15]. Previous studies have already shown that, in Italy, girls tend to conform their problem-solving strategies to didactic practices more than boys and could thus be more prone to the negative effect of misconceptions [17][18][19][20][21]. More generally, it is known that factors strictly related to the didactic choices of the teacher and of the school system have an impact on the gender gap in mathematics. For instance, curriculum variables [22], teaching methods [23], different assessment practices [24], and factors related to achievement goals [25] are determinants of the emergence of gender differences in mathematical performance, and misconceptions in mathematics are related to intuitive models created during the didactic dialectic in the classroom [26]. The misconception we are exploring is evidently related to the model of multiplication as repeated addition, and hence to the order of the factors. In Italian, the first factor is the quantity to be multiplied, and the second factor is the "number of times", whilst in other languages (such as German), it is the opposite.

Variation in the Formulation of a Mathematical Task
When students tackle a mathematics item, their answers are always influenced by the formulation of the item itself. Much research in mathematics education has studied the influence of the formulation specifically in mathematics word problems (e.g., [27]). Even minor changes in the formulation of a problem can affect students' answers. Many previous studies have already proven that the effects of variations in a task formulation are not simply related to linguistic formulation (in the case of word problems) but also to other variables such as data, context, and the operation involved. Nescher [28] proposed three categories of possible variations in a word problem: logical (operations involved, or lack or abundance of data), syntactic (number of words of the text, position of the question), and semantic (contextual relations and implicit suggestions). Duval [29] classified all these modifications as redactional variables, which influence students' cognitive and operative processes. Laborde [30] used this term to also include non-verbal changes, such as modification of figures or the position of the figures in relation to the text. A recent literature review on this issue [31] considered how linguistic variations as well as other kinds of changes influence students' responses and problem-solving strategies [32][33][34]. Daroczy [31] listed three main components that can alter the difficulty of a task, i.e., "(1) the linguistic complexity of the problem text itself, (2) the numerical complexity of the arithmetic problem, and (3) the relation between the linguistic and the numerical complexity of a problem" (p. 348). We may consider that even in a purely arithmetic task, the formulation may link it to intuitive models, and the usual contexts of use of the operations, which may affect its complexity.
For the purposes of this paper, we used the same question, "What is the result of 4 × 0.5?", previously administered by Sbaragli [15] (p. 124) but transformed into a multiple-choice item. In addition to the original form, we also administered another form with the same, but reversed, factors (0.5 × 4). We hypothesise that this change decreases the numerical complexity of the arithmetic problem because, in contrast to the original form, it suggests performing the multiplication following the intuitive model of repeated addition, with no conflict with the result, which is indeed higher than the first factor. This hypothesis is related to the fact that the students of our sample are Italian and in Italian, the first factor is the quantity to be multiplied, and the second factor is the "number of times". Moreover, this change in item formulation might increase item functionality.
In other words, our hypothesis is that the first formulation activates the misconception to a larger extent, and that this activation is stronger in girls.
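The asymmetry between the two formulations can be illustrated with a minimal sketch of the repeated-addition model, assuming the Italian reading in which the first factor is the quantity and the second is the number of times. The function name `repeated_addition` is hypothetical and is introduced only for illustration; it is not part of the study's materials.

```python
from fractions import Fraction


def repeated_addition(quantity, times):
    """Add `quantity` to itself `times` times (Italian reading: quantity x times)."""
    # The intuitive model only makes sense for a natural number of repetitions.
    if not isinstance(times, int) or times < 0:
        raise ValueError("the repeated-addition model needs a natural number of times")
    total = Fraction(0)
    for _ in range(times):
        total += quantity
    return total


half = Fraction(1, 2)

# 0.5 x 4: the model applies, and the result is bigger than the first factor.
product_reversed = repeated_addition(half, 4)

# 4 x 0.5: "4 taken half a time" has no repeated-addition reading,
# and the true product (2) is smaller than the first factor (4).
try:
    repeated_addition(Fraction(4), half)
    model_applies = True
except ValueError:
    model_applies = False
```

In this reading, 0.5 × 4 is consistent with "multiplication makes the first factor bigger", whereas 4 × 0.5 conflicts with it, which is the contrast the two item versions are designed to probe.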

Misconceptions and Decimal Numbers
During the early years of primary school, students learn natural numbers, their properties, and how to operate with them. The introduction of rational numbers is a complex phase and many difficulties emerge, primarily because rational numbers can be represented using different semiotic registers (e.g., fractions, decimal numbers, graphic representations). The literature shows that when students begin operating with decimal numbers, they have to overcome many obstacles [35,36].
The word misconception has been used with different meanings in the educational field [35], often as a synonym of "mistake" or "misunderstanding". Brousseau [37] linked misconceptions to the concept of "obstacle": during the formation of a mathematical concept, one idea that was useful earlier for solving problems can become an obstacle if students extend this idea to new problems where it is inappropriate. The mistake is due not to a lack of knowledge but to a previous knowledge that is incorrect in a more general context. When students study a new concept, they create an "intuitive model" of this concept [38] based on their primary experiences, but this model could be closer to the previous (more elementary) concept learned by the students in the past than to the complete mathematical concept, thus misleading students' problem-solving strategies.
When students learn natural numbers, they also learn properties, algorithms, and operations and, on this basis, create intuitive models of these concepts. Misconceptions related to this transition emerge, for example, when students compare decimal numbers and state that 0.12 is bigger than 0.2 just because 12 is bigger than 2 [15]: in this case, students compare the decimal part of the two numbers as if they were natural numbers. Moreover, they often do not consider 0.2 as 0.20 because, also in this case, they are influenced by the idea (correct in natural numbers but wrong with decimals) that adding zero at the end of the number is equal to multiplying it by 10.
The premature creation of intuitive models, indeed incomplete, and the persistence of these models leads students to make mistakes and generate "parasite" models [39]. In this paper, we adopt the following definition of misconception: a concept which is temporarily incorrect, awaiting re-elaboration in a more elaborated and critical cognitive system [39,40]. We focus on the misconception related to decimal numbers, according to which the result of a multiplication is always bigger than factors multiplied. This misconception has been widely studied in the literature and is usually called "multiplication makes bigger" [16,35,41]. It refers to the premature formation of a conceptual (intuitive) model of multiplication when students operate exclusively with natural numbers. When students learn multiplication, they use natural numbers, and then they observe that the product of two numbers (excluding 0 and 1) is always greater than its factors. This leads them to believe that the "rule" that "multiplication makes bigger" applies to both natural numbers and decimal numbers, although this is not actually true.
D'Amore and Sbaragli [35] interviewed students of different grades asking them "What is the result of 4 × 0.5?". The same question addressed to students attending primary, lower intermediate, and even secondary school was answered in a similar manner (i.e., 8) confirming that the mistake is due to the persistency of the misconception explained above. Our hypothesis is that we can correct the misconception by saying "multiplication makes the first factor bigger" (of course, in the Italian system).

Research Questions
Gender differences in mathematics test performance are explained in many studies by social and cultural factors (e.g., [3]) but also by metacognitive factors, such as a higher level of mathematics anxiety for girls and less self-confidence (e.g., [1]). These factors are also strictly related to the classroom environment, and previous studies based on INVALSI data showed that girls are more influenced by didactic practices, classroom routines, and the teacher-student relationship than boys, which makes them more prone to the (mis)leading effect of misconceptions and didactic contract [17][18][19][20][21][42].
A recent study argued that girls have more difficulties in solving items in which there is the influence of misconceptions on decimal numbers [17,19]. In particular, analysis of items that required comparison between decimal numbers showed that, when students work with decimal numbers with the same integer part (for example, 80.12 and 80.2), girls are more likely than boys to compare directly the decimal part of the two numbers and state that 80.12 is bigger than 80.2, probably considering that 12 is bigger than 2, rather than lower than 20.
Following these results, in this research, we study the previously described misconception, according to which the result of a multiplication is always bigger than its factors [15,16,41]. In order to explore this phenomenon, we compared two versions of the same stem-item; the first formulation was studied previously by Sbaragli in 2012 [15] via qualitative methods (Table 1). In the second formulation, we simply reversed the order of the factors, in order to quantify the possible effect of the misconception described above from a gendered perspective. This variation was implemented in order to understand whether the misconception "multiplication makes bigger" is connected with both factors of the product (the result is bigger than both factors) or mostly to one of the two factors (i.e., the result is bigger than the first factor).
Our research questions are:
1. Does the misconception "multiplication always increases" have a different influence on boys and girls in terms of differential item functioning?
2. Does reversing factors (e.g., 4 × 0.5 in place of 0.5 × 4) have an impact on students' answers and item functionality?
3. Does this variation have a different effect on boys as compared with girls?

Data
A probability sample (2000 students attending grade 8), stratified by students' region of residence and socioeconomic (SES) background, was drawn from the entire list of schools located in Campania, Emilia-Romagna, Lazio, and Lombardy (four regions very representative of students' ability in the south, centre, and north of Italy, according to INVALSI national surveys; see Figure 1). After data cleaning, the sample size equalled 1647 students, a number consistent with the Rasch equating design (roughly 400 students per form [43]). To measure students' SES, the SC-index [45], based on the combination of highest parental education and professional status, was used. Individual SES data were aggregated at school level to measure overall school SES composition. The proportion of low-, medium-, and high-SES schools in our sample is similar to that in the annual INVALSI sample [44].

Materials
Using a mathematics achievement test developed by INVALSI as a starting point, three more achievement tests were developed. An experimental design was employed: all mathematics tests contained the same stem-items, i.e., items with the same mathematical content and the same question intent, but with a different formulation from one test to another. Item phrasing was modified by means of syntactic variations, different figures, the effect of mathematics, and/or real context.
In this paper, we analyse item D9, included in booklets F1 and F2 with the same formulation, and in F3 and F4 with altered phrasing. The variation was developed to explore the effect of the misconception about multiplication with decimal numbers. Table 2 shows the composition of each of the four booklets and the name of each item, which reports the year of its inclusion in the INVALSI tests; for the varied forms, we added "_original" or "_v" to specify in which booklet we included the original form or a varied form (a different grey scale in the same row indicates different versions of a stem-item). In the first column, we indicate with "A" and "Anch" the items used to perform two anchoring strategies: (1) A first set of anchoring items was put at the beginning of the test in order to avoid the fatigue effect and offer an external anchoring strategy; these items are indicated by "A-". (2) A second set of anchoring items was included in the achievement test (in the same position across tests) and used as an internal anchor, indicated with "Anch-" (see Appendix A).

Table 2. Composition of the four booklets. Columns: Item, Booklet 1, Booklet 2, Booklet 3, Booklet 4. Note: items in white and labelled with "A" or "Anch" are anchor items; grey items labelled with "D" are items included with different formulations.

Analytic Strategy
The four mathematics achievement tests developed for the purposes of our research (named F1, F2, F3, and F4, respectively) were administered by means of a spiralling process (according to which different forms are administered to different students within each classroom) to randomly assign forms to students in the same classroom. Regarding the spiralling administration process: "When using this design, the difference between group level performance on the two forms is taken as a direct indication of the difference in difficulty between the forms" [43] (p. 13) and thus is sufficient to render answers given by different subgroups of students comparable.
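A spiralled administration of this kind can be sketched as follows. This is a minimal illustration, assuming that forms are handed out in rotating order down the class list; the function `spiral_forms`, the `start` offset, and the class of 22 students are hypothetical, and the exact procedure used in the study is not described in the text.

```python
from collections import Counter

FORMS = ["F1", "F2", "F3", "F4"]


def spiral_forms(student_ids, forms=FORMS, start=0):
    """Assign forms in rotating order down the class list (spiralling)."""
    return {
        student: forms[(start + position) % len(forms)]
        for position, student in enumerate(student_ids)
    }


# Hypothetical classroom of 22 students.
classroom = [f"student_{i:02d}" for i in range(1, 23)]
assignment = spiral_forms(classroom)

# Each form is used by (almost) the same number of students in the class.
counts = Counter(assignment.values())
```

Because every classroom contributes to every form in nearly equal numbers, the groups answering F1 to F4 can be treated as comparable, which is what the quoted passage from [43] relies on.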
Nonetheless, to guarantee the comparability of answers provided by different students to the different versions of the same item (item D9), we scaled all students and all items from each achievement test along the same latent trait (i.e., mathematics ability) by anchoring our four mathematics achievement tests and then by equating them [43]. The process of equating is used in situations where scores earned on different forms need to be compared to each other. Within the Rasch framework [46], the process of equating forms is used to construct a common scale and thus to put both students and items along the same latent trait, making them directly comparable.
In a recent study, Kopf, Zeileis, and Strobl [47] (p. 84) claimed that "The minimum (necessary but not sufficient) requirement for the construction of a common scale in the Rasch model is to place the same restriction on the item parameters in both groups [48]. The items included in the restriction are termed anchor items". Since the statistical power of anchoring increases with the length of a DIF-free (i.e., showing no differential functioning depending on students' features [49]) anchor [50][51][52], we used two sets of anchor items: the first set of (eight) anchoring items was put at the beginning of the test in order to avoid a fatigue or learning effect, and then used for external anchoring; the other (eight) anchoring items were interspersed through the test (in the same position across tests) and used as an internal anchor. The first and the second set of anchor items were used as external and internal anchors in two separate calibration processes. Results from these anchoring strategies are consistent. Finally, both sets of items were used together to perform a concurrent calibration to equate the tests by using RUMM2030.
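The role of anchor items in building a common scale can be illustrated with a simple mean-shift linking between two separately calibrated forms. This is a swapped-in, simpler technique shown only to convey the idea; it is not the concurrent calibration performed in RUMM2030. The function `linking_constant`, the item labels, and all difficulty values below are hypothetical.

```python
from statistics import mean


def linking_constant(anchor_ref, anchor_new):
    """Shift that moves the new form's logit scale onto the reference scale."""
    common = sorted(set(anchor_ref) & set(anchor_new))
    if not common:
        raise ValueError("no common anchor items")
    # Anchor items should have the same difficulty on both forms,
    # so their average discrepancy estimates the scale shift.
    return mean(anchor_ref[item] - anchor_new[item] for item in common)


# Hypothetical anchor difficulties (logits) from two separate calibrations.
ref = {"A1": -1.20, "A2": -0.40, "A3": 0.30, "A4": 1.10}
new = {"A1": -0.95, "A2": -0.15, "A3": 0.55, "A4": 1.35}

shift = linking_constant(ref, new)

# A non-anchor item of the new form expressed on the reference scale.
d9_new_scale = 0.80  # hypothetical difficulty on the new form's own scale
d9_ref_scale = d9_new_scale + shift
```

If an anchor item showed DIF, its discrepancy would distort the shift, which is why a DIF-free anchor is required before versions of D9 are compared.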
Having equalised the tests, a differential item functioning (DIF) analysis within the framework of the Rasch analysis was carried out to understand if, and how, misconception affects boys and girls differently.
The Rasch model is particularly suitable to pursue these aims as it rests on three assumptions: (i) local independence (i.e., people's reactions to each item are independent from their reactions to all the other items); (ii) equal item discrimination (i.e., higher ability respondents are more likely to encounter each item successfully); and (iii) unidimensionality (i.e., a single common trait explains the item responses). To assess data-model fit, we preliminarily explored infit and outfit statistics, i.e., "mean-square fit statistics defined such that the model-specified uniform value of randomness is indicated by 1.0 [53]" (p. 9), with tolerable standard deviations around 0.20 [54]. Nonetheless, in line with previous studies (e.g., [55]) we took 1.3 as the value for infit and outfit mean squares that suggests cause for concern.
When these properties hold, Rasch parameters are invariant, i.e., they do not change across sub-groups of students with the same level of ability. In contrast, violation of parameter invariance may be discovered by investigating the so-called differential item functioning (DIF; e.g., [56]). DIF occurs when subjects matched on the same ability level have a different probability of encountering an item successfully. DIF refers to each single item and to item behaviour in a sub-group of students matched on ability and clustered by one personal student attribute (gender, in this study).
RUMM2030 compares the item response function (IRF), which links the probability of a correct answer to student ability, for boys and girls separately. In fact, when a statistically significant DIF occurs, measurement invariance is violated, and thus, "different item characteristics curves occur in subgroups" of students [47] (p. 83). In addition, we compared distractor response curves (DRC) drawn for boys and girls, separately. Distractor analysis is very informative because it provides a visual interpretation of response patterns for the set of distractors associated with each multiple-choice item. It allows examination of whether the differential selection of incorrect choices (distractors) attracts various groups in different ways (i.e., if any pattern is present in the proportion of responses across the different class intervals for each distractor against the IRF), thus identifying potential sources of construct-irrelevant variance. Moreover, by comparing each distractor function, it was possible to examine whether variables other than a student's ability affect the content of only a single or all distractors. Finally, to assess the statistical significance of gender differences, we reported on a factorial analysis of variance (ANOVA) on the class interval (factor 1) and person-level factors (factor 2).
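The fit statistics mentioned above can be sketched for a single dichotomous item. The snippet uses the standard dichotomous Rasch probability and the usual definitions of the infit (information-weighted) and outfit (unweighted) mean squares; the abilities and responses are hypothetical, and the study's own estimates come from RUMM2030, not from this code.

```python
import math


def rasch_probability(theta, b):
    """Probability of a correct answer for ability theta and difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_fit(responses, thetas, b):
    """Return (infit, outfit) mean squares for one dichotomous item."""
    sq_resid, variances, z_sq = [], [], []
    for x, theta in zip(responses, thetas):
        p = rasch_probability(theta, b)
        var = p * (1.0 - p)                # model variance of the response
        sq_resid.append((x - p) ** 2)
        variances.append(var)
        z_sq.append((x - p) ** 2 / var)    # squared standardised residual
    infit = sum(sq_resid) / sum(variances)  # information-weighted mean square
    outfit = sum(z_sq) / len(z_sq)          # unweighted mean square
    return infit, outfit


# Hypothetical person abilities (logits) and scored responses to one item.
thetas = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 2.5]
responses = [0, 0, 1, 0, 1, 1, 1, 1]

infit, outfit = item_fit(responses, thetas, b=0.0)

# Threshold adopted in the text: values above 1.3 suggest cause for concern.
flagged = infit > 1.3 or outfit > 1.3
```

Both statistics have an expected value of 1.0 under the model; values well above it indicate more randomness in the responses than the model predicts.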
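The quantities compared in such a DIF analysis can be sketched as mean standardised residuals per class interval and gender. This is a minimal illustration, assuming the ANOVA operates on standardised person-item residuals; it stops at the cell means and does not reproduce RUMM2030's procedure or its F tests. The function `residual_cell_means`, the cut points, and the records are hypothetical.

```python
import math
from collections import defaultdict


def rasch_probability(theta, b):
    """Probability of a correct answer under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def residual_cell_means(records, b, cut_points):
    """Mean standardised residual per (class interval, gender) cell.

    records: iterable of (theta, gender, response), response in {0, 1}.
    cut_points: ascending ability thresholds separating the class intervals.
    """
    cells = defaultdict(list)
    for theta, gender, x in records:
        p = rasch_probability(theta, b)
        z = (x - p) / math.sqrt(p * (1.0 - p))
        interval = sum(theta > c for c in cut_points)  # 0 = lowest interval
        cells[(interval, gender)].append(z)
    return {cell: sum(values) / len(values) for cell, values in cells.items()}


# Hypothetical records: (ability in logits, gender, scored response).
records = [
    (-1.2, "boy", 1), (-1.0, "girl", 0), (-0.8, "boy", 0), (-0.9, "girl", 0),
    (0.1, "boy", 1), (0.2, "girl", 1), (0.0, "boy", 1), (-0.1, "girl", 0),
    (1.3, "boy", 1), (1.1, "girl", 1), (1.5, "boy", 1), (1.2, "girl", 0),
]

means = residual_cell_means(records, b=0.0, cut_points=[-0.5, 0.5])
```

A gap between boys' and girls' means that has the same sign in every interval corresponds to a gender main effect (uniform DIF), whereas a gap that changes across intervals corresponds to the interaction term (non-uniform DIF).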

Results
The following two sections report on DIF analysis by gender and on interpretation of output.

Differential Item Functioning by Gender
After having verified the goodness of data model fit (see results in the Appendix A), the item parameter estimate indicates the relative difficulty of item D9 and thus its location along the latent trait, a graded continuum where zero indicates a medium difficulty level, and thus, negative values indicate relatively easy items, whereas positive values indicate relatively more difficult items.
We provided a visual display of the set of observed means for each person level factor (i.e., for boys and girls) across each of the class intervals present in the item-trait test-of-fit specifications. Each level is plotted in relation to the item characteristic curve, i.e., the theoretical curve estimated by the Rasch model, according to which no factor other than students' intrinsic ability can affect the probability of encountering an item successfully. Finally, we reported the distractor plots drawn for boys and girls, separately, in order to explore their answer behaviour in relation to each answer option.
Boys outperform girls on the item formulation in booklet F1 (4 × 0.5), especially at the upper tail of the latent trait (i.e., among more talented students; Figures 2 and 3), although the analysis shows both a non-significant interaction (p = 0.739) and a non-significant gender main effect (p = 0.148) (α = 0.05; see Appendix A). Distractor analysis shows that students approach this item differently by gender: option B is more attractive for low-ability boys than for girls. Little difference can be found regarding options A and D.
Educ. Sci. 2021, 11, x FOR PEER REVIEW
[Figure: ICC plots and distractor plots of boys (bottom left) and girls (bottom right), item D9, booklet F1. Source: our elaboration. Note: the figures above are the ICC plots for boys (on the left) and girls (on the right); below are the distractor plots for boys (on the left) and girls (on the right).]
These differences are more notable in F2 (Figures 4 and 5), with a clear advantage of boys over girls, especially at the upper end of the latent trait, with a statistically significant main gender effect (p = 0.001) (α = 0.05; see Appendix A). Nevertheless, response patterns for the set of distractors associated with D9 administered in booklet F1 and D9 administered in booklet F2 are naturally similar. The main difference relates to distractor D, which is much more attractive for girls (especially at the bottom of the ability distribution) than for boys. Moreover, high-ability boys are not attracted by any distractor.
The analysis of answers to item D9 administered in booklet F3 confirms an overall advantage of boys over girls (Figures 6 and 7), statistically significant in relation to the main gender effect in F3 (p = 0.027) and in relation to the interaction in F4 (p = 0.049) (α = 0.05; see Appendix A). The differences between boys and girls located at the bottom of the ability distribution are particularly interesting (Figure 8): from −1.5 to −0.5 logit, all differences are in favour of boys, as also partially confirmed by the analysis of F4. In both cases, distractor analysis shows interesting dissimilarities (Figure 9). The most interesting differences between boys and girls can be observed in F4. Option B is slightly more attractive for boys than for girls, while option D is much more attractive for low-ability girls.

The Interpretation of Empirical Results from a Didactic Point of View
The results reported above show interesting group differences. The graphical exploration and comparison of the ICCs and distractors, together with the comparison of item difficulty estimated for boys and girls, provide elements that help to answer our research questions: the misconception lowers the probability of correctly answering an item that explores students' ability to multiply decimal numbers, and it affects girls more negatively than boys. The first item formulation (4 × 0.5) reveals, both in F1 and F2, an advantage for boys, especially at the bottom and top of the ability distribution.

Comparison between the two versions of the task reveals that the order of the factors in multiplication has a strong influence on students' answers. In particular, if the multiplication posits the decimal number as the second factor (4 × 0.5), the task is more difficult than if the decimal number is presented first (0.5 × 4). This might be due to the fact that students are influenced by the intuitive model [38] of multiplication as a repeated sum, and in the second form, it is more immediate, for Italian students, to consider 0.5 × 4 = 0.5 + 0.5 + 0.5 + 0.5. Therefore, the main finding is that the inversion of the two terms of a multiplication has a huge impact on students' behaviour, especially on girls: a stronger gender gap emerges in favour of boys in the first version (4 × 0.5) and in favour of girls in the second version (0.5 × 4) for the lower tail. This result is coherent with the fact that in the second version, it is easier, especially for struggling students, to tackle the task using the implicit model of multiplication as repeated addition, and this particularly helps girls of lower-ability levels.
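As a worked illustration of the reasoning above (our own sketch, not a reproduction of the item as it appeared in the booklets), the two orders of the factors can be written side by side. It assumes, as stated above for Italian students, that the first factor is read as the quantity being repeated.

```latex
\[
0.5 \times 4 = 0.5 + 0.5 + 0.5 + 0.5 = 2,
\qquad
4 \times 0.5 = 2 < 4 .
\]
```

Under this reading, 0.5 × 4 maps directly onto repeated addition, whereas 4 × 0.5 would require "adding 4 half a time", which the repeated-addition model cannot represent. In the latter form the product is also visibly smaller than the first factor, which conflicts with the expectation that "multiplication makes bigger".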

Limitation of the Present Study
The DIF analysis revealed gender differences that were not always statistically significant. This could be a limitation of the present study, because the results presented in this paper cannot be generalised to the entire student population. Nonetheless, it is worth noting that the Rasch model is based on the assumption that the probability of answering an item successfully is related to students' relative ability, i.e., their ability compared with item difficulty, and that no other variables (e.g., students' individual features) can affect it. Therefore, even though a moderate item misfit need not be interpreted as a limitation (of the test or even of the choice of the model) and can instead be read as a potential source of information (as recently argued in [57]), test items are constructed by INVALSI to be DIF-free. Similarly, the materials we developed for the purposes of the present research were constructed to be DIF-free, with just a few exceptions aimed at testing specific hypotheses about how gender interplays with item characteristics. Nonetheless, only three items were constructed to explore gender differences (D9, D15, and D16, aimed at exploring misconceptions or the effect of the item's context, i.e., real or mathematical, on students' solving strategy). The absence of a statistically significant DIF is thus an unavoidable and inherent consequence of the tests' construction process.
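The Rasch assumption recalled above can be made concrete with a minimal sketch. It uses the standard dichotomous Rasch formula, in which the probability of a correct answer depends only on the difference between ability and item difficulty. The function name and all numerical values are hypothetical and are not estimates from this study; the two "group" difficulties only illustrate what DIF would look like for students matched on ability.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """P(correct) under the dichotomous Rasch model; both arguments in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values, chosen only for illustration (not estimates from this study).
ability = -1.0             # the same ability level for both groups
difficulty_group_a = 0.2   # item difficulty as estimated in group A
difficulty_group_b = 0.6   # the same item as estimated in group B

p_a = rasch_probability(ability, difficulty_group_a)
p_b = rasch_probability(ability, difficulty_group_b)
print(f"same ability, group A: {p_a:.3f}; group B: {p_b:.3f}")
```

If the model held without DIF, the two difficulties would coincide and students of equal ability would have the same probability of success regardless of group.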
Moreover, even though our analysis revealed some differences in students' answers to the item in F1 and F2, and to the item in F3 and in F4, it is worth noting that results between F1 and F2 are consistent, as are those between F3 and F4, thus supporting our results' interpretations about the diverse effect of the misconception analysed in this paper on girls' and boys' answers.
Results presented in this paper showed that traditional psychometric tools, and in particular the graphical inspection of the ICCs and of distractor plots, are extremely valuable in exploring in-group differences, since all the graphs compare students matched on ability. Moreover, in this research, such graphs were constructed after having equated mathematics achievement tests, thus making students' answers directly comparable. Working within the framework of the Rasch analysis is an added value of the present study: the equating strategy performed here guarantees the comparability of students' answers across mathematics achievement tests and across sub-groups of students (whichever way they are defined), thus offering a methodological approach that can be used also to pursue other research goals.
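The idea behind a distractor plot, as described above, can be sketched in a few lines: students are grouped into ability bands and, within each band, the proportion choosing each response option is computed. This is only an illustrative sketch, not the software used to produce the figures in this paper; the function name, the band cut points, and the toy records are all hypothetical.

```python
from collections import Counter, defaultdict


def distractor_proportions(records, band_edges):
    """Proportion of students choosing each option within each ability band.

    records: iterable of (ability_in_logits, chosen_option) pairs.
    band_edges: sorted cut points; band k holds abilities with exactly k edges <= ability.
    Returns {band_index: {option: proportion}}.
    """
    counts = defaultdict(Counter)
    for ability, option in records:
        band = sum(ability >= edge for edge in band_edges)
        counts[band][option] += 1
    return {
        band: {opt: n / sum(c.values()) for opt, n in sorted(c.items())}
        for band, c in sorted(counts.items())
    }


# Hypothetical toy data (ability, chosen option); not data from this study.
toy_records = [(-1.2, "D"), (-1.0, "D"), (-0.8, "A"), (0.3, "A"), (0.9, "A"), (1.4, "B")]
toy_edges = [-0.5, 0.5]
toy_result = distractor_proportions(toy_records, toy_edges)
for band, proportions in toy_result.items():
    print(band, proportions)
```

Running the same function separately on the records of each subgroup gives curves that compare students matched on ability, which is what makes the graphical comparison informative.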
Our analyses showed a different effect of a specific misconception (related to multiplication with decimal numbers) on boys' and girls' answering behaviour. The misconception investigated here was already studied from a qualitative point of view by D'Amore and Sbaragli [35]. Consistently with Sbaragli [15], our results showed that girls' difficulty in multiplying decimal numbers is due to the misconception, as also confirmed by distractor D, which is strictly related to the misconception and is more attractive for girls than for boys.
The inversion of multiplication factors misleads students' answers, with a stronger influence on girls than on boys. Moreover, compared to previous studies about students' misconceptions in multiplying decimal numbers, the use of the Rasch model adds some advantages to the investigation of this topic. Firstly, if DIF is detected, the results can be interpreted in terms of which items are easier or harder to solve for which group [47]. This offers interesting elements to enrich the debate from a didactic point of view: previous studies carried out in Italy have shown that girls are more influenced by didactic practices, classroom routines, and the teacher-student relationship than boys, and that this makes them more prone to the (mis)leading effect of misconceptions and didactic contract [16,38]. Moreover, such strong differences between boys and girls at grade 8 in Italy are quite unusual: as systematically reported by INVALSI in its national annual reports, gender differences increase over time, from primary to secondary school, but at grade 8, they tend to be close to zero (e.g., [44]). Understanding such a result deserves much more investigation that is beyond the scope of this study.
Results presented in this paper help us to explain why and how the exploration of the gender gap at item level, rather than across the entire test, can contribute further information to the current debate about gender differences. In this direction, for example, Leder and Lubienski [14] (p. 35) stated that: "Item-level analyses can pinpoint the mathematics that students do and do not know, including which problems most students can and cannot solve, and which problems have the largest disparities between groups. This information can inform both textbook writers and teachers, as they strive to address curricular areas in need of additional attention. Hence, it is important for item-level analyses to be systematically conducted and reported."
In this paper, we combined traditional psychometric tools with the theoretical lens of mathematics education to test specific hypotheses about students' problem-solving strategies. This comparison, based on a large probability sample of 1647 grade 8 students, drew on the analysis of students' answers to four anchored mathematics tests developed for the specific purposes of the present study. A common-item non-equivalent group design was employed to collect data, and all forms were equated so that answers from the different subgroups of students could be compared: "When using this design, the difference between group level performance on the two forms is taken as a direct indication of the difference in difficulty between the forms" [39] (p. 13).
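One simple way to place two separately calibrated forms on a common scale through their shared items is a mean shift on the anchor-item difficulties. The sketch below illustrates that general technique only. It is an assumption made for illustration and not necessarily the equating procedure applied in this study; the function names, item labels, and difficulty values are hypothetical.

```python
from statistics import mean


def mean_shift_constant(anchor_reference, anchor_new):
    """Shift that moves the new form onto the reference scale.

    Both arguments map item labels to difficulty estimates (logits) obtained
    from separate calibrations; only items present in both are used.
    """
    common = sorted(set(anchor_reference) & set(anchor_new))
    if not common:
        raise ValueError("the two forms share no common items")
    return mean(anchor_reference[i] for i in common) - mean(anchor_new[i] for i in common)


def to_reference_scale(estimates, shift):
    """Apply the shift to every estimate from the new form."""
    return {label: value + shift for label, value in estimates.items()}


# Hypothetical difficulties; A1-A3 are anchor items, X1 appears only in the new form.
reference_form = {"A1": -0.40, "A2": 0.10, "A3": 0.60}
new_form = {"A1": -0.10, "A2": 0.40, "A3": 0.90, "X1": 0.75}

shift = mean_shift_constant(reference_form, new_form)
equated = to_reference_scale(new_form, shift)
print(round(shift, 3), {k: round(v, 3) for k, v in equated.items()})
```

After the shift, the anchor items have the same mean difficulty on both forms, so the remaining differences between forms can be read on a single scale.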
The combination of traditional psychometric tools with the theoretical lens of mathematics education, an unprecedented strategy for the exploration of gender differences, adds real value to the current debate because it provides critical information about boys' and girls' performances and hence suggests research paths about their problem-solving strategies. Gender differences emerge on specific mathematics content, and these results are consistent with the current literature on gender differences in mathematics: many studies highlight that differences between boys and girls can be explained by a different use of learning and problem-solving strategies rather than by differences in cognitive abilities. In problem-solving activities in mathematics, for instance, girls more frequently use routine procedures and well-known algorithms, while boys are more inclined to try new methods and non-conventional approaches [58][59][60]. The analysis of gender differences in items related to specific difficulties and constructs already studied in mathematics education research could also be fruitful for teachers. The more we investigate and understand these differences, the more opportunities teachers will have to intervene with specific didactical activities. In particular, regarding misconceptions, teachers must be aware of the distinction between avoidable and unavoidable misconceptions [15]. The former are linked to didactical practices and teachers' choices; the latter are not due to didactical transposition but are temporary, non-exhaustive ideas arising from the necessary gradual introduction of new mathematical knowledge. In this paper, we compared two versions of the same item with the purpose of analysing a specific misconception concerning multiplication with decimal numbers.
Misconceptions related to decimal numbers are considered unavoidable misconceptions: they arise from the fact that students learn mathematical operations in the field of natural numbers. Teachers must be conscious of students' difficulties in the transition between natural and rational numbers: they need to ensure that ideas related to mathematical operations in natural numbers do not become "parasite" models [15] when students have to face the same operations with rational numbers. In particular, this study suggests that teachers pay special attention to girls, because they are more influenced by these misconceptions. In our task, we observe that differential item functioning is related to the misconceptions and intuitive models of multiplication used by students, but the influence of these factors is different for boys and girls. This is further confirmed by the variation in item functionality due to variation in item formulation: 0.5 × 4 favours lower-ability girls by offering an "easier" formulation which activates a routine procedure (the intuitive model of multiplication as repeated addition).
This paper contributes in the direction indicated by [61]: a theoretically driven interpretation of macrophenomena highlighted quantitatively by Large-Scale Assessments may help in clarifying solid findings in Mathematics Education.