Systematic Review

Can Generative Artificial Intelligence Effectively Enhance Students’ Mathematics Learning Outcomes?—A Meta-Analysis of Empirical Studies from 2023 to 2025

Faculty of Education, Shaanxi Normal University, Xi’an 710062, China
*
Author to whom correspondence should be addressed.
Educ. Sci. 2026, 16(1), 140; https://doi.org/10.3390/educsci16010140
Submission received: 9 December 2025 / Revised: 10 January 2026 / Accepted: 14 January 2026 / Published: 16 January 2026

Abstract

Generative artificial intelligence (GenAI) shows transformative potential in mathematics education. However, empirical findings remain inconsistent, and a systematic synthesis of its effects across distinct engagement dimensions is lacking. This preregistered meta-analysis (INPLASY2025110051) systematically reviewed 22 empirical studies (46 independent samples, N = 5232) published between 2023 and 2025. The results indicated that GenAI has a moderate positive impact on students’ mathematics learning outcomes (g = 0.534). Moderation analysis further revealed that the level of GenAI integration in teaching, sample size, and learning content are the primary factors influencing this effect. The study found that the effect was most pronounced under the creative transformation (CT) integration mode, was significant when applied to geometry learning, and was stronger in studies with small samples or small class sizes; collaborative learning approaches also significantly enhance these mathematics learning outcomes. By contrast, educational stage and intervention duration did not show significant moderating effects. The GRADE assessment indicated that while the overall evidence is supportive, the certainty of evidence is stronger for cognitive outcomes than for non-cognitive domains. The findings also offer a reference for future research on constructing a human–machine collaborative learning environment.

1. Introduction

Mathematics, as a fundamental discipline, occupies a central position in the cultivation of innovative talent. Its role as the foundation for advances in science, engineering and digital technologies has been widely recognised. However, the subject’s abstract nature and rigorous logic often present substantial challenges for students, giving rise to problems such as learning anxiety and reduced motivation (Passolunghi et al., 2020). With the arrival of the information age, contemporary mathematics education can no longer focus solely on knowledge transmission; it must also cultivate students’ critical thinking, creativity, and self-regulated learning. The emergence of GenAI has provided a new route for addressing these challenges and for realising this transformation of educational aims. GenAI tools generate text, step-by-step explanations, mathematical proofs and visual content, and they have transformed the potential for creating personalised, interactive mathematics learning environments (Kasneci et al., 2023; Wardat et al., 2023). This gives them transformative potential in mathematics education, supporting students’ active knowledge construction and providing scaffolding in collaborative learning. However, empirical evidence regarding the effectiveness of GenAI in mathematics education has been inconsistent. Some studies have affirmed its positive role in providing immediate feedback and personalised tutoring, which has enhanced students’ problem-solving abilities and learning motivation (Walkington, 2025). Conversely, other studies have reported that excessive reliance on GenAI can reduce students’ capacity for deep thinking and may negatively affect the long-term development of mathematical skills (Kim et al., 2024). Moreover, some empirical results have shown no significant effects.
Therefore, this study adopts a meta-analysis to investigate systematically the impact of generative artificial intelligence on students’ mathematics learning outcomes and to explore potential moderators, aiming to provide practical guidance for teaching practitioners and other professionals in mathematics education.

2. Literature Review

2.1. GenAI Applications in Mathematics Education

GenAI, a branch of artificial intelligence that creates new content, offers wide application prospects in mathematics education. By enabling natural-language interaction and content generation, it provides tools for personalised learning and complex problem solving, and its application scenarios span K-12 through higher education (Wardat et al., 2023). Research indicates that GenAI promotes cognitive skills through several mechanisms. For example, it can provide instant feedback, construct conceptual frameworks and deliver multimodal visualisations to help students grasp mathematical content (Qu et al., 2025). In higher education, undergraduates who use GenAI to support mathematical proof work deepen their conceptual understanding with personalised feedback and animated demonstrations (De Simone et al., 2025). In middle school, GenAI reduces students’ cognitive load when learning number theory and algebra, thereby improving task-completion efficiency (Polydoros et al., 2025). In primary school, it helps to make abstract concepts more concrete and thus facilitates mastery of basic knowledge. GenAI also exerts a positive influence on non-cognitive skills. Studies indicate that by offering personalised learning pathways and contextualised tasks, GenAI improves students’ learning emotions and motivation. For example, after using GenAI, primary pupils reported reduced mathematics anxiety and increased classroom participation (Yoon et al., 2024). X. Wang and Wei (2025) likewise found that it effectively lowered students’ mathematics anxiety, enhanced self-efficacy and increased academic engagement. For teachers, GenAI assists in lesson design through iterative interaction, thereby reducing their burden of preparation (Yanar & Ergene, 2025). It can also simulate students in conversational practice, providing pre-service teachers with opportunities for teaching rehearsal (L. Zhou, 2025). However, several studies have identified potential risks and challenges. 
For example, an experiment found that undergraduates who used GenAI in calculus tests scored lower than those in a traditional teaching group (Sánchez-Ruiz et al., 2023). Scholars argue that when GenAI supplies complete solutions, students may engage less in questioning and reflection, which undermines the development of critical thinking (Ali et al., 2024). Although immediate feedback can reduce cognitive load, it may also constrain deep thinking and the transfer of concepts to complex problems (Bastani et al., 2024). Moreover, GenAI can produce incorrect or incomplete explanations, causing systematic misunderstandings (Marzano, 2025), and prolonged use has been associated with reduced learning confidence and increased technological anxiety (Al-Smadi, 2023).

2.2. Meta-Analysis Evidence of the Impact of Artificial Intelligence on Students’ Learning

GenAI in education has attracted considerable attention, and several meta-analyses have examined its overall impact. Existing research indicates that GenAI positively influences academic performance, with most effect sizes ranging from moderate to large. For instance, Gu and Yan (2025) reported a moderate effect on academic achievement (g = 0.683) and found that teacher support significantly enhanced this effect. J. Wang and Fan (2025) documented a large effect on academic performance (g = 0.867), alongside moderate effects on learning perceptions and higher-order thinking. Notably, studies employing a cognitive objective framework suggest that the impact of GenAI varies across skill levels. Qu et al. (2025) demonstrated a stronger effect on lower-order cognitive skills (g = 0.926) than on higher-order cognitive skills (g = 0.640). Since learning is shaped by multidimensional factors, recent work has also begun to assess effects on both cognitive and non-cognitive skills simultaneously (Xia et al., 2025). However, most existing comprehensive syntheses span multiple disciplines and lack a specific focus on mathematics—a field characterised by its distinct structure of thinking and instructional logic.
In mathematics education, earlier meta-analyses have examined conventional artificial intelligence technologies, such as intelligent tutoring systems and adaptive learning systems, and reported small to moderate benefits for mathematics learning. For example, Yi et al. (2025) analysed 21 studies from 2000 to 2023 and showed that artificial intelligence produced a small positive effect on K-12 students’ mathematics performance (g = 0.343), with the effect moderated by learning content and grade level. S. Hwang (2022) likewise reported a small facilitating effect of AI on primary-school mathematics learning (g = 0.351). These findings provide a foundation for understanding technology-assisted learning. However, the novel features of GenAI, notably generativity and strong interactivity, suggest that its pedagogical integration and impact mechanisms may differ. Therefore, a focused, comprehensive evaluation of generative AI’s effects within the mathematics discipline is required, together with further exploration of the specific conditions and boundaries under which those effects arise.
We specifically examine three categories of moderating variables: (1) GenAI application features: the level of GenAI integration and the learning mode (e.g., independent or collaborative); (2) intervention settings: intervention duration and sample size; and (3) educational context: educational stage and learning content. By integrating existing evidence, we provide a focused, comprehensive review to clarify the role of GenAI in mathematics teaching. We incorporate both cognitive and non-cognitive outcomes within a single analytical framework to analyse systematically how the level of technology integration influences effects. We anticipate that this work will supply more detailed evidence for future research and offer practical insights for educators designing GenAI-supported mathematics learning environments. On this basis, the study addresses the following research questions through meta-analysis.
Q1: What is the overall effect of GenAI on students’ mathematics learning outcomes, distinguishing cognitive from non-cognitive skills and examining differences between higher- and lower-order cognitive skills?
Q2: To what extent is this relationship influenced by key moderating variables, including intervention duration, the level of GenAI integration, educational stage, learning content, and sample size?

3. Methods

3.1. Literature Search

This study strictly adhered to the PRISMA guidelines to ensure the methodological process was systematic and transparent. The literature screening procedure is shown in Figure 1 and comprised four consecutive stages: (1) identifying relevant records through systematic database searches; (2) conducting a preliminary screen of titles and abstracts; (3) performing full-text evaluation of potentially eligible studies; and (4) including studies that met the prespecified criteria (Page et al., 2021). Searches were carried out in major academic databases, including Web of Science, EBSCO, CNKI and Google Scholar. The search strategy centred on three concepts: generative artificial intelligence, the mathematics discipline and learning outcomes. Search terms were adjusted to match each database’s syntax and vocabulary features to balance recall and precision. The detailed search strategy appears in Appendix A Table A1. To maximise coverage, we also performed supplementary searches by tracing reference lists of included studies and following reviews in related fields.
The inclusion and exclusion criteria are presented in Table 1. Given that key applications of generative artificial intelligence emerged only at the end of 2022, many relevant empirical studies in education began to appear after 2023. This study restricts the retrieval period to studies published between 1 January 2023 and 30 September 2025. The remainder of this section outlines the detailed implementation of the research methods.
A total of 22 articles were ultimately included. To ensure their quality, we evaluated the studies using the Medical Education Research Study Quality Instrument (MERSQI). This instrument has good reliability and validity for assessing the quality of quantitative research in educational settings. It comprises 10 items across six key domains. The specific scoring rules are shown in Appendix A Table A2.

3.2. Data Encoding

Learning effectiveness serves as a critical metric for assessing educational quality and is conceptualised as a composite construct shaped by the synergistic influence of both cognitive and non-cognitive domains (Poynton, 2015; B. Wu et al., 2023). Guided by established assessment frameworks (e.g., Bloom’s taxonomy for cognitive objectives and the taxonomy for affective domains), this study operationalises students’ learning outcomes along two dimensions: cognitive skills and non-cognitive skills.
Cognitive skills encompass the intellectual processes engaged in mathematics learning. Following the revised Bloom’s Taxonomy (Adams, 2015), they are categorised into lower-order skills (remembering, understanding, applying) and higher-order skills (analysing, evaluating, creating). Non-cognitive skills refer to the emotional, motivational, and self-regulatory factors that, while not directly involving specific knowledge acquisition, substantially influence the learning process and its outcomes. These include mathematics anxiety, self-efficacy, learning motivation, and metacognitive strategies (Marzano, 2025).
This classification reflects a basic understanding of the complexity of learning mathematics. Learning outcomes are shaped not only by cognitive processes but also by non-cognitive factors. Moreover, non-cognitive skills such as emotion and motivation indirectly influence final learning outcomes by affecting cognitive engagement and the use of strategies (Kumar et al., 2025). This perspective is consistent with existing meta-analyses, which indicate that a comprehensive assessment of the impact of educational technology must consider both cognitive and non-cognitive dimensions (Xia et al., 2025).
Accordingly, the coding framework categorises learning outcomes into cognitive and non-cognitive skills. Cognitive skills are further subdivided into higher- and lower-order skills, a structure consistent with Xia et al. (2025).
Based on established meta-analysis practices (e.g., Gu & Yan, 2025; J. Wu et al., 2025), a systematic classification of moderating variables, and preliminary pilot coding, this study constructed a systematic coding framework. Two coders independently implemented the coding process, adhering to a clearly defined, comprehensive and transparent protocol (Pigott & Polanin, 2020). The coding scheme comprised two main components. The first recorded basic document information, including document ID, authors, publication year and country/region. The second comprised feature coding to extract key variables such as intervention duration, educational stage, GenAI integration level, sample size, effect size, learning content and research methods.
Intervention duration was initially classified, following J. Wu et al. (2025), as short-term (≤1 semester) and long-term (>1 semester). However, because GenAI technologies emerged only recently, there are few intervention studies lasting more than one semester; most existing studies examine interventions of 1 to 8 weeks. Consequently, this study instead adopts the dichotomy of Stephenson (2022), classifying interventions as short-term (≤1 month) or long-term (>1 month).
Educational stages are categorised into three distinct groups: primary, secondary, and tertiary education (Bartolini et al., 2025). Based on Piaget’s cognitive-development theory, students at different stages exhibit fundamental differences. Primary school students, typically in the concrete operational stage, rely on concrete and imaginative thinking. Secondary school students, transitioning into or within the formal operational stage, develop logical thinking and problem-solving abilities. In contrast, university students, having fully entered this stage, apply professional abstract reasoning within specific disciplines (Gray, 1975). Consequently, these three stages also differ markedly in their knowledge systems and learning objectives. Primary education emphasises foundational concepts and basic skills, secondary education focuses on systematic knowledge and logical structuring, while tertiary education prioritises professional modelling, theoretical proof, and higher-order application (Ghazi et al., 2016). This three-stage categorisation effectively reduces heterogeneity arising from cognitive and contextual disparities, clarifies the functional patterns of GenAI across different educational settings, and thereby enhances the interpretability and practical utility of the meta-analysis results.
The integration of GenAI in teaching is frequently analysed using the PIC–RAT model (Borup et al., 2022). This model establishes a two-dimensional framework: the vertical axis (PIC) describes the relationship between students and technology and comprises three modes—passive, interactive and creative—while the horizontal axis (RAT) indicates the technology’s impact on teachers’ methods and comprises three levels—replacement, enhancement and transformation. Crossing the two axes classifies the degree of GenAI integration in teaching into nine modes: passive replacement, passive enhancement, passive transformation, interactive replacement, interactive enhancement, interactive transformation, creative replacement, creative enhancement, and creative transformation (Borup et al., 2022).
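The 3 × 3 crossing described above can be enumerated directly; a minimal Python sketch, with mode labels following the axis wording used in this paper (not an official implementation of the PIC–RAT model):

```python
from itertools import product

# PIC axis: the student's relationship to the technology
pic = ["passive", "interactive", "creative"]
# RAT axis: the technology's impact on the teacher's methods
# (labelled replacement/enhancement/transformation in this paper)
rat = ["replacement", "enhancement", "transformation"]

# Crossing the two axes yields the nine PIC-RAT integration modes
modes = [f"{p} {r}" for p, r in product(pic, rat)]
print(len(modes))  # 9
```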
Using this framework, the extent of GenAI integration in classroom teaching and its educational implications can be systematically evaluated. In the passive mode, students primarily receive information and technology functions as a presentation tool, providing limited support for active knowledge construction or social interaction; consequently, its potential to foster deep learning is theoretically low (Srivastava et al., 2024). The interactive mode foregrounds dialogue and feedback, with technology acting as a conversational partner or scaffold within sociocultural theory. By furnishing timely support within the zone of proximal development, it is theoretically more conducive to conceptual understanding and the collaborative construction of social knowledge (Ramos et al., 2025). The creative mode requires students to employ technology to generate new content or solutions and thus most directly embodies the active, exploratory knowledge construction that constructivism advocates; theoretically, it offers the greatest potential to stimulate higher-order thinking and motivation for deep learning (Srivastava et al., 2024). Existing research shows that higher-level integration, such as Creative Transformation, enhances students’ agency and deep learning outcomes (Kimmons et al., 2022). However, the literature has not systematically examined PIC–RAT as a moderating variable, which this paper investigates further.
An analysis of the 22 included studies showed that GenAI’s collaborative role with teachers and students was primarily manifested as creative transformation and as interactive/passive enhancement. Notably, the interactive-enhancement and passive-enhancement modes frequently co-exist within the same instructional setting. This pattern may arise from the inherently multi-step structure and hierarchical objectives of mathematics teaching. Classrooms involve exploratory phases that require students to engage interactively and construct understanding, alongside instructional and practice phases that focus on knowledge transmission and skill consolidation. At different stages of instruction, teachers therefore adjust GenAI’s functional role according to specific goals, producing a dynamic teaching landscape in which multiple integration modes coexist (Shrestha & Yi, 2025).
Based on an analysis of the instructional content in the included literature, the learning material was classified into four mathematical domains: Number & Algebra, Geometry, Statistics, and Integration. This categorization follows the widely adopted content framework of TIMSS (Trends in International Mathematics and Science Study), which organises mathematics learning into key areas including Number, Algebra, Geometry, and Data & Probability. In the present meta-analysis, the domains of Number and Algebra were merged into a combined “Number & Algebra” category. This consolidation reflects their shared emphasis on symbolic reasoning and procedural thinking, as well as the frequent overlap of these topics in instructional settings. The Statistics category corresponds to content involving data analysis, probability, and statistical inference within the broader “Data & Probability” domain. Finally, Integration refers to studies that explicitly address two or more of the core domains listed above (Yi et al., 2025).
Sample size was dichotomized as large (>100 participants) or small (≤100 participants) based on the conventional cutoff proposed by Bernard et al. (2014). Learning mode was categorised into two types: Independent Learning and Collaborative Learning. Independent Learning describes contexts in which students use GenAI individually and independently to acquire knowledge or complete learning tasks (K. Wang & Guo, 2025). In contrast, Collaborative Learning refers to group-based settings where students interact with one another while using GenAI as a shared tool or mediator to co-construct understanding and solve problems (K. Wang & Guo, 2025). The detailed coding criteria for all variables are presented in Table 2.
After the coders completed their independent work, they resolved discrepancies and reached consensus through discussion. A total of 46 effect sizes were coded, involving N = 5232 participants. Inter-coder consistency was measured by Cohen’s Kappa coefficient, yielding κ = 0.906. According to the criteria of Landis and Koch (1977), this value indicates that the level of agreement met the required standard and that the coding results were highly reliable. See Appendix A Table A3 for detailed coding items and Appendix A Table A4 for Research Characteristics and Effect Size Distributions.

3.3. Data Analysis

All data analysis and modelling were carried out in R (version 4.3.0) using the metafor package (version 4.0.0). The meta-analysis incorporated 46 independent effect sizes drawn from 22 studies. Hedges’ g was computed for each effect size from reported means, standard deviations, sample sizes, or test statistics. Heterogeneity among studies was then assessed using the Q-test and the I² statistic, with the Q-test considered significant at p < 0.10, and I² values interpreted as follows: 25% low, 50% moderate, and 75% high heterogeneity (Borenstein et al., 2017). A random-effects model was selected for pooled analysis when heterogeneity was significant; otherwise, a fixed-effects model was considered. The Q-test was significant (p < 0.001) and I² equalled 48.94%, indicating moderate heterogeneity; consequently, a random-effects model was employed for the subsequent analyses.
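Both quantities named above have simple closed forms. A minimal Python sketch, with illustrative (hypothetical) summary statistics rather than values from the included studies:

```python
from math import sqrt

def hedges_g(m1, s1, n1, m2, s2, n2):
    """Bias-corrected standardized mean difference (Hedges' g)."""
    df = n1 + n2 - 2
    s_pooled = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)
    d = (m1 - m2) / s_pooled          # Cohen's d
    j = 1 - 3 / (4 * df - 1)          # small-sample correction factor
    return j * d

def i_squared(q, k):
    """Higgins' I^2 (%) from Cochran's Q and the number of effect sizes k."""
    return max(0.0, (q - (k - 1)) / q) * 100

# Hypothetical GenAI group vs. control group summary statistics
g = hedges_g(78.0, 10.0, 30, 72.0, 10.0, 30)
```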
To account for differences in true effects across studies and to address the potential multilevel data structure (that is, multiple effect sizes within the same study), we fitted a three-level model. The model indicated that level-2 (within-study) variance accounted for 0% of the total variance while level-3 (between-study) variance accounted for 84.18%, implying that heterogeneity principally arose from differences among studies (Assink & Wibbelink, 2016). The results of model comparisons are reported in Table 3. The two-level model showed lower AIC and BIC values, and the likelihood ratio test was not significant (p = 1.000); hence, a two-level random-effects model was deemed more appropriate. We therefore treated the 46 effect sizes as independent observations. When only a very small number of studies contribute multiple effect sizes, this approach has a negligible impact on meta-analysis results (Van Den Noortgate et al., 2015).
We adopted the restricted maximum likelihood (REML) estimator as the primary method for the random-effects model. This choice rested on methodological considerations: when the number of studies is small or when subgroups contain few observations, the DerSimonian–Laird (DL) estimator may yield unstable and biased estimates of the heterogeneity variance (Veroniki et al., 2016). REML provides more accurate estimates under finite-sample conditions.
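REML requires iterative optimisation, whereas the DerSimonian–Laird estimator it is contrasted with has a closed form, which makes the quantity being estimated easy to see. A minimal sketch using toy data, not the study’s actual effect sizes:

```python
def tau2_dl(effects, variances):
    """DerSimonian-Laird moment estimator of the between-study variance tau^2."""
    w = [1 / v for v in variances]                 # inverse-variance weights
    s1 = sum(w)
    s2 = sum(wi ** 2 for wi in w)
    mean_fe = sum(wi * g for wi, g in zip(w, effects)) / s1        # fixed-effect mean
    q = sum(wi * (g - mean_fe) ** 2 for wi, g in zip(w, effects))  # Cochran's Q
    k = len(effects)
    return max(0.0, (q - (k - 1)) / (s1 - s2 / s1))

# Toy example: three effect sizes with equal sampling variance
tau2 = tau2_dl([0.2, 0.5, 0.8], [0.04, 0.04, 0.04])
```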
To assess robustness, we performed a sensitivity analysis using three estimators, as shown in Table 4. The point estimates of the overall effect size from the three random-effects models were highly consistent (g = 0.535–0.547) and all reached statistical significance. The SJ estimator produced a wider confidence interval and a larger heterogeneity estimate (τ2 = 0.391), reflecting its more conservative weighting of between-study variation; nevertheless, the conclusion that GenAI yields a moderate positive effect on mathematics learning remained unchanged.
These results indicate that the principal findings were insensitive to the choice of estimator and thus robust. For balance between small-sample accuracy and broad methodological acceptance, we therefore report the results based on REML.

3.4. Publication Bias Analysis

Publication bias arises when the selection of studies for publication is influenced by the direction or strength of their results, often due to preferences of journal editors, reviewers, and researchers. To assess publication bias in the included studies, this study employed multiple methods: visual inspection of funnel plots, the fail-safe number test, Egger’s linear regression test, and trim-and-fill method.
A funnel plot was drawn with the effect size (Hedges’ g) on the x-axis and the reciprocal of its standard error on the y-axis (Figure 2). Under the random-effects model, the scatter was roughly symmetric and resembled a typical inverted funnel, which preliminarily suggests no obvious publication bias. Egger’s linear regression test was used to quantify funnel-plot asymmetry. The regression intercept did not deviate significantly from zero (t = 1.68, p = 0.107), so there was no statistical support for the presence of publication bias. The fail-safe number was Nfs = 3030, which far exceeded Rosenthal’s (1979) robustness threshold (5k + 10 = 120, k = 22). This implies that over 3000 unpublished null-effect studies would need to be added to render the pooled effect size non-significant, providing evidence for the robustness of the meta-analysis findings.
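Rosenthal’s fail-safe number and the 5k + 10 tolerance threshold can both be computed directly. A minimal sketch; the z-values passed in would be the studies’ standard normal deviates, which are placeholders here:

```python
def failsafe_n(z_values, z_alpha=1.645):
    """Rosenthal's fail-safe N: number of null-effect studies needed to
    drag the combined (Stouffer) z-score below the one-tailed criterion."""
    k = len(z_values)
    z_sum = sum(z_values)
    return max(0.0, (z_sum / z_alpha) ** 2 - k)

def rosenthal_threshold(k):
    """Tolerance level 5k + 10 against which the fail-safe N is judged."""
    return 5 * k + 10

# With k = 22 included studies, the threshold used in the text is 5*22 + 10
print(rosenthal_threshold(22))  # 120
```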
To further assess potential publication bias, we applied the Duval and Tweedie (2000) trim-and-fill method. After imputing three theoretically missing small-effect studies on the left side of the funnel plot, the distribution became symmetrical. Although the adjusted pooled effect size decreased slightly relative to the original estimate, it remained within the range of a statistically significant moderate positive effect. This indicates that, even when potential missing studies were taken into account, the positive effect of GenAI on mathematics learning remained robust and the core conclusion did not change.
It should be noted, however, that publication bias remains an inherent methodological concern in any meta-analysis. Although our statistical tests did not indicate significant asymmetry, the possibility of unpublished or small-effect studies cannot be entirely ruled out—a consideration particularly relevant given the emerging nature of research on GenAI in mathematics education.
On the basis of these analyses and considerations, we concluded that publication bias did not pose a substantial threat to the study’s main findings.

4. Results

This study systematically evaluated the impact of GenAI on students’ mathematics learning outcomes through a meta-analysis, encompassing both cognitive and non-cognitive dimensions. It also examined the moderating effects of intervention duration, educational stage (primary, secondary and tertiary), learning content area (Number & Algebra, Geometry, etc.), sample size, and the degree of GenAI integration in instruction. To enhance transparency, the distribution of effect sizes from each independent study is displayed in a forest plot (see Figure A1 in Appendix B).

4.1. GenAI Exerts a Moderate Positive Impact on Students’ Mathematics Learning Outcomes

Based on the 22 included studies, 46 effect sizes and 5232 participants, we conducted a meta-analysis, as shown in Table 5. The overall effect size of GenAI on mathematics learning outcomes was g = 0.534 (p < 0.001), which, according to Cohen’s (2009) criteria, falls in the medium-effect range.
To examine how GenAI differentially affected ability types, the study performed a subgroup analysis using Bloom’s taxonomy of educational objectives. The analysis showed that GenAI produced a clear and significant enhancement of cognitive skills, yielding an overall effect size of g = 0.596 (p < 0.001), a moderately large effect. Further breakdown indicated a particularly strong effect on higher-order cognitive skills (g = 0.718, p < 0.001), while lower-order cognitive skills also benefited consistently (g = 0.569, p < 0.001), as shown in Table 6. By contrast, GenAI’s overall impact on non-cognitive skills was smaller and did not reach statistical significance (g = 0.299, p = 0.052); the confidence interval included the null value, indicating that current evidence was insufficient to draw a firm conclusion.
Using the GRADE framework to appraise the robustness of the evidence, we found that while the point estimates suggest a positive impact, key limitations temper the confidence in these conclusions. For overall effects and cognitive skills, moderate to substantial heterogeneity across studies indicates that the effects are context-dependent and not uniform. For non-cognitive outcomes, the very small number of available studies results in imprecise effect estimates, as reflected in wide confidence intervals that include the null value. Consequently, this latter finding in particular should be viewed as preliminary, and all conclusions are amenable to change with future research.

4.2. Analysis of Moderating Effects

To examine how moderating variables affect students’ mathematics learning outcomes, this study performed a subgroup analysis of intervention duration, educational stage, learning content, degree of GenAI integration in instruction, and sample size. The specific results are presented in Table 7.
The analysis indicates that interventions of different durations positively affect students’ mathematics learning outcomes, although the between-group difference did not reach statistical significance (p = 0.189). Specifically, short-term interventions (≤1 month) yielded a large effect size (g = 0.735, p < 0.001), while long-term interventions (>1 month) also produced a significant, though smaller, positive effect (g = 0.376, p < 0.001).
The moderating effect of teaching content on effect size was statistically significant (p = 0.013). Geometric content produced the largest effect (g = 0.906, p = 0.001), which lies in the large-effect range. Number & Algebra content followed (g = 0.784, p < 0.001), corresponding to a medium-to-large effect. Comprehensive content also showed a significant positive impact (g = 0.256, p = 0.004), amounting to a small-to-medium effect. By contrast, the Statistics category had a relatively high point estimate (g = 0.775) but an extremely wide confidence interval and did not reach statistical significance (p = 0.317). Consequently, this result was unstable and should be interpreted with caution.
When grade level was included as a moderator, the between-group difference was not statistically significant (p = 0.149). Generative AI had a significant positive effect on mathematics learning for primary school students (g = 0.754, p < 0.01), secondary school students (g = 0.313, p < 0.01), and tertiary education students (g = 0.667, p < 0.001). These effects range from small-to-medium (secondary) to medium-to-large (primary), but the differences among the three educational stages were not significant.
Analysis using the PIC–RAT framework indicates significant between-group differences in how integration level affects learning outcomes (p = 0.010). The creative-transformation integration mode produced a very large positive effect (g = 1.164, p < 0.001), while the interaction/passive-enhancement mode yielded a small-to-medium effect (g = 0.443, p < 0.001). The creative-transformation mode was significantly superior to the interaction/passive-enhancement mode.
The analysis also revealed significant differences in effect sizes among different sample-size groups (p = 0.006). Small-sample studies showed a relatively large effect size (g = 0.832, p < 0.001), indicating a medium-to-large effect. In contrast, large-sample studies showed a relatively small effect size (g = 0.336, p < 0.001), suggesting a small-to-medium effect.
The analysis revealed a statistically significant difference in effect sizes between the two learning modes (p = 0.025). Both modes independently demonstrated statistically significant positive effects. Specifically, studies employing an independent learning mode yielded a moderate effect size (g = 0.592, p < 0.001). In contrast, studies adopting a collaborative learning mode showed a significantly larger effect, falling within the large range (g = 1.008, p < 0.001).
A notable consideration is the varying evidential basis across moderator subgroups. Several findings with the largest effect sizes, including those for the Creative Transformation integration mode, Geometry content, and Collaborative Learning, are derived from a limited number of independent studies. This necessitates a cautious interpretation of these estimates, as their precision and stability are correspondingly lower.
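As a minimal illustration (not the analysis code used in this study, which presumably relies on dedicated meta-analysis software; function names are ours), the between-group test behind subgroup p-values such as those above can be sketched for two subgroups under simple inverse-variance (fixed-effect) weighting:

```python
import math

def pool(effects, variances):
    """Inverse-variance pooled estimate and its variance."""
    w = [1 / v for v in variances]
    est = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    return est, 1 / sum(w)

def q_between(group_a, group_b):
    """Between-group heterogeneity test for two subgroups (df = 1).
    Each argument is a (effects, variances) pair of study-level data."""
    est_a, var_a = pool(*group_a)
    est_b, var_b = pool(*group_b)
    q = (est_a - est_b) ** 2 / (var_a + var_b)  # Wald-type Q statistic
    p = math.erfc(math.sqrt(q / 2))             # chi-square survival, df = 1
    return q, p
```

With hypothetical subgroup data, e.g. `q_between(([0.8, 0.9], [0.04, 0.04]), ([0.3, 0.35], [0.02, 0.02]))`, a large gap between pooled subgroup estimates yields a significant Q. A random-effects analysis, as typically reported in meta-analyses like this one, would additionally add a between-study variance component to each study’s variance before weighting.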

5. Discussion

5.1. Responses to the First Research Question

In response to the first research question regarding the effect of GenAI on students’ mathematics learning outcomes, this study quantified its impact on both cognitive and non-cognitive skills.
This study found that the enhancing effect of GenAI on mathematical cognitive skills (g = 0.596) exceeded the results reported in previous meta-analyses of general educational AI (e.g., Yi et al., 2025; S. Hwang, 2022). This enhancement may arise from fundamental differences in interaction paradigms between GenAI and earlier AI. Traditional intelligent tutoring systems mostly provide rule-based feedback, whereas GenAI externalises the problem-solving process through chain-of-thought reasoning and natural language dialogue, thereby offering students procedural cognitive scaffolds that can reduce extraneous cognitive load (Sweller, 2011), allowing students to focus their mental resources on deeper conceptual understanding. This capability aligns precisely with the constructivist view that knowledge is actively constructed through social interaction and meaning negotiation (Sánchez Muñoz et al., 2025). Consequently, GenAI can serve as a dynamic cognitive partner in mathematical inquiry, assisting students to shift their focus from obtaining answers to constructing mathematical reasoning itself (Walkington, 2025). Ultimately, the realisation of these benefits depends on how instructional designs and student interactions strategically leverage GenAI as a cognitive tool.
A detailed analysis of cognitive skills reveals a pedagogically important trend: GenAI shows a numerically greater facilitative effect on higher-order cognitive skills (analysis, evaluation, creation; g = 0.718) than on lower-order skills (memory, comprehension, application; g = 0.569). Although the between-group difference did not reach statistical significance, possibly because of the limited number of relevant studies, this pattern suggests that GenAI’s potential is particularly pronounced in supporting tasks that require deep processing, strategic thinking and creative output. For example, in activities such as mathematical proof (Yoon et al., 2024) or problem posing (Walkington et al., 2025), GenAI can act as a thinking collaborator that supports analytical verification and generative reasoning rather than merely serving as an aid for fact recall or procedural practice. Despite the current non-significant difference, the trend provides a preliminary basis for designing GenAI integration models that emphasise the development of higher-order thinking.
The effect size for cognitive skills in this study (g = 0.596) is slightly lower than that reported for multidisciplinary applications of GenAI (Gu & Yan, 2025). This difference highlights the adaptation challenges between the current capabilities of GenAI and the demands of mathematical rigour. Mathematics learning requires logical exactitude, precision in symbol manipulation and a tightly structured body of knowledge. Research has shown that GenAI can produce plausible but incorrect arguments or minor calculation errors when handling mathematical content (Mustapha et al., 2024; Yoon et al., 2024), and these flaws may impede the development of students’ rigorous mathematical thinking. Therefore, to realise deep integration of GenAI into mathematics education, it is urgent to develop subject-specific tools that offer greater transparency in reasoning, reliable symbolic calculation and structured cognitive scaffolding (Hetmanenko & Khoruzha, 2025).
In contrast, the impact of GenAI on non-cognitive skills (g = 0.299, p = 0.052) was marginally significant, with the effect size approaching the conventional threshold and being numerically lower than reports from other fields (e.g., Xia et al., 2025). This finding may reflect the complexity and persistence of emotions related to mathematics learning. Math anxiety and self-efficacy are often intertwined with long-standing belief systems and deep-seated situational factors (Sammallahti et al., 2023). Although evidence indicates that GenAI can reduce cognitive load in specific tasks by means of step-by-step problem decomposition and immediate feedback (Cosentino et al., 2025), short-term, problem-solving-oriented interactions typically struggle to address the diverse, entrenched causes of math anxiety, such as fear of negative evaluation, performance concerns under high time pressure, and an aversion to highly abstract concepts. Therefore, GenAI system design should evolve from an “efficient problem-solving assistant” to a “companion throughout the learning process”: systems must incorporate finer-grained multimodal emotion recognition, analysis of learning engagement, and adaptive motivational frameworks (Barno & Phelps, 2025), thereby moving beyond mere task support to offer personalised and empathetic emotional and motivational scaffolding.

5.2. Responses to the Second Research Question

In response to the second research question, the focus is on variables that moderate effectiveness.
Both short-term (≤1 month, g = 0.735) and long-term (>1 month, g = 0.376) interventions produced positive effects on mathematics learning outcomes, with the short-term intervention showing a larger effect size. However, the difference between the groups did not reach statistical significance (p = 0.189). This finding aligns with previous meta-analyses of educational technology (Ma et al., 2014; Al-Smadi, 2023). The absence of a statistically significant advantage for longer interventions, coupled with the numerically larger point estimate for short-term studies, warrants critical examination. We propose that this pattern serves as a diagnostic mirror reflecting the predominant level of GenAI integration in current practice. The initial, robust effect in short-term interventions can be theoretically linked to the “novelty effect” and heightened “situational interest” (L. Zhou, 2025), which boost engagement and cognitive investment when a new technology is introduced. However, the attenuated effect in longer-term implementations suggests a potential pitfall of superficial integration. This approach often remains at the level of “enhancement/substitution,” primarily relying on technology to reduce short-term cognitive load and provide immediate feedback. If GenAI is used primarily for cognitive offloading (e.g., providing answers) or repetitive practice without fostering deeper cognitive partnership, its initial benefits are susceptible to the limitations outlined by Klar (2025) and may plateau or decline. This can lead to “technology dependency” (J. Liu et al., 2025), where students’ intrinsic motivation and development of metacognitive skills and self-efficacy are undermined, explaining the more modest effect over time.
Therefore, mere length of time is not the key factor. What truly matters is whether progressive instructional scaffolding is designed during technology integration and whether it can support students’ transformation from tool users to cognitive partners (Wulff & Kubsch, 2025). Future research is necessary to further uncover the mediating mechanisms and boundary conditions of GenAI’s impact on learning outcomes across different time spans. Particular attention should be paid to the dynamic interactions among technical proficiency, instructional design adaptability, and students’ self-regulation abilities.
The findings of this study indicate that GenAI positively influences mathematics learning outcomes across different educational stages (primary: g = 0.754; secondary: g = 0.313; tertiary: g = 0.667). However, the differences in effect sizes between stages were not statistically significant (p > 0.05). The higher effect size observed in primary education can be attributed to the strong alignment between GenAI’s capacity for concrete, multimodal generation and students’ cognitive need for concrete operational thinking. At this stage, mathematics learning focuses on building foundational concepts, and GenAI’s strengths in contextualization and visualisation support this process (Walkington et al., 2025; Pando & Leon, 2025). At the tertiary level, the relatively higher effect size reflects learners’ ability to engage in what can be described as “critical collaboration” with GenAI. University students generally possess more advanced formal operational thinking and metacognitive skills. They are able to use GenAI as a “thinking partner” to explore complex problems and verify reasoning (Yoon et al., 2024) while self-regulating their interaction with the technology (J. Liu et al., 2025). This allows for extended thinking through critical dialogue. In contrast, the lower effect size in secondary education coincides with a key period for developing internalised abstract logical reasoning. If GenAI is used primarily as a tool for obtaining answers, this conflicts with the goal of fostering deep reasoning (Song et al., 2024; Zhuang, 2025), potentially limiting GenAI’s positive impact. Overall, these patterns highlight how the effectiveness of GenAI is closely tied to learners’ cognitive characteristics, disciplinary tasks, and how the tool is used at each educational stage.
This study found that GenAI exerts a positive effect across mathematical content domains, but effect sizes differ significantly (p = 0.013). Geometry (g = 0.906) and Number & Algebra (g = 0.784) produced large and medium-to-large effects respectively, Statistics yielded a large but non-significant point estimate (g = 0.775, p = 0.317), and the effect for comprehensive content was relatively small (g = 0.256). This disparity may arise from differing degrees of alignment between GenAI’s core capabilities and the principal cognitive tasks in each domain. The relatively large effect in geometry learning might reflect GenAI’s multimodal generation, which supplies intuitive visual support. For example, in the study by Segal and Klemer (2025), teachers used GenAI to design dynamic geometry exploration tasks, suggesting that it can act as a visual mediator between abstract properties and intuitive representations and so promote spatial reasoning. The advantage observed in Number & Algebra learning may stem from a strong compatibility between GenAI’s chain-of-thought reasoning and procedural symbolic operations. The relatively large point estimate in statistics may reflect that its demands for data processing, algorithm implementation and knowledge retrieval align naturally with the text- and code-generation strengths of GenAI. By contrast, comprehensive content learning typically requires deep concept integration, cross-domain transfer and higher-order problem solving. The smaller effect sizes observed in this domain may not reflect GenAI’s unsuitability (Manzke et al., 2025); rather, instructional design often remains limited to retrieval or practice when employing GenAI, and fails to integrate it as a cognitive partner that supports cross-domain inquiry and creative work (Wulff & Kubsch, 2025).
A further analysis suggests that the comparable effect sizes in Geometry and Number & Algebra indicate a practical ceiling for visual-spatial advantages, which likely stems from current technological limitations in complex reasoning (Oh, 2025). The non-significant effect in the field of Statistics may be due to the limited number of studies (k = 5), which inherently challenges the stability and generalizability of the observed effect.
The extent to which generative AI is integrated into teaching is a key factor modulating its effects on mathematics learning. This study found a significant difference in effect sizes between Creative Transformation (CT) and Interactive/Passive Augmentation (IPA) (p = 0.010). Specifically, CT produced a very large effect (g = 1.164), whereas IPA produced a small-to-moderate effect (g = 0.443). When interpreting these results, it should be noted that the small number of studies in the CT category (k = 6) may reduce the stability of the estimate. Thus, the observed large effect size for CT should be interpreted with caution, pending replication in future studies with larger samples. From the perspective of the PIC–RAT integration model, this difference arises from how well each mode aligns with distinct goal levels in mathematics learning. CT alters the learning process through open-ended, inquiry-based tasks that commonly involve cycles of mathematical modelling, conjecture and proof (Cevikbas & Kaiser, 2021), and therefore requires students to undertake multi-step reasoning and strategy selection. For example, Polydoros et al. (2025) asked students to use ChatGPT to explore practical applications of symmetric figures, design items independently and verify their geometric properties (note: the specific version of ChatGPT was not stated in the study). This task promotes abstract reasoning and concept construction. GenAI functions as a thinking collaborator and exploration partner, and its capacity to generate diverse solution paths and perform complex reasoning (Yu et al., 2025) aligns closely with higher-order thinking activities.
By contrast, IPA primarily concentrates on optimising practice and feedback within existing frameworks, and its empirical scenarios largely involve using GenAI for structured skills training (G.-J. Hwang & Tu, 2021). For example, in the study by X. Wang and Wei (2025), students used Kimi Chat for interactive exercises on geometric theorems: the tool posed questions, students answered, and when errors occurred, they received standard problem-solving steps. Although this approach improved procedural fluency, it essentially reinforced established knowledge and struggled to support deep conceptual understanding and knowledge transfer. Therefore, the notable difference in effect sizes confirmed the central conclusion that “the depth of integration determines the height of utility”. If GenAI is used merely as a tool for providing answers or practice problems (IPA), its effectiveness in enhancing learning is limited. To maximise its educational potential, instructional design must shift towards the CT model and construct a learning environment that guides students through a complete mathematical inquiry process: posing questions, modelling, reasoning, validating and reflecting. This transformation will change GenAI’s role from an answer provider to a thinking stimulator and cognitive partner.
The sample size was a significant moderator affecting the instructional effectiveness of GenAI (p = 0.006). This study found that small-sample studies (n ≤ 100) exhibited a relatively large effect size (g = 0.832), whereas large-sample studies displayed a smaller effect size (g = 0.336). Two explanations should be considered concurrently. First, small-sample studies may have produced stronger effects because of more refined designs, closer guidance and a higher degree of personalisation (Walkington et al., 2025). Second, small-sample bias must be regarded as an important alternative explanation: small-scale studies may have overestimated effects owing to methodological flexibility or publication bias, which cautions against uncritical interpretation and generalisation. By contrast, large-scale studies commonly adopt standardised procedures, and their greater generalisability may come at the cost of reduced personalised interactions. The key challenge, therefore, lies in developing a scalable framework that systematically integrates personalised support (Barno & Phelps, 2025) to bridge the gap between scale and depth. While pedagogical explanations are plausible, small-sample bias remains a primary alternative explanation that tempers the confidence with which the overall effect magnitude, and particularly the strong effects from small-scale studies, can be interpreted and generalised to large-scale educational contexts.
In line with the meta-analytic results reported by K. Wang and Guo (2025), this finding highlights the contextual adaptability of Generative AI (GenAI) in educational settings. In independent mathematics learning, GenAI functions as a personal tutor, providing step-by-step guidance and personalised feedback for exercises such as algebraic operations or geometric proofs. In collaborative learning scenarios, it acts as a facilitator, assisting groups in organising problem-solving approaches, generating visual discussion materials, and co-constructing solutions (Ye et al., 2025). This role flexibility enables both instructional modes to be effective. Furthermore, the core pedagogical value of GenAI, derived from generating contextualised content and providing immediate feedback, is inherently independent of the learning activity’s organisational format (Chen & Hou, 2024). This explains why both independent and collaborative modes significantly supported mathematics learning, even though the collaborative mode yielded the larger effect.
Our study corroborates the findings of K. Wang and Guo (2025), confirming that both individual and collaborative learning can effectively enhance mathematics learning outcomes, with collaborative learning demonstrating a significantly stronger effect. This difference may stem from the inherent alignment between the rigorous logic of mathematics and GenAI’s capacity to function as a “dialogic partner.” In individual learning, GenAI acts as a personal tutor, whose core value lies in stimulating and sustaining continuous cognitive dialogue. For example, when learning about quadratic functions, students can ask GenAI to generate varied word problems and then question, verify, and refine the solutions and graphs it provides—a process that deepens conceptual understanding (Yoon et al., 2024; Zhuang, 2025). The larger effect size observed in collaborative learning suggests that when GenAI-mediated dialogue is embedded in social interaction, a powerful synergistic effect emerges. Here, GenAI transforms from a “personal tutor” into a “collaborative cognitive tool” for the group. For instance, while exploring geometric proofs, team members can jointly propose conjectures to GenAI, ask it to generate proof strategies or counterexamples, and then discuss, critique, and synthesise the AI’s output (Walkington et al., 2025; Segal & Klemer, 2025). This collective negotiation of meaning and co-construction of knowledge significantly amplifies learning benefits (Song et al., 2024).
Furthermore, the effectiveness of GenAI is deeply dependent on disciplinary context. Its impact in mathematics education hinges on how well it integrates into specific subject practices—such as assisting teachers in designing inquiry-based tasks (Bernardi et al., 2025), developing students’ disciplinary language proficiency (Pando & Leon, 2025), or supporting the teaching of specialised content like fractal geometry (Sureda et al., 2025). Thus, GenAI is not merely a generic tool but a collaborative resource capable of embedding itself into mathematical thinking processes and fostering both logical and socially mediated cognitive development.

5.3. Practical Implications

This meta-analysis demonstrates that generative artificial intelligence (GenAI) has a moderate positive effect on mathematics learning, while multiple factors moderate its effectiveness. In particular, the mode of collaboration (individual vs. collaborative), depth of integration, learning content domain and research scale exert significant moderating effects.
For teachers, the core task is to design interventions wisely. Instructional design should strategically integrate collaborative activities where GenAI serves as a group partner. Tasks that use GenAI for exploration and creation should be prioritised over those intended only for practice. Support strategies should be tailored to the characteristics of the teaching content: when teaching structured knowledge, GenAI can provide clear steps or diagrams; when guiding students through comprehensive projects, emphasis should be placed on designing guidance, assessment and integration links.
For researchers, the immediate priority is to strengthen the robustness and depth of evidence. Future studies with larger samples and long-term follow-ups are needed to validate the effects and to conduct in-depth analyses of the specific cognitive processes through which GenAI influences students’ learning. It is necessary to investigate how to implement creative transformation effectively across different teaching contents and how to design technology-intervention models that support comprehensive learning.

5.4. Limitations and Future Research

First, a key finding that warrants careful consideration is that several of the largest effect sizes in this study rest on a relatively limited evidence base. For instance, the pronounced benefits associated with the Creative Transformation integration mode, Geometry learning, and Collaborative Learning are each derived from a modest number of independent studies (k ≤ 8). Thus, these high-effect findings should be interpreted as promising yet preliminary conclusions whose robustness urgently requires verification through future large-scale research.
This limitation is further accentuated by the potential risk of small-sample bias, as indicated by the significant moderating effect of sample size. The effect sizes observed in smaller-scale studies may be influenced by methodological factors or implementation intensity. Consequently, caution is warranted when generalising these findings, particularly from studies with small samples to broader, real-world educational contexts.
In summary, a primary direction for future research is to prioritise large-scale, rigorous replication studies targeting these high-potential areas. Only such research can establish a more stable and reliable evidence base for the scalable application of GenAI in education.

6. Conclusions

This meta-analysis synthesises empirical evidence to examine the effects of GenAI on students’ mathematics learning outcomes and the contextual factors that moderate these effects. Findings indicate that GenAI exerts a moderately positive overall impact on mathematics learning, though its effects vary across outcome types. The strongest impact is observed in cognitive domains—particularly higher-order thinking—while effects on non-cognitive outcomes, though positive, are not statistically significant. Moderator analyses reveal several nuanced patterns: deeper integration of GenAI through creative transformation yields substantially larger gains than superficial interactive enhancement; benefits are more pronounced in geometry and algebra than in other content domains; and collaborative small-group use proves more effective than individual application. Caution is advised in interpreting results from small-sample studies, which may inflate effect estimates. These findings contribute to theory in educational technology and learning engagement, and offer actionable guidance for educators, educational technology developers, and policymakers advancing personalised and engaging learning ecosystems.

Author Contributions

Conceptualization, B.L. and W.Z.; methodology, B.L. and W.Z.; software, B.L.; validation, B.L., W.Z. and F.W.; formal analysis, B.L.; investigation, B.L., W.Z. and F.W.; resources, B.L. and W.Z.; data curation, B.L. and F.W.; writing—original draft preparation, B.L.; writing—review and editing, B.L. and W.Z.; visualization, F.W.; supervision, W.Z.; project administration, B.L. and W.Z.; funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

Funding Agency: Shaanxi Provincial Department of Education, China; Program Name: Teacher Development Research Plan Special Project (Key Project); Project Title: Research on the Difficulties and Solutions of Artificial Intelligence Precision Assistance for Rural Teachers in Western China; Grant Number: 2023JSZ012; Project Principal Investigator: Wenlan Zhang (Corresponding Author).

Institutional Review Board Statement

This study is a meta-analysis and systematic review of previously published literature. It did not involve direct interaction with human subjects or the collection of new primary data. Therefore, obtaining separate ethical approval or individual informed consent forms was not required for this secondary analysis. Our inclusion criteria required that all primary studies analyzed must have declared ethical compliance and obtained informed consent from their respective participants, in accordance with the Declaration of Helsinki.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new original data were generated in this meta-analysis. All data used in the analysis were sourced from public literature and academic databases, including Web of Science, China National Knowledge Infrastructure (CNKI), and EBSCO (search period: January 2023 to September 2025).

Acknowledgments

During the data coding and analysis phase of this study, the artificial intelligence tool DeepSeek-V3.1 assisted with part of the text extraction and categorisation. The authors manually checked and verified all coding results and accept full responsibility for the research design and conclusions.

Conflicts of Interest

The authors declare that they have no competing interests.

Appendix A

Table A1. Literature Search Strategy.
Component | Description
Databases/Platforms | Web of Science, EBSCO (e.g., ERIC, APA PsycINFO), CNKI, Google Scholar
Time Frame | 1 January 2023–30 September 2025
Search Strategy | Boolean queries were constructed by combining terms from four core conceptual groups using the AND operator:
1. Technology: (“generative AI” OR “generative artificial intelligence” OR ChatGPT OR “GenAI” OR “large language model” OR “AI-powered” OR “AI-driven”)
2. Subject: (math OR mathematics OR algebra OR geometry OR calculus OR statistics OR “problem-solving”)
3. Outcome: (learn OR performance OR achievement OR outcomes OR anxiety OR attitudes OR motivation OR “computational thinking” OR skill)
4. Population: (student OR pupil OR learner OR “elementary school” OR “primary school” OR “middle school” OR “high school” OR “undergraduate” OR “higher education”)
The specific syntax and field codes were adapted for each database.
Additional Searches | Manual screening of reference lists and citation tracking for included studies.
Table A2. MERSQI Quality Assessment Scores.
No. | Author(s) & Year | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Total
1 | Febriantoro et al. (2024) | 2 | 0.5 | 0.5 | 2 | 1 | 0 | 2 | 1 | 2 | 1 | 12
2 | Polydoros et al. (2025) | 2 | 0.5 | 0.5 | 2 | 1 | 0 | 2 | 1 | 2 | 1 | 12
3 | Sánchez-Ruiz et al. (2023) | 2 | 0.5 | 0.5 | 1.5 | 1 | 0 | 2 | 1 | 1 | 1 | 10.5
4 | Wahba et al. (2024) | 2 | 0.5 | 0.5 | 2 | 2 | 0 | 2 | 1 | 2 | 1 | 13
5 | Yavich (2025) | 3 | 0.5 | 0.5 | 1.5 | 2 | 0 | 2 | 1 | 2 | 1 | 13.5
6 | X. Wang and Wei (2025) | 2 | 0.5 | 0.5 | 1.5 | 2 | 0 | 2 | 1 | 1 | 1 | 11.5
7 | Xing et al. (2025) | 2 | 0.5 | 0.5 | 2 | 2 | 0 | 2 | 1 | 2 | 1 | 13
8 | Kadhim and Fares (2025) | 2 | 0.5 | 0.5 | 2 | 1 | 0 | 2 | 1 | 2 | 1 | 12
9 | Noviyana et al. (2025) | 2 | 0.5 | 0.5 | 2 | 2 | 0 | 2 | 1 | 2 | 1 | 13
10 | Luo et al. (2024) | 2 | 0.5 | 0.5 | 2 | 2 | 0 | 2 | 1 | 1 | 1 | 12
11 | Dasari et al. (2024) | 2 | 0.5 | 0.5 | 2 | 2 | 0 | 2 | 1 | 2 | 1 | 13
12 | Karaman and Göksu (2024) | 2 | 0.5 | 0.5 | 2 | 2 | 0 | 2 | 1 | 2 | 1 | 13
13 | Utami et al. (2024) | 2 | 0.5 | 0.5 | 2 | 2 | 0 | 2 | 1 | 2 | 1 | 13
14 | Xuan et al. (2025) | 2 | 0.5 | 0.5 | 2 | 2 | 0 | 2 | 1 | 2 | 1 | 13
15 | Nakavachara et al. (2025) | 2 | 0.5 | 0.5 | 2 | 2 | 0 | 2 | 1 | 2 | 1 | 13
16 | Liao (2024) | 2 | 0.5 | 0.5 | 2 | 2 | 0 | 2 | 1 | 2 | 1 | 13
17 | X. C. Liu and Zhang (2025) | 2 | 0.5 | 0.5 | 2 | 2 | 0 | 2 | 1 | 2 | 1 | 13
18 | Fardian et al. (2025) | 2 | 1 | 0.5 | 2 | 2 | 0 | 2 | 1 | 2 | 1 | 13.5
19 | Adelegan (2023) | 2 | 0.5 | 0.5 | 2 | 1 | 0 | 2 | 1 | 2 | 1 | 12
20 | J. Liu et al. (2025) | 2 | 0.5 | 0.5 | 2 | 2 | 0 | 2 | 1 | 2 | 1 | 13
21 | Alvarez (2024) | 2 | 0.5 | 0.5 | 1.5 | 1 | 0 | 2 | 1 | 1 | 1 | 10.5
22 | R. Zhou et al. (2025) | 2 | 0.5 | 0.5 | 2 | 2 | 0 | 2 | 1 | 2 | 1 | 13
Table A3. Document Coding List.
No. | Author(s) & Year | Country | Duration | Stage | Content | Integration | Size | Participant | H/L | C/NC | Learning Mode
1 | Febriantoro et al. (2024) | Indonesia | Long | Primary School | Geometry | CT | Small | 60 | Low | Cognitive | Collaborative Learning
2 | Polydoros et al. (2025) | Greece | N/A | Primary School | Geometry | IPA | Large | 436 | Low | Cognitive | Independent Learning
3 | Sánchez-Ruiz et al. (2023) | Spain | Long | Tertiary | Integration | IPA | Large | 245 | Low | Cognitive | Independent Learning
4 | Sánchez-Ruiz et al. (2023, a) | Spain | Long | Tertiary | Integration | IPA | Large | 246 | Low | Cognitive | Independent Learning
5 | Sánchez-Ruiz et al. (2023, b) | Spain | Long | Tertiary | Integration | IPA | Large | 241 | Low | Cognitive | Independent Learning
6 | Sánchez-Ruiz et al. (2023, c) | Spain | Long | Tertiary | Integration | IPA | Large | 235 | Low | Cognitive | Independent Learning
7 | Sánchez-Ruiz et al. (2023, d) | Spain | Long | Tertiary | Integration | IPA | Large | 238 | Low | Cognitive | Independent Learning
8 | Sánchez-Ruiz et al. (2023, e) | Spain | Long | Tertiary | Integration | IPA | Large | 240 | Low | Cognitive | Independent Learning
9 | Sánchez-Ruiz et al. (2023, f) | Spain | Long | Tertiary | Integration | IPA | Large | 245 | Low | Cognitive | Independent Learning
10 | Wahba et al. (2024) | Jordan | Short | Tertiary | Statistics | CT | Small | 56 | High | Cognitive | Independent Learning
11 | Yavich (2025) | Israel | Long | Secondary School | Number & Algebra | IPA | Small | 50 | High | Cognitive | Collaborative Learning
12 | X. Wang and Wei (2025) | China | Short | Primary School | Integration | IPA | Large | 105 | N/A | Non-cognitive | Independent Learning
13 | Xing et al. (2025) | USA | Short | Secondary School | Number & Algebra | CT | Large | 212 | Low | Cognitive | N/A
14 | Kadhim and Fares (2025) | Iraq | Long | Secondary School | Integration | IPA | Small | 78 | High | Cognitive | N/A
15 | Noviyana et al. (2025) | Indonesia | N/A | Tertiary | Integration | IPA | Small | 60 | High | Cognitive | Independent Learning
16 | Luo et al. (2024) | China | Long | Tertiary | Integration | IPA | Large | 117 | N/A | Non-cognitive | Collaborative Learning
17 | Luo et al. (2024, a) | China | Long | Primary School | Integration | IPA | Large | 117 | N/A | Non-cognitive | Independent Learning
18 | Dasari et al. (2024) | Indonesia | N/A | Primary School | Statistics | IPA | Small | 20 | Low | Cognitive | Independent Learning
19 | Dasari et al. (2024, a) | Indonesia | N/A | Primary School | Statistics | IPA | Small | 20 | Low | Cognitive | Independent Learning
20 | Karaman and Göksu (2024) | Turkey | Long | Tertiary | Geometry | IPA | Small | 39 | Low | Cognitive | Independent Learning
21 | Utami et al. (2024) | Indonesia | Long | Primary School | Geometry | CT | Small | 51 | Low | Cognitive | Collaborative Learning
22 | Utami et al. (2024, a) | Indonesia | Long | Tertiary | Geometry | CT | Small | 51 | Low | Cognitive | Independent Learning
23 | Utami et al. (2024, b) | Indonesia | Long | Tertiary | Geometry | CT | Small | 51 | High | Cognitive | Independent Learning
24 | Xuan et al. (2025) | Vietnam | Short | Tertiary | Number & Algebra | IPA | Small | 60 | Low | Cognitive | Collaborative Learning
25 | Xuan et al. (2025, a) | Vietnam | Short | Primary School | Number & Algebra | IPA | Small | 60 | Low | Cognitive | Collaborative Learning
26 | Nakavachara et al. (2025) | Thailand | Short | Secondary School | Statistics | IPA | Large | 242 | High | Cognitive | Independent Learning
27 | Liao (2024) | China | Long | Primary School | Integration | IPA | Large | 115 | Low | Cognitive | N/A
28 | Liao (2024, a) | China | Long | Secondary School | Integration | IPA | Large | 115 | N/A | Non-cognitive | Independent Learning
29 | Liao (2024, b) | China | Long | Secondary School | Integration | IPA | Large | 115 | N/A | Non-cognitive | Independent Learning
30 | Liao (2024, c) | China | Long | Secondary School | Integration | IPA | Large | 115 | Low | Cognitive | N/A
31 | Liao (2024, d) | China | Long | Secondary School | Integration | IPA | Large | 115 | Low | Cognitive | N/A
32 | Liao (2024, e) | China | Long | Secondary School | Comprehensive | IPA | Large | 115 | N/A | Non-cognitive | N/A
33 | Liao (2024, f) | China | Long | Secondary School | Comprehensive | IPA | Large | 115 | Low | Cognitive | N/A
34 | Liao (2024, g) | China | Long | Secondary School | Comprehensive | IPA | Large | 115 | N/A | Non-cognitive | N/A
35 | X. C. Liu and Zhang (2025) | China | Long | Secondary School | Comprehensive | IPA | Large | 115 | N/A | Non-cognitive | N/A
36 | Fardian et al. (2025) | Indonesia | N/A | Secondary School | Geometry | IPA | Small | 205 | High | Cognitive | N/A
37 | Fardian et al. (2025, a) | Indonesia | N/A | Secondary School | Geometry | IPA | Small | 22 | High | Cognitive | N/A
38 | Adelegan (2023) | USA | Short | Secondary School | Number & Algebra | IPA | Small | 18 | Low | Cognitive | Independent Learning
39 | Adelegan (2023, a) | Nigeria | Short | Secondary School | Number & Algebra | IPA | Small | 28 | Low | Cognitive | Independent Learning
40 | Adelegan (2023, b) | Finland | Short | Secondary School | Number & Algebra | IPA | Small | 28 | Low | Cognitive | Independent Learning
41 | Adelegan (2023, c) | USA | Short | Secondary School | Number & Algebra | IPA | Small | 62 | Low | Cognitive | Independent Learning
42 | Adelegan (2023, d) | Nigeria | Short | Secondary School | Number & Algebra | IPA | Small | 62 | Low | Cognitive | Independent Learning
43 | Adelegan (2023, e) | Finland | Short | Secondary School | Number & Algebra | IPA | Small | 44 | Low | Cognitive | Independent Learning
44 | Z. Liu et al. (2025) | China | Short | Primary School | Number & Algebra | IPA | Large | 104 | Low | Cognitive | Independent Learning
45 | Alvarez (2024) | Philippines | Short | Tertiary | Number & Algebra | IPA | Small | 20 | Low | Cognitive | Independent Learning
46 | R. Zhou et al. (2025) | China | N/A | Tertiary | Statistics | IPA | Small | 29 | Low | Cognitive | Independent Learning
Note: “N/A” indicates that the information is not applicable or was not specified in the original study. H/L: Higher-order/Lower-order cognitive outcome. C/NC: Cognitive/Non-cognitive outcome.
Table A4. Research Characteristics and Effect Size Distributions.
Characteristic | Category | Number of Effect Sizes (n) | % of Effect Sizes
Research Design | Quantitative research | 8 | 17.39%
Research Design | Mixed-methods research | 38 | 82.61%
Region | Asia | 31 | 67.39%
Region | Europe | 10 | 21.74%
Region | North America | 3 | 6.52%
Region | Other (MENA, Africa) | 2 | 4.35%
Grade Level | Primary School | 10 | 21.74%
Grade Level | Secondary School | 20 | 43.48%
Grade Level | Tertiary | 16 | 34.78%
Mathematics Content | Number & Algebra | 12 | 26.09%
Mathematics Content | Geometry | 8 | 17.39%
Mathematics Content | Statistics | 5 | 10.87%
Mathematics Content | Integration | 21 | 45.65%
Outcome Type | Cognitive Skills | 38 | 82.61%
Outcome Type | Lower-order Cognitive | 30 | 65.22%
Outcome Type | Higher-order Cognitive | 8 | 17.39%
Outcome Type | Non-cognitive Skills | 8 | 17.39%
Intervention Duration | Short-term (≤1 month) | 14 | 30.43%
Intervention Duration | Long-term (>1 month) | 25 | 54.35%
Intervention Duration | Not specified | 7 | 15.22%
Integration Degree | CT | 6 | 13.04%
Integration Degree | IPA | 40 | 86.96%
Sample Size | Large | 23 | 50.00%
Sample Size | Small | 23 | 50.00%
Learning Mode | Independent Learning | 29 | 63.04%
Learning Mode | Collaborative Learning | 6 | 13.04%
Learning Mode | Not specified | 11 | 23.91%

Appendix B

References

1. Adams, J. B. (2015). Bloom’s taxonomy of cognitive learning objectives. Journal of the Medical Library Association, 103(3), 152–153.
2. Adelegan, J. (2023). The impact of ChatGPT on students’ performance [Bachelor’s thesis, Lappeenranta–Lahti University of Technology LUT].
3. Ali, O., Murray, P. A., Momin, M., Dwivedi, Y. K., & Malik, T. (2024). The effects of artificial intelligence applications in educational settings: Challenges and strategies. Technological Forecasting and Social Change, 199, 123076.
4. Al-Smadi, M. (2023). ChatGPT and beyond: The generative AI revolution in education. arXiv.
5. Alvarez, J. I. (2024). Evaluating the impact of AI-powered tutors MathGPT and Flexi 2.0 in enhancing calculus learning. Jurnal Ilmiah Ilmu Terapan Universitas Jambi, 8(2), 495–508.
6. Assink, M., & Wibbelink, C. J. M. (2016). Fitting three-level meta-analytic models in R: A step-by-step tutorial. The Quantitative Methods for Psychology, 12(3), 154–174.
7. Barno, E., & Phelps, G. (2025). Using a multi-agent system and evidence-centered design to integrate educator expertise within generated feedback. Education Sciences, 15(10), 1273.
8. Bartolini, A., Batini, F., De Santis, M., Milella, M., Malavasi, P., Morganti, A., Rosati, A., Salvato, R., Signorelli, A., & Sannipoli, M. (Eds.). (2025). La formazione iniziale e continua degli insegnanti: Relazioni, comunicazione, metodi [Initial and continuing teacher education: Relationships, communication, methods]. Pensa MultiMedia.
9. Bastani, H., Bastani, O., Sungu, A., Ge, H., Kabakcı, Ö., & Mariman, R. (2024). Generative AI without guardrails can harm learning: Evidence from high school mathematics. Proceedings of the National Academy of Sciences, 121(6), e2321890121.
10. Bernard, R. M., Borokhovski, E., Schmid, R. F., Tamim, R. M., & Abrami, P. C. (2014). A meta-analysis of blended learning and technology use in higher education: From the general to the applied. Journal of Computing in Higher Education, 26(1), 87–122.
11. Bernardi, M. L., Capone, R., Faggiano, E., & Rocha, H. (2025). Generative AI in mathematics education: Pre-service teachers’ knowledge and implications for their professional development. International Journal of Mathematical Education in Science and Technology, 56(8), 1513–1530.
12. Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2017). Introduction to meta-analysis (2nd ed.). John Wiley & Sons.
13. Borup, J., Graham, C. R., Short, C. R., & Shin, J. K. (2022). Evaluating blended teaching with the 4Es and PICRAT. In C. R. Graham, J. Borup, M. A. Jensen, K. T. Arnesen, & C. R. Short (Eds.), K-12 blended teaching (Vol. 2): A guide to practice within the disciplines (pp. 39–54). EdTech Books. Available online: https://edtechbooks.org/k12blended_math/evaluating_bt (accessed on 13 November 2025).
14. Cevikbas, M., & Kaiser, G. (2021). A systematic review on task design in dynamic and interactive mathematics learning environments (DIMLEs). Mathematics, 9(4), 399.
15. Chen, Y., & Hou, H. (2024). A mobile contextualized educational game framework with ChatGPT interactive scaffolding for employee ethics training. Journal of Educational Computing Research, 62(7), 1517–1542.
16. Cohen, J. (2009). Statistical power analysis for the behavioral sciences (3rd ed.). Lawrence Erlbaum Associates.
17. Cosentino, G., Anton, J., Sharma, K., Gelsomini, M., Giannakos, M., & Abrahamson, D. (2025). Generative AI and multimodal data for educational feedback: Insights from embodied math learning. British Journal of Educational Technology, 56(5), 1686–1709.
18. Dasari, D., Hendriyanto, A., Sahara, S., Suryadi, D., Muhaimin, L. H., Chao, T., & Fitriana, L. (2024). ChatGPT in didactical tetrahedron, does it make an exception? A case study in mathematics teaching and learning. Frontiers in Education, 8, 1295413.
19. De Simone, M., Tiberti, F., Barron Rodriguez, M., Manolio, F., Mosuro, W., & Dikoru, E. J. (2025). From chalkboards to chatbots: Evaluating the impact of generative AI on learning outcomes in Nigeria (Policy Research Working Paper No. 11125). World Bank Group.
20. Duval, S., & Tweedie, R. L. (2000). Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics, 56(2), 455–463.
21. Fardian, D., Suryadi, D., Prabawanto, S., & Jupri, A. (2025). Integrating Chat-GPT in the classroom: A study on linear algebra learning in higher education. International Journal of Information and Education Technology, 15(4), 732–751.
22. Febriantoro, F. S., Fatharani, A., Dewi, N. C., & Kurniati, L. (2024). Assessing the efficacy of coding with Scratch and AI interaction using ChatGPT on 5th graders’ math performance and computational thinking. Reforma: Jurnal Pendidikan dan Pembelajaran, 15(1), 78–99.
23. Ghazi, S. R., Ullah, K., & Jan, F. A. (2016). Concrete operational stage of Piaget’s cognitive development theory: An implication in learning mathematics. GUJR, 32(1), 10–20.
24. Gray, W. M. (1975). The factor structure of concrete and formal operations: A confirmation of Piaget (EDRS Document Reproduction Service No. ED 115 697; TM 004 972). ERIC. Available online: https://eric.ed.gov/ (accessed on 10 October 2025).
25. Gu, J., & Yan, Z. (2025). Effects of GenAI interventions on student academic performance: A meta-analysis. Journal of Educational Computing Research, 63(6), 1460–1492.
26. Hetmanenko, L., & Khoruzha, L. (2025). Leveraging artificial intelligence to enhance mathematics education and overcome instructional challenges. Innovaciencia, 13(1), e5075.
27. Hwang, G.-J., & Tu, Y.-F. (2021). Roles and research trends of artificial intelligence in mathematics education: A bibliometric mapping analysis and systematic review. Mathematics, 9(6), 584.
28. Hwang, S. (2022). Examining the effects of artificial intelligence on elementary students’ mathematics achievement: A meta-analysis. Sustainability, 14(20), 13185.
29. Kadhim, T. M., & Fares, I. J. (2025). The impact of the generative model supported by artificial intelligence as an advanced organizer on high-order thinking skills among middle school students in mathematics. International Journal of Environmental Sciences, 11(4s), 8–17. Available online: https://theaspd.com/index.php/ijes/article/view/414 (accessed on 13 January 2026).
30. Karaman, M. R., & Göksu, İ. (2024). Are lesson plans created by ChatGPT more effective? An experimental study. International Journal of Technology in Education (IJTE), 7(1), 107–127.
31. Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., Krusche, S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet, O., Sailer, M., Schmidt, A., Seidel, T., … Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274.
32. Kim, H. K., Roknaldin, A., Nayak, S., Zhang, X., Yang, M., Twyman, M., & Lu, S. (2024, June 23–26). ChatGPT and me: Collaborative creativity in a group brainstorming with generative AI [Conference proceeding]. ASEE Annual Conference & Exposition, Portland, Oregon.
33. Kimmons, R., Draper, D., & Backman, J. (2022). PICRAT. EdTechnica. Available online: https://edtechbooks.org/encyclopedia/picrat (accessed on 10 October 2025).
34. Klar, M. (2025). Using ChatGPT is easy, using it effectively is tough? A mixed methods study on K-12 students’ perceptions, interaction patterns, and support for learning with generative AI chatbots. Smart Learning Environments, 12(1), 32.
35. Kumar, A., Tak, T. K., Ali, S. M. S., Haque, M., Paralkar, T. A., Kshirsagar, P. R., & Upreti, K. (2025). Predictive modeling of student learning outcomes through cognitive and emotional skill integration. International Research Journal of Multidisciplinary Scope (IRJMS), 6(1), 892–910.
36. Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
37. Liao, X. F. (2024). A study of the effects of generative comment feedback on learning achievement, motivation, and self-regulated learning: A case study of middle school mathematics [Master’s thesis, Central China Normal University]. Available online: https://kns.cnki.net/kcms/detail/detail.aspx?dbname=CMFD202402&filename=1024376457.nh (accessed on 13 January 2026).
38. Liu, J., Sun, D., Sun, J., Wang, J., & Yu, P. L. H. (2025). Designing a generative AI enabled learning environment for mathematics word problem solving in primary schools: Learning performance, attitudes and interaction. Computers and Education: Artificial Intelligence, 9, 100438.
39. Liu, X. C., & Zhang, J. (2025). An empirical study on improving primary school students’ innovative abilities in mathematics teaching supported by generative artificial intelligence. Journal of Western Quality Education, 11(16), 100–103.
40. Liu, Z., Zhao, Y., Zuo, H., & Lu, Y. (2025). Perceived satisfaction, perceived usefulness, and interactive learning environments as predictors of university students’ self-regulation in the context of GenAI-assisted learning: An empirical study in mainland China. Frontiers in Psychology, 16, 1599478.
41. Luo, H., Liao, X. F., Ru, Q. Q., & Wang, Z. F. (2024). Generative AI-supported teacher comments: An empirical study based on junior high school mathematics classrooms. Journal of Educational Technology Research, 45(5), 58–65.
42. Ma, W., Adesope, O. O., Nesbit, J. C., & Liu, Q. (2014). Intelligent tutoring systems and learning outcomes: A meta-analysis. Journal of Educational Psychology, 106(4), 901–918.
43. Manzke, L. S., Conrad, C. D., Marchildon, P., Raisinghani, M., & Xie, R. (2025, August 14–16). Artificial intelligence in the classroom: Can GenAI teach effectively? [Conference proceeding]. AMCIS 2025 Proceedings, Montréal, QC, Canada. Available online: https://aisel.aisnet.org/amcis2025/paperathon/paperathon/2 (accessed on 4 November 2025).
44. Marzano, D. (2025). Generative Artificial Intelligence (GAI) in teaching and learning processes at the K-12 level: A systematic review. Technology, Knowledge and Learning, 1–41.
45. Mustapha, K. B., Yap, E. H., & Abakr, Y. A. (2024). Bard, ChatGPT and 3DGPT: A scientometric analysis of generative AI tools and assessment of implications for mechanical engineering education. Interactive Technology and Smart Education, 21(4), 588–624.
46. Nakavachara, V., Potipiti, T., & Chaiwat, T. (2025). Experimenting with generative AI: Does ChatGPT really increase everyone’s productivity? (Puey Ungphakorn Institute for Economic Research Working Paper). Faculty of Economics, Chulalongkorn University.
47. Noviyana, H., Rahmawati, F., Kirana, A. R., & Tanod, M. J. (2025). Enhancing elementary students’ mathematical problem-solving skills through AI-assisted problem-based learning. Journal of Integrated Elementary Education, 5(2), 254–268.
48. Oh, S. (2025). Evaluating mathematical problem-solving abilities of generative AI models: Performance analysis of o1-preview and gpt-4o using the Korean College Scholastic Ability Test. IEEE Access, 13, 1227–1235.
49. Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., & Moher, D. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ, 372, n71.
50. Pando, M., & Leon, M. (2025). Mathematics disciplinary literacy: A case study of a bilingual teacher’s interaction with ChatGPT. Language and Education, 1–20.
51. Passolunghi, M. C., De Vita, C., & Pellizzoni, S. (2020). Math anxiety and math achievement: The effects of emotional and math strategy training. Learning and Individual Differences, 79, 101868.
52. Pigott, T. D., & Polanin, J. R. (2020). Methodological guidance paper: High-quality meta-analysis in a systematic review. Review of Educational Research, 90(1), 24–46.
53. Polydoros, G., Galitskaya, V., Antoniou, A.-S., & Drigas, A. (2025). AI technology integration in elementary geometry and its effects on performance, anxiety levels, learning styles, cognitive styles, and executive functions. Scientific Electronic Archives, 18(2), 1–11.
54. Poynton, K. (2015). Cognitive and non-cognitive learning factors: A literature review. Centre for Inspiring Minds.
55. Qu, X., Sherwood, J., Liu, P., & Aleisa, N. (2025, April 26–May 1). Generative AI tools in higher education: A meta-analysis of cognitive impact [Conference proceeding]. CHI Conference on Human Factors in Computing Systems, Yokohama, Japan.
56. Ramos, D. S., Chaparro, I., Padilla, J., Casallas, R., Cruz, J. C., & Reyes, L. H. (2025). Integrating generative AI with the dialogic model in education: The cognitive-AI synergy framework (CASF). Preprints.org.
57. Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641.
58. Sammallahti, E., Finell, J., Jonsson, B., & Korhonen, J. (2023). A meta-analysis of math anxiety interventions. Journal of Numerical Cognition, 9(2), 346–362.
59. Sánchez Muñoz, J. A., Flores-Eraña, G., Silva-Campos, J. M., Chavira-Quintero, R., & Olais-Govea, J. M. (2025). GenAI as a cognitive mediator: A critical-constructivist inquiry into computational thinking in pre-university education. Frontiers in Education, 10, 1597249.
60. Sánchez-Ruiz, L. M., Moll-López, S., Nuñez-Pérez, A., Moraño-Fernández, J. A., & Vega-Fleitas, E. (2023). ChatGPT challenges blended learning methodologies in engineering education: A case study in mathematics. Applied Sciences, 13(10), 6039.
61. Segal, R., & Klemer, A. (2025). Dialogic interactions between mathematics teachers and GenAI: Multi-environment task design and its contribution to TPACK. International Journal of Mathematical Education in Science and Technology, 1–25.
62. Shrestha, R., & Yi, M. (2025). Pre-service teachers’ perceptions of adopting generative AI tools in teaching mathematics: Insights from a TPACK-based workshop. In R. Jake Cohen (Ed.), Proceedings of society for information technology & teacher education international conference (pp. 874–879). Association for the Advancement of Computing in Education (AACE). Available online: https://www.learntechlib.org/primary/p/225611/ (accessed on 21 October 2025).
63. Song, Y., Kim, J., Liu, Z., Li, C., & Xing, W. (2024). Students’ perceived roles, opportunities, and challenges of a generative AI-powered teachable agent: A case of middle school math class. Journal of Research on Technology in Education, 1–19.
64. Srivastava, A., Vaidya, V., Murthy, S., & Dasgupta, C. (2024). GeoSolvAR: Scaffolding spatial perspective-taking ability of middle-school students using AR-enhanced inquiry learning environment. British Journal of Educational Technology, 55(6), 2617–2638.
65. Stephenson, D. E. (2022). Effectiveness of individual-level resource building interventions in the workplace: A meta-analysis [Master’s thesis, University of Canterbury].
66. Sureda, P., Parra, V., Corica, A. R., Godoy, D., & Schiaffino, S. (2025). On the role of generative AI in fractals teaching: Solutions and class proposals designed by chatbots and mathematics teachers. International Journal of Education in Mathematics Science and Technology, 13(5), 1298–1316.
67. Sweller, J. (2011). Cognitive load theory. In J. P. Mestre, & B. H. Ross (Eds.), Psychology of learning and motivation (Vol. 55, pp. 37–76). Academic Press.
68. Utami, I. Q., Hwang, W.-Y., & Hariyanti, U. (2024). Contextualized and personalized math word problem generation in authentic contexts using generative pre-trained transformer and its influences on geometry learning. Journal of Educational Computing Research, 62(6), 1384–1419.
69. Van Den Noortgate, W., López-López, J. A., Marín-Martínez, F., & Sánchez-Meca, J. (2015). Meta-analysis of multiple outcomes: A multilevel approach. Behavior Research Methods, 47(4), 1274–1294.
70. Veroniki, A. A., Jackson, D., Viechtbauer, W., Bender, R., Bowden, J., Knapp, G., Reeves, B. C., Higgins, J. P. T., Thomas, J., & Ioannidis, J. P. A. (2016). Methods to estimate the between-study variance and its uncertainty in meta-analysis. Research Synthesis Methods, 7(1), 55–79.
71. Wahba, F., Ajlouni, A. O., & Abumosa, M. A. (2024). The impact of ChatGPT-based learning statistics on undergraduates’ statistical reasoning and attitudes toward statistics. EURASIA Journal of Mathematics, Science and Technology Education, 20(7), em2468.
72. Walkington, C. (2025). The implications of generative artificial intelligence for mathematics education. School Science and Mathematics, 125(1), 1–8.
73. Walkington, C., Pando, M., Lipsmeyer, L. L., Beauchamp, T., Sager, M., & Milton, S. (2025). Middle school girls using generative AI to engage in mathematical problem-posing. Mathematical Thinking and Learning, 1–22.
74. Wang, J., & Fan, W. (2025). The effect of ChatGPT on students’ learning performance, learning perception, and higher-order thinking: Insights from a meta-analysis. Humanities and Social Sciences Communications, 12(1), 621.
75. Wang, K., & Guo, Z. (2025). Can learners’ use of GenAI enhance learning engagement? A meta-analysis. Education Sciences, 15(12), 1578.
76. Wang, X., & Wei, Y. (2025). The influence of Gen-AI assisted learning on primary school students’ math anxiety: An intervention study. Applied Cognitive Psychology, 39(4), e70088.
77. Wardat, Y., Tashtoush, M. A., AlAli, R., & Jarrah, A. M. (2023). ChatGPT: A revolutionary tool for teaching and learning mathematics. Eurasia Journal of Mathematics, Science and Technology Education, 19(7), em2286.
78. Wu, B., Chang, X., & Hu, Y. (2023). A meta-analysis of the effects of spherical video-based virtual reality on cognitive and non-cognitive learning outcomes. Interactive Learning Environments, 32(7), 3472–3489.
79. Wu, J., Tlili, A., Salha, S., Mizza, D., Saqr, M., López-Pernas, S., & Huang, R. (2025). Unlocking the potential of artificial intelligence in improving learning achievement in blended learning: A meta-analysis. Frontiers in Psychology, 16, 1691414.
80. Wulff, P., & Kubsch, M. (2025). Learning against the machine: The double edged sword of (Gen)AI in STEM education. International Journal of STEM Education, 12(1), 66.
81. Xia, Q., Zhang, P., Huang, W., & Chiu, T. K. F. (2025). The impact of generative AI on university students’ learning outcomes via Bloom’s taxonomy: A meta-analysis and pattern mining approach. Asia Pacific Journal of Education, 1–31.
82. Xing, W., Song, Y., Li, C., Liu, Z., Zhu, W., & Oh, H. (2025). Development of a generative AI-powered teachable agent for middle school mathematics learning: A design-based research study. British Journal of Educational Technology, 56, 2043–2077.
83. Xuan, S. H., Nguyen, A. T., Nguyen, T., Nguyen, L., Nguyen, H., Pham, N., Phung, T., Ngo, B., Nguyen, V., Nguyen, M., Tran, T., Le, T., Nguyen, K., & FNU, P. (2025). Evaluating the impact of generative AI in mathematics education: A comparative study in Vietnamese high schools. Human Behavior and Emerging Technologies, 2025, 8886206.
84. Yanar, A. N., & Ergene, Ö. (2025). Integrating artificial intelligence in education: How pre-service mathematics teachers use ChatGPT for 5E lesson plan design. Journal of Pedagogical Research, 9(2), 158–176.
85. Yavich, R. (2025). Improving learning outcomes in advanced mathematics for underprepared university students through AI-driven educational tools. African Educational Research Journal, 13(2), 224–239.
86. Ye, X., Zhang, W., Zhou, Y., Li, X., & Zhou, Q. (2025). Improving students’ programming performance: An integrated mind mapping and generative AI chatbot learning approach. Humanities and Social Sciences Communications, 12(1), 558.
87. Yi, L., Liu, D., Jiang, T., & Xian, Y. (2025). The effectiveness of AI on K-12 students’ mathematics learning: A systematic review and meta-analysis. International Journal of Science and Mathematics Education, 23(4), 1105–1126.
88. Yoon, H., Hwang, J., Lee, K., Roh, K. H., & Kwon, O. N. (2024). Students’ use of generative artificial intelligence for proving mathematical statements. ZDM—Mathematics Education, 56(7), 1531–1551.
89. Yu, M., Liu, Z., Long, T., Li, D., Deng, L., Kong, X., & Sun, J. (2025). Exploring cognitive presence patterns in GenAI-integrated six-hat thinking technique scaffolded discussion: An epistemic network analysis. International Journal of Educational Technology in Higher Education, 22(1), 48.
90. Zhou, L. (2025). Interdisciplinary teaching quality monitoring for primary majors under “Double-High” policy: A case study of Hefei Preschool Education College. International Journal of Knowledge Management, 21(1), 1–20.
91. Zhou, R., He, X., Fan, Q., Li, Y., Li, Y., Xiao, X., & Fang, J. (2025). Exploring ChatGPT-facilitated scaffolding in undergraduates’ mathematical problem solving. Journal of Computer Assisted Learning, 41, e70077.
92. Zhuang, Y. (2025). Lessons from using ChatGPT in calculus: Insights from two contrasting cases. Journal of Formative Design in Learning, 9(1), 25–35.
Figure 1. Literature screening process.
Figure 2. Funnel plot.
Table 1. Literature Screening Criteria.
Screening Stage | Inclusion Criteria | Exclusion Criteria | Literature Count
Initial Screening | 1. Records identified from databases (Web of Science, EBSCO, CNKI), Google Scholar, and other methods | 1. Duplicate records removed (n = 369); 2. Records removed for other reasons (n = 77) | Initial: 2104; After: 1658
Title/Abstract Screening | 1. Records assessed for relevance to the research topic | 1. Records excluded as unrelated to the research topic (n = 1489) | Initial: 1658; After: 169
Full-Text Eligibility Assessment | 1. Experimental or quasi-experimental studies employing GenAI as the intervention, with a control group receiving traditional instruction; 2. Complete effect size data, or data from which effect sizes could be calculated (e.g., means, standard deviations, sample sizes); 3. MERSQI score ≥ 10.5 points | 1. Not an experimental or quasi-experimental design (n = 124); 2. Incomplete data for effect size calculation (n = 22); 3. MERSQI score below 10.5 (n = 1) | Initial: 169; After: 22
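The second full-text criterion admits studies whose means, standard deviations, and sample sizes allow a standardized mean difference to be computed. As a minimal illustration (a sketch, not the authors' own analysis script), the Hedges' g reported throughout the results can be derived from such summary statistics like this:

```python
import math

def hedges_g(m1, m2, sd1, sd2, n1, n2):
    """Standardized mean difference (Hedges' g): Cohen's d computed
    with the pooled SD, then shrunk by the small-sample correction J."""
    # Pooled standard deviation across treatment and control groups
    sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    # Correction factor J ≈ 1 - 3 / (4*df - 1), with df = n1 + n2 - 2
    j = 1 - 3 / (4 * (n1 + n2 - 2) - 1)
    return j * d
```

For two groups of 50 with a one-SD mean difference, `hedges_g(1.0, 0.0, 1.0, 1.0, 50, 50)` returns slightly under 1, reflecting the small-sample shrinkage.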
Table 2. Specific Coding Criteria.
Dimension | Category | Description | References
Cognitive or Non-cognitive | Cognitive Skills | Higher-order cognitive skills; lower-order cognitive skills | (Xia et al., 2025)
Cognitive or Non-cognitive | Non-cognitive Skills | Affective, motivational, and related abilities | (Xia et al., 2025)
Intervention Duration | Short-term | ≤1 month | (Stephenson, 2022)
Intervention Duration | Long-term | >1 month | (Stephenson, 2022)
Sample Size | Small | ≤100 participants | (Bernard et al., 2014)
Sample Size | Large | >100 participants | (Bernard et al., 2014)
Education Level | Primary School | Primary school students | (Bartolini et al., 2025)
Education Level | Secondary School | Junior or senior high school students | (Bartolini et al., 2025)
Education Level | Tertiary | University students | (Bartolini et al., 2025)
Learning Content | Number & Algebra | Number or algebra | (Yi et al., 2025)
Learning Content | Geometry | Geometry | (Yi et al., 2025)
Learning Content | Statistics | Data and chance | (Yi et al., 2025)
Learning Content | Integration | Involves two or more of the above core fields | (Yi et al., 2025)
Degree of GenAI Integration | CT | Creative Transformation | (Borup et al., 2022)
Degree of GenAI Integration | IPA | Interactive or Passive Augmentation (a combined category for interventions fitting either or both modes) | (Borup et al., 2022)
Learning Mode | Independent Learning | Self-directed use of GenAI for learning | (K. Wang & Guo, 2025)
Learning Mode | Collaborative Learning | Group-based interaction with GenAI for learning | (K. Wang & Guo, 2025)
Table 3. Model Fit Comparison Results.
Model | DF | AIC | BIC | AICc | logLik | LRT | p-Value | QE
Full | 3 | 94.98 | 100.4 | 95.57 | −44.49 | | | 90.91
Reduced | 2 | 92.98 | 96.6 | 93.27 | −44.49 | 0 | 1 | 90.91
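Table 3 compares the full and reduced models through information criteria and a likelihood-ratio test on their log-likelihoods. A hedged sketch of how these statistics follow from the log-likelihoods (function names are illustrative, not from the original analysis, which was fitted with dedicated meta-analytic software):

```python
import math

def aic(loglik, k):
    """Akaike information criterion: 2k - 2*logLik for k parameters."""
    return 2 * k - 2 * loglik

def lrt_1df(loglik_full, loglik_reduced):
    """Likelihood-ratio test when the reduced model drops one parameter.
    LRT = 2*(llf - llr); with 1 df the chi-square survival function
    simplifies to erfc(sqrt(LRT/2))."""
    lrt = max(0.0, 2 * (loglik_full - loglik_reduced))
    p = math.erfc(math.sqrt(lrt / 2))
    return lrt, p
```

With the values in Table 3, `aic(-44.49, 3)` reproduces the full model's AIC of 94.98, and identical log-likelihoods give an LRT of 0 with p = 1, which is why the more parsimonious reduced model is preferred.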
Table 4. Sensitivity Analysis of Effect Sizes Across Different Estimators.
Estimator | Estimate | SE | 95% CI | Tau2 | ST | I2 (%) | H2 | Q | P
DL | 0.535 | 0.098 | [0.343, 0.728] | 0.2142 | 0.463 | 50.5 | 2.02 | 90.91 | 0.001
REML | 0.534 | 0.097 | [0.345, 0.723] | 0.2013 | 0.449 | 48.9 | 1.96 | 90.91 | 0.001
SJ | 0.547 | 0.117 | [0.318, 0.776] | 0.3908 | 0.391 | 65.1 | 2.86 | 90.91 | 0.001
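The DL row in Table 4 rests on the DerSimonian-Laird moment estimator of tau² derived from Cochran's Q; the REML and SJ rows replace only that estimation step. A minimal sketch of DL pooling with illustrative inputs (not the study's actual effect sizes), including the I² and H² statistics reported alongside it:

```python
def dl_random_effects(effects, variances):
    """DerSimonian-Laird random-effects pooling.
    Returns (pooled_effect, tau2, i2_percent, h2)."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * g for wi, g in zip(w, effects)) / sw
    # Cochran's Q under fixed-effect (inverse-variance) weights
    q = sum(wi * (g - fixed) ** 2 for wi, g in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)              # DL moment estimator
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    h2 = q / df
    # Re-weight each study by its total variance v_i + tau2
    wr = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * g for wi, g in zip(wr, effects)) / sum(wr)
    return pooled, tau2, i2, h2
```

When the effects are homogeneous, Q falls below its degrees of freedom and tau² is truncated at zero, so the random-effects estimate collapses to the fixed-effect one.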
Table 5. Meta-Analytic Results of GenAI’s Effects on Mathematics Learning Outcomes: Overall Effects, Cognitive/Non-Cognitive Skills.
Outcome Variables | n | g | 95% CI | Q | P
Overall | 46 | 0.534 *** | [0.345, 0.723] | |
Cognitive | 38 | 0.596 *** | [0.367, 0.824] | 1.355 | 0.2443
Non-cognitive | 8 | 0.299 | [−0.003, 0.601] | |
Note. *** p < 0.001.
Table 6. Meta-Analytic Results of GenAI’s Effects on Mathematics Learning Outcomes: Higher/Lower-Order Cognitive Skills.
Outcome Variables | n | g | 95% CI | Q | P
Cognitive-high | 8 | 0.718 *** | [0.344, 1.092] | 1.735 | 0.42
Cognitive-low | 30 | 0.569 *** | [0.298, 0.840] | |
Note. *** p < 0.001.
Table 7. Subgroup Analysis Results of Moderator Variables.
Domain | Moderator | Subgroup | n | g | 95% CI | Q | P
Intervention settings | Intervention Duration | Long | 25 | 0.376 *** | [0.172, 0.579] | 3.330 | 0.1892
 | | Short | 14 | 0.735 *** | [0.468, 1.002] | |
 | | N/A | 7 | 0.672 | [−0.466, 1.810] | |
 | Sample Size | Small | 23 | 0.832 *** | [0.470, 1.193] | 7.501 | 0.0062
 | | Large | 23 | 0.336 *** | [0.151, 0.522] | |
Educational context | Learning Content | Integration | 21 | 0.256 ** | [0.081, 0.431] | 10.750 | 0.0131
 | | Geometry | 8 | 0.906 ** | [0.366, 1.446] | |
 | | Number & Algebra | 12 | 0.784 *** | [0.469, 1.098] | |
 | | Statistics | 5 | 0.775 | [−0.742, 2.293] | |
 | Grade Level | Primary School | 10 | 0.754 ** | [0.196, 1.313] | 3.811 | 0.1487
 | | Secondary School | 20 | 0.313 ** | [0.105, 0.520] | |
 | | Tertiary | 16 | 0.667 *** | [0.285, 1.049] | |
GenAI application features | Integration Degree | CT | 6 | 1.164 *** | [0.656, 1.673] | 6.624 | 0.0101
 | | IPA | 40 | 0.443 *** | [0.252, 0.634] | |
 | Learning Mode | Independent Learning | 29 | 0.592 *** | [0.328, 0.856] | 7.372 | 0.0251
 | | Collaborative Learning | 6 | 1.008 *** | [0.522, 1.494] | |
Note. ** p < 0.01, *** p < 0.001. “N/A” indicates that the information is not applicable or was not specified in the original study.
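The Q statistics in Table 7 test whether the subgroup means differ more than sampling error allows. A minimal sketch of the between-groups Q computation under standard inverse-variance weighting follows; the subgroup standard errors below are assumed for illustration, not taken from the study:

```python
def q_between(subgroup_means, subgroup_ses):
    """Between-subgroup heterogeneity statistic (compared to a chi-square
    distribution with #subgroups - 1 degrees of freedom)."""
    w = [1 / se ** 2 for se in subgroup_ses]              # inverse-variance weights
    grand = sum(wi * m for wi, m in zip(w, subgroup_means)) / sum(w)
    return sum(wi * (m - grand) ** 2 for wi, m in zip(w, subgroup_means))

# Illustrative: the two sample-size subgroups from Table 7,
# with hypothetical pooled SEs of 0.18 and 0.095
qb = q_between([0.832, 0.336], [0.18, 0.095])
df = 2 - 1  # number of subgroups minus one
```

When the subgroup means coincide, Q_between is exactly zero; larger gaps between precisely estimated subgroup means (e.g., small vs. large samples) drive Q upward and the moderator p-value downward.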