Article

Insights on the Pedagogical Abilities of AI-Powered Tutors in Math Dialogues

by Verónica Parra 1,2,3, Ana Corica 1,2,3 and Daniela Godoy 2,3,*
1 Núcleo de Investigación en Educación Matemática (NIEM), Universidad Nacional del Centro de la Provincia de Buenos Aires (UNCPBA)-Comisión de Investigaciones Científicas de la Provincia de Buenos Aires (CICPBA), Tandil B7000, Argentina
2 Instituto Superior de Ingeniería de Software de Tandil (ISISTAN), Universidad Nacional del Centro de la Provincia de Buenos Aires (UNCPBA)-Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Tandil B7000, Argentina
3 Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires C1425, Argentina
* Author to whom correspondence should be addressed.
Information 2026, 17(1), 51; https://doi.org/10.3390/info17010051
Submission received: 2 December 2025 / Revised: 26 December 2025 / Accepted: 1 January 2026 / Published: 6 January 2026
(This article belongs to the Special Issue AI Technology-Enhanced Learning and Teaching)

Abstract

AI-powered tutors that interact with students in question-answering scenarios using large language models (LLMs) as foundational models for generating responses represent a potentially scalable solution to the growing demand for one-to-one tutoring. In fields like mathematics, where students often face difficulties, sometimes leading to frustration, easy-to-use natural language interactions emerge as an alternative for enhancing engagement and providing personalized advice. Despite their promising potential, the challenges for LLM-based tutors in the math domain are twofold. First, the absence of genuine reasoning and generalization abilities in LLMs frequently results in mathematical errors, ranging from inaccurate calculations to flawed reasoning steps and even the appearance of contradictions. Second, the pedagogical capabilities of AI-powered tutors must be examined beyond simple question-answering scenarios since their effectiveness in math tutoring largely depends on their ability to guide students in building mathematical knowledge. In this paper, we present a study exploring the pedagogical aspects of LLM-based tutors through the analysis of their responses in math dialogues using feature extraction techniques applied to textual data. The use of natural language processing (NLP) techniques enables the quantification and characterization of several aspects of the pedagogical strategies deployed in the answers, which the literature identifies as essential for engaging students and providing valuable guidance in mathematical problem-solving. The findings of this study have direct practical implications for the design of more effective math AI-powered tutors as they highlight the most salient characteristics of valuable responses and can thus inform the training of LLMs.

1. Introduction

The remarkable success of large language models (LLMs) as conversational systems has fostered the development of AI tutors that interact with students for assistance in diverse subjects. The use of LLMs as foundational models underlying AI tutors, however, raises several issues that require careful examination. One of the most prominent concerns is related to their probabilistic output, which can lead to unreliable results and potentially expose students to incorrect information. For tutoring in areas such as mathematics, this matter acquires special relevance as LLMs’ lack of real symbolic reasoning often leads to serious mathematical mistakes [1], which may adversely affect students. In addition, a frequently overlooked issue in the evaluation of AI tutors concerns their ability to implement appropriate pedagogical strategies. Taking both aspects into account, there are no guarantees of the validity or appropriateness of the tutor assistance [2].
Often neglected, the pedagogical abilities of LLM-powered AI tutors become essential in educational contexts where students must receive sufficient, helpful, and accurate guidance. Ideally, a good tutor should avoid revealing the answer immediately and instead offer relevant and supportive guidance, such as hints, explanations, or guiding questions [3]. Some efforts have been made to incorporate pedagogical dimensions into the evaluation of AI tutoring systems, taking into account whether they speak like a teacher, understand a student, and help a student [4,5]; whether they are coherent, correct, and equitable [6]; or focusing on their usefulness, care, and humanness [7]. Recently, a unified taxonomy was defined in [3] to assess the pedagogical value of LLM-powered AI tutor responses starting from student mistakes or confusions in the mathematical domain. This taxonomy includes mistake identification, mistake location, revealing of the answer, providing guidance, actionability, coherence, tutor tone, and human-likeness as relevant pedagogical issues.
In this paper, we aim to inspect how LLM-powered tutors behave in terms of these pedagogical aspects by automatically extracting features from the text of tutor responses to students. The characteristics of the texts generated by LLMs are analyzed with respect to writing style (readability and lexical richness), tone (sentiment, emotions, and personality), and dialogue coherence, with the aim of gaining insight into how interactions with students unfold. Features spanning multiple dimensions of analysis are explored in this work, leveraging text processing techniques as well as lexicons and pre-trained models to automatically extract their values from the text of responses and thereby characterize the LLM pedagogical competencies that are necessary for effective AI tutoring in math educational dialogues.
The study carried out, therefore, is guided by the two following research questions:
RQ1:
To what extent are a given dimension of analysis (e.g., linguistic, emotional, etc.) and its associated set of features useful to characterize the guidance provided by tutors in their responses to students? In other words, can pedagogically competent responses from tutors be distinguished along a specific dimension of analysis?
RQ2:
To what extent do the responses of LLM tutors differ in a certain dimension among themselves and with respect to human tutors?
This work is organized as follows. Section 2 summarizes the previous works that investigated the pedagogical abilities of AI-based tutors. Section 3 postulates the research questions that guided this study and describes the data and methods for experimentation. Section 4 reports the results of the experimentation carried out. Finally, the conclusions are drawn in Section 5.

2. Related Works

2.1. AI Tutors in the Math Domain

Recent advances in LLMs have enabled the creation of AI-based conversational tutors that can engage in educational dialogues with students. Easy natural language interaction and the possibility of fine-tuning LLMs (i.e., further training them on data for specific tasks or domains) have fostered the development of these tutors for assisting students across various disciplines [8,9,10,11,12]. Interactive learning through tutoring dialogues between AI tutors and students has the potential to tailor instruction to individual student needs while offering a scalable solution for one-to-one tutoring.
In the field of mathematics, the progress of LLMs in problem-solving has prompted interest in their role as tutoring tools for students. However, despite their remarkable capabilities, studies have shown that LLMs often provide solutions that sound plausible but are mathematically incorrect, especially when precise calculations and multi-step reasoning are required [13]. The responses of LLMs are inherently unreliable when it comes to complex math problem-solving. Mirzadeh et al. [14] found that LLMs exhibit substantial variability when answering different formulations of the same question, highlighting their fragility in mathematical reasoning. Jiang et al. [15] showed, with statistical guarantees, that LLMs’ success in reasoning largely depends on recognizing superficial patterns with strong token bias. This fact raises concerns about their actual reasoning and generalization abilities. The reliance of LLMs on pattern matching, rather than on genuine symbolic reasoning, leads to flaws in mathematical inference [16].
Not only is the correctness of the final solution crucial when helping a student with a math problem but so is the quality of the guidance that fosters effective learning [13]. A few works have addressed this aspect during the development of math tutors. In [17], a multi-agent AI tutoring environment that combines adaptive Socratic agents, dual-memory penalization, GraphRAG textbook retrieval, and Directed Acyclic Graph (DAG)-based course planning was presented. The system prioritizes deep understanding and supporting students in developing independent problem-solving skills over direct answers. The evaluation of this tutor was oriented to measure its ability to guide learning without prematurely revealing answers while still steering the student toward a successful solution. Kestin et al. [18] reported a controlled trial assessing learning outcomes and perceptions of college students when content was delivered through an AI tutor in comparison with an active learning class. The AI tutor was trained with the same pedagogical best practices used in the in-class lessons. The experiment offered empirical evidence of the efficacy of AI-powered pedagogy for enhancing learning outcomes, thus becoming a compelling case for its broad adoption in learning environments.
In line with this research, the present study seeks to determine which features of responses are more pedagogically oriented to guide the ongoing development of AI-based mathematics-specific tutors.

2.2. Pedagogical Aspects of AI Tutors

Although LLMs exhibit friendly natural-language interaction and, to some extent, problem-solving capabilities, these features do not necessarily translate into effective tutoring support for students as important pedagogical challenges also need to be addressed. For example, Gupta et al. [13] concluded that state-of-the-art LLMs like GPT-4 are effective question-answering systems but often not competent as tutors.
The literature identifies several factors that contribute to effective pedagogical responses, with broad agreement on a set of core pedagogical principles. Among these principles is the promotion of active learning, which entails avoiding the immediate disclosure of the correct answer. It has been observed that, when ChatGPT (https://chatgpt.com/) is prompted to tutor a student in the role of a teacher, it directly reveals the solution in 66% of cases and provides incorrect feedback in 59% of cases [6]. Although this aspect quickly emerged in interactions with chatbots, there are more profound theoretical educational issues that must be considered.
A taxonomy unifying the most relevant aspects of a good tutor response is proposed in [3]. Four high-level pedagogical principles were defined in this taxonomy: (1) encourage active learning, (2) adapt to students’ goals and needs, (3) manage cognitive load and enhance metacognitive skills, and (4) foster motivation and stimulate curiosity.
The first of these principles corresponds to the problem mentioned before, the tendency of chatbots to reveal answers too quickly, thereby hindering opportunities for more desirable active learning scenarios [19], such as the one that seems to be effective in [18]. The second of these principles is associated with adapting and/or personalizing instruction to meet students’ needs instead of following pre-defined learning paths or offering content for average profiles. The third principle addresses cognition and metacognition, referring to the management of the amount of information delivered to students and the enhancement of their ability to abstract and generalize beyond individual problems. Ultimately, there are motivational aspects of teaching, such as stimulating curiosity throughout the dialogue. Maurya et al. [3] related this principle to actions like providing actionable knowledge, using an encouraging tone, and behaving like an expert.
The study presented in this work aims to link several dimensions that can be derived from the text in order to examine LLM-generated responses through the lens of these principles.

2.3. Evaluation of Pedagogical Aspects in AI Tutors

Assessing the pedagogical abilities of AI tutors, particularly those powered by LLMs, becomes highly relevant to help in the development of enhanced chatbots designed for this specific task in the education domain. For example, in [20], an approach was proposed for training LLMs to generate tutor utterances that maximize correctness while still promoting adherence to sound pedagogical practices. The authors designed a pedagogical rubric that specifies multiple properties that tutor utterances should adhere to. In [4], open-domain chatbots such as Blender and GPT-3 were evaluated in parallel to human teachers in language and mathematics educational dialogues. Comparing their responses in terms of pedagogical ability, Blender outperformed the actual teacher and GPT-3 in uptake, achieving higher scores on this specific pedagogical dimension.
Frameworks such as MathDial [6] and MathTutorBench [21] were proposed for benchmarking the pedagogical abilities of AI tutors. MathDial contains one-to-one teacher–student tutoring dialogues grounded in multi-step math reasoning problems. Human teachers, acting as domain experts and collaborating with an LLM, were employed for annotation purposes. MathTutorBench is a collection of datasets and metrics designed to holistically evaluate dialogue-based models for mathematics tutoring. This framework evaluates tutors across three categories: subject-matter expertise; student understanding, which involves the possibility of the tutor to verify, identify, and correct student solutions; and teacher response generation, which assesses the scaffolding capabilities of the tutor.
A recent effort to carry out a comprehensive evaluation of AI tutor responses using a set of pedagogically motivated metrics focused on educational dialogues between students and tutors in the mathematical domain took place in the BEA shared task (https://sig-edu.org/sharedtask/2025, accessed on 1 December 2025) [22] in the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025) (https://sig-edu.org/bea/2025, accessed on 1 December 2025). The conversations in the data provided as part of this task are grounded in student mistakes or confusion, with the AI tutor aiming to remediate them. For assessing the pedagogical abilities of AI tutors powered by LLMs, the task organizers released a dataset containing dialogues annotated along several pedagogically motivated dimensions. One of these dimensions corresponds to whether the AI tutor provides guidance to solve the mistake or confusion in an effective way. Regarding this aspect, responses were annotated as Yes/No/To some extent and participating teams approached the problem as a classical multi-class text classification one. Thus, methods such as the use of LLMs, LoRA-based fine-tuning, ensembles, and hybrid architectures were used in conjunction with data manipulation techniques, such as data augmentation and imbalance handling. In fact, the best-performing approach proposed in [23] uses an instruction-tuned Mathstral-7B-v0.1 model and leverages parameter-efficient fine-tuning (LoRA) along with ensemble-based inference. Another solution that performs well in other metrics is the one proposed by [24], which explores in-context learning (ICL), supervised fine-tuning (SFT), and reinforcement learning (RLHF). Following the task goals, the cited works primarily focused on optimizing prediction metrics (e.g., F1 score and accuracy) while leaving aside the analysis of the factors that make a response effective in guiding students toward correcting errors or clarifying misconceptions.
Rather than focusing on prediction optimization, this work seeks to explain the factors that make responses pedagogically meaningful. Feature extraction from textual data using natural language processing (NLP) techniques enables the analysis and comparison of different pieces of text conforming to the answers. Accordingly, this work aims to extract a set of features from the textual responses of tutors in order to better characterize them across pedagogical dimensions identified in the literature, thereby helping to distinguish the features that make responses more valuable in terms of guidance.

3. Materials and Methods

For studying the characteristics of AI-powered tutors, a range of dimensions describing their responses were considered, including linguistic patterns, sentiment, emotions, and writing style, among others. Each of these dimensions is linked to one or more of the principles outlined in the previous section, which are considered essential for good AI tutors. A wide range of features were explored within a dimension, leveraging text processing techniques, lexicons, and pre-trained models to quantify the different aspects starting from the textual data. The different dimensions of analysis were observed in a dataset of math dialogues in which LLM-based tutors, as well as human tutors, try to help a student in solving a mistake or confusion regarding a math problem. The dataset is annotated regarding the quality of the guidance provided by the tutors.
The next subsections outline the dimensions of analysis considered in this study, the features extracted within each dimension, the data used for experimentation, and the methods employed for automatic feature extraction.

3.1. Data Description

The data used in this work was released as part of the BEA 2025 shared task (https://sig-edu.org/sharedtask/2025, accessed on 1 December 2025) [3] on pedagogical ability assessment of AI-powered tutors. It includes annotated tutor responses for 300 dialogues extracted from the MathDial [6] and Bridge [7] datasets. Each dialogue consists of contextual information comprising several prior turns from both the tutor and the student, includes either a mistake or a confusion expressed by the student as well as the last student utterance, and provides a set of responses given by 7 LLM-based tutors and by human tutors (a single tutor in MathDial, and an expert and a novice tutor in Bridge), annotated for their pedagogical quality. The models were prompted to act as expert tutors as described in [3].
The development set provided by the shared task consists of over 2480 tutor responses. The LLM-based answers were generated by the shared task organizers using a set of state-of-the-art LLMs of diverse nature, namely Sonnet [25], Llama318B [26], Llama31405B [26], GPT4 [27], Mistral [28], Gemini [29], and Phi3 [30].
The responses from tutors were annotated, alongside other pedagogical aspects of interest (such as mistake identification, mistake location, and actionability), according to the guidance they provided. The annotation was conducted on a 3-point scale, where No denotes that the particular aspect of the tutor response is bad (e.g., it does not provide useful guidance or reveals the answer too quickly), Yes denotes that it is good (e.g., the response is helpful), and To some extent denotes that the quality of the response in terms of guidance is medium. The majority class in the dataset is Yes, accounting for 56.83% of the responses, while To some extent and No represent 20.31% and 22.86% of the dataset, respectively.
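To make the subsequent feature analyses concrete, the sketch below shows how the annotated responses could be flattened into one row per tutor response. The field names ("conversation_history", "tutor_responses", "annotation", "Providing_Guidance") and the file name are assumptions for illustration only; the exact schema should be checked against the released JSON.

```python
import json
import pandas as pd

# Minimal sketch, assuming a JSON schema with one entry per dialogue; all field
# names below are placeholders to be checked against the released data.
with open("mrbench_devset.json", encoding="utf-8") as f:  # hypothetical file name
    dialogues = json.load(f)

rows = []
for dialogue in dialogues:
    history = dialogue["conversation_history"]            # prior tutor/student turns
    for tutor, entry in dialogue["tutor_responses"].items():
        rows.append({
            "tutor": tutor,                                # e.g., GPT4, Mistral, Expert
            "history": history,
            "response": entry["response"],
            "guidance": entry["annotation"]["Providing_Guidance"],  # Yes / No / To some extent
        })

df = pd.DataFrame(rows)
print(df["guidance"].value_counts(normalize=True))         # should roughly match 56.83/22.86/20.31%
```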
Figure 1 shows the distribution of the three classes Yes/No/To some extent in the responses of the different LLM-based tutors, as well as in the responses of Expert and Novice human tutors. As can be observed in the figure, several of the LLM-powered tutors seem to provide better guidance overall than others, and even better than Expert tutors. Others, such as Phi3 and Gemini, are noticeably inferior in this regard while still achieving performance comparable to that of Novice tutors. It is worth noticing, however, that this performance refers only to the pedagogical aspect being considered, the capacity of providing guidance, and not to the accuracy of the answer in mathematical terms.

3.2. Analyzed Dimensions and Feature Extraction

For examining the responses of tutors, a set of features spanning multiple dimensions was extracted from the text of tutor responses. Thus, the values of these features in relation to the type of guidance provided can be analyzed. The texts were processed to automatically extract scores using both established lexicons and pre-trained models from the NLP field. The dimensions analyzed, the relation with the four principles in the Maurya et al. [3] taxonomy, and the tools used for extracting features within each of them are the following:
  • Readability and lexical richness: these features are aimed at determining the level of difficulty and richness in the vocabulary of a tutor response. The examination of this dimension is grounded in the premise that tutors should provide answers understandable to students, taking into account their academic level and background. It aims to evaluate how the AI tutor performs with respect to the second principle of the taxonomy, namely adaptation to students’ goals and needs. The readability of a piece of text reflects its difficulty in terms of the level of instruction needed to understand it. Different readability formulas are typically applied to texts to quantify this aspect, including the Flesch Reading Ease and Automated Readability Index (ARI), among others. Most readability formulas work by analyzing the length of sentences, syllables, or letters. Shorter sentences and words with fewer letters and syllables are considered simpler and easier to read. Additionally, the diversity of vocabulary employed in texts is a common parameter used as a proxy for establishing the level of language sophistication, starting from the information on the occurrences of unique words. In other words, the more unique words appearing in the text produced by a tutor, the higher the language sophistication it exhibits.
  • Sentiment, emotions, and personality: this dimension explores the human-likeness of responses, as effective tutoring requires that students feel a connection with the tutor, which is more likely when the responses appear human-like rather than robotic [3,7]. It relates most directly to the fourth principle of the taxonomy, namely how the AI tutor fosters motivation and stimulates curiosity. In order to extract features describing the behavior of LLM tutors in this dimension, we leverage several available lexicons and pre-trained models. Sentiment-related scores were extracted with VADER (Valence Aware Dictionary and sEntiment Reasoner) [31], a lexicon- and rule-based sentiment analysis tool. Emotions were obtained with EmoRoBERTa [32], a detection model based on the RoBERTa architecture and designed to identify 28 different emotional categories. Personality traits within the Big Five model (agreeableness, extraversion, openness, conscientiousness, and neuroticism) were also analyzed based on the predictions of a pre-trained model.
  • Coherence and uptake: this aspect highlights the alignment of the response with the ongoing dialogue between tutor and student; i.e., it is intended to determine whether the response effectively functions as a follow-up within the dialogue thread. According to [3,6], high-quality responses from tutors should be logically consistent with the student’s previous communications. Two principles of the taxonomy are associated with this dimension: encouraging active learning and managing cognitive load while enhancing metacognitive skills. To assess this coherence, we measure the similarity between the text of the response and the preceding dialogue, assuming that a semantic relationship should exist between them. Models for assessing semantic textual similarity were employed for this purpose: a general-purpose model as well as a math-specific one. Beyond overlap and similarity, there is a more complex notion of uptake, referring to how a teacher recognizes, interprets, and responds to a student’s contributions during instruction. For example, if a student provides a partial answer, teacher uptake relates to the ability to acknowledge the correct part and extend the explanation or reframe the question to enhance the learning experience. This interaction is fundamental for effective teaching as it reflects the responsiveness of teachers to the student’s process of thinking and understanding. For assessing this important factor in the interaction with tutors, a model specifically trained for predicting uptake in math dialogues was applied [33].
    The exploration of uptake was further complemented by observing the cognitive processes appearing in responses as defined by the Linguistic Inquiry and Word Count (LIWC) [34], which include the presence of words related to insight, causation, discrepancy, differentiation, tentativeness, and certainty.
By considering these three analytical dimensions and in light of the four principles that characterize an effective AI tutor, it becomes possible to refine both RQ1 and RQ2 as follows:
RQ1.1/RQ2.1:
To what extent do LLM-powered tutors adapt readability and lexical level to the targeted students, and how do they compare with human tutors?
RQ1.2/RQ2.2:
Can LLM-based tutors exhibit human-like characteristics that foster motivation and curiosity, and how do they compare with human tutors in this regard?
RQ1.3/RQ2.3:
How do LLM-based tutors manage uptake in dialogue to support active learning and the development of metacognitive skills?

4. Results

The analysis of the aforementioned general dimensions is operationalized through the extraction of a set of features that quantify them. In this section, we present the exploratory analysis conducted considering each dimension.

4.1. Readability and Lexical Richness

In this subsection, research questions RQ1.1 and RQ2.1 are examined, that is, how LLM-powered tutors adapt to students’ needs in terms of readability and lexical features. Readability indicates the ease of reading of a text, which derives from the choice of content, style, structure, and organization [35]. Automated readability assessment enables the quantification of the difficulty with which textual information can be read and understood. Different readability formulas are available to estimate text difficulty. These formulas rely on word and sentence length and produce a numerical estimate of the grade level needed to comprehend the text. Therefore, a set of readability features was used for analyzing the responses of tutors in this regard.
The scores of readability metrics were extracted from the text of the tutor responses using the Textstat library (https://pypi.org/project/textstat/, accessed on 1 December 2025). The formulas used include Flesch Reading Ease, Flesch–Kincaid Grade Level, SMOG Index, Coleman–Liau Index, Automated Readability Index, Linsear Write Formula, Gunning Fog Index, and the Readability Consensus metrics based on all the above tests [36].
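As an illustration, the following minimal sketch computes these scores for a single response; the function names follow Textstat's documented API, and the formula in the docstring is the standard Flesch Reading Ease definition.

```python
import textstat

def readability_features(text: str) -> dict:
    """Readability scores used in the analysis; e.g., Flesch Reading Ease =
    206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "smog_index": textstat.smog_index(text),
        "coleman_liau_index": textstat.coleman_liau_index(text),
        "automated_readability_index": textstat.automated_readability_index(text),
        "linsear_write_formula": textstat.linsear_write_formula(text),
        "gunning_fog": textstat.gunning_fog(text),
        "readability_consensus": textstat.text_standard(text, float_output=True),
    }

print(readability_features("You're on the right track, but the mistake lies in your calculation."))
```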
Figure 2 depicts the scores for the mentioned metrics according to the guidance provided by the responses. Except for Flesch Reading Ease, the higher the score, the more difficult the text; i.e., more instruction is required to understand it. In general terms, it can be observed in the figure that the responses providing good guidance (belonging to the class Yes) are slightly more difficult to read than those that do not (belonging to the class No). Although not large, the differences are statistically significant, with p values < 0.05 in all cases. This result can be attributed to the complexity of the responses produced when tutors provide effective guidance. An answer aimed at offering feedback and building on a student’s contribution is likely to employ more specific vocabulary and thus be more challenging to read and comprehend.
Since responses originate from multiple tutors and readability may vary depending on the text generation mechanism and the data used to train each foundational LLM, it is necessary to examine the individual behavior of tutors in this regard. Figure 3 shows the Flesch Reading Ease, Automated Readability Index, and Readability Consensus scores of each of the seven LLM-based tutors and of the human tutors in relation to the level of instruction needed to understand their responses as determined by each formula. This level is expressed with reference to the US school grade level displayed on the right y-axis of the figures.
The results are consistent with the tendency shown in Figure 2, but variations among tutors can be observed. Based on Flesch Reading Ease, almost all tutors provide answers of higher difficulty (lower score) when good guidance (Yes) is provided. However, Llama31405B, Mistral, and Novice tutors show a greater difference between the scores of the two classes (Yes and No), even requiring one or two additional grades of instruction to understand the response. For example, Llama31405B responses classified as No are understandable starting from the 8th–9th grade, while those classified as Yes require almost college-level reading. Something similar happens with Mistral and GPT4 responses. Contrary to the general tendency, Gemini and Phi3 produced answers that are harder to understand even when they do not provide appropriate guidance. Figure 3b confirms this behavior as measured with the Automated Readability Index. Readability Consensus also reflects this, but the results are smoothed due to the aggregation of all the metrics.
Expert as well as Novice human tutors show a notable difference with respect to LLM-based tutors in this aspect. Their responses, independently of their type (Yes, No, or To some extent), are much easier to understand than those of LLMs. In both cases, at most a 7th-grade level is required according to Flesch Reading Ease, and an even lower grade level according to the two other metrics, to understand the responses. This is a particularly interesting result as it suggests that a higher level of instruction is required to interact with LLMs in this scenario. Thus, prompting strategies have to be deployed to contextualize the answers to the level of students, or tutors need to be fine-tuned to produce easier responses adapted to a given level. Although the students’ level is not indicated in this particular dataset, human tutors produce fairly easy-to-understand responses compared to LLMs for the same dialogues, a key issue in tutoring.
Another aspect of the writing styles of tutors is their lexical richness or lexical diversity. This refers to the range and variety of vocabulary deployed in a text by a writer (in this case, tutors), which is indicative of a wide variety of variables, including writing quality, vocabulary knowledge, and general characteristics of speaker competence [37].
Lexical diversity is usually measured through the type–token ratio (TTR), where the count of unique words is divided by the total number of word occurrences. The Herdan and Dugast indices are also derived from this notion. The Measure of Textual Lexical Diversity (MTLD) and hypergeometric distribution diversity (HDD) correct for the bias introduced when comparing texts of different lengths, thereby being more robust measures of lexical richness than TTR-based ones. In addition, we used hapax legomena, the number of words appearing only once in the text, and Yule’s K, considered to be highly reliable as a text-length-independent measure. The different metrics of lexical richness were extracted using the LexicalRichness Python library, version 0.5.1 (https://pypi.org/project/lexicalrichness/, accessed on 1 December 2025).
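A minimal sketch of the extraction is shown below; attribute and method names follow the LexicalRichness documentation (adjust if your installed version differs), and hapax legomena are counted directly since not every version of the library exposes them.

```python
import re
from collections import Counter
from lexicalrichness import LexicalRichness

def lexical_features(text: str) -> dict:
    # Lexical diversity measures discussed in this subsection.
    lex = LexicalRichness(text)
    tokens = re.findall(r"[a-z']+", text.lower())
    hapax_legomena = sum(1 for count in Counter(tokens).values() if count == 1)
    return {
        "words": lex.words,               # number of tokens
        "unique_terms": lex.terms,        # number of types
        "ttr": lex.ttr,                   # type-token ratio
        "herdan": lex.Herdan,
        "dugast": lex.Dugast,
        "mtld": lex.mtld(threshold=0.72),
        "hdd": lex.hdd(draws=42),
        "yule_k": lex.yulek,              # length-independent diversity
        "hapax_legomena": hapax_legomena,
    }
```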
Figure 4 illustrates the values obtained for the mentioned lexical richness measures. As the metrics are affected by the length of texts, the first image shows the distribution of the number of words in the tutor responses. The difference in this aspect between responses that provide guidance and those that do not is minimal (means of 24.29 and 22.61 words, respectively) although statistically significant (p value < 0.05). It is also possible to observe the presence of numerous outliers: excessively long responses not providing guidance; i.e., many lengthy answers fail to contribute to clarifying the student’s problem.
TTR, the number of unique terms as a function of the text length, shows a higher number of these terms in responses that do not provide good guidance, which would imply richer texts. However, as can be observed in the subsequent images, the results of TTR-derived metrics are inconclusive. HDD and Herdan’s index are in agreement with TTR, while MTLD, Dugast’s index, and hapax legomena showed a different tendency. Yule’s K, as mentioned, is a length-independent measure. In most metrics, then, the lexical richness of responses providing good guidance seems to be higher, probably due to the need for a better explanation of the issue at hand.
Yule’s K scores for individual tutors are shown in Figure 5. Except for Llama31405B, both LLM-based and human tutors exhibit more lexical diversity when providing valuable guidance. A salient behavior is that of Novice tutors, in which the difference in the vocabulary employed according to the type of guidance is one of the highest; i.e., they use distinct vocabulary when trying to explain their responses.
In terms of RQ1 and RQ2, the following conclusions can be drawn from the analysis of readability and lexical richness dimensions:
RQ1:
Readability seems to be inversely related to the quality of the guidance provided. Supportive answers are more difficult to understand, probably because they involve more mathematical terms and require deeper reading and understanding. The results regarding lexical richness, on the other hand, despite showing some evidence that a richer vocabulary is used when better guidance is provided, remain inconclusive.
RQ2:
There is a notable difference in the readability of LLM and human tutors, as can be deduced from the previous results. Both expert and novice tutors’ utterances are more readable and require less instruction to be understood, regardless of whether they provide high-quality guidance or not. This implies that, for the same dialogue, teachers can provide more adapted answers, fostering the student’s comprehension and, consequently, learning. In terms of lexical richness, there is no noticeable difference between the two.
In terms of implications for educational practice, reducing lexical richness during, for example, the introduction of new mathematical notions can minimize cognitive load, thereby reducing potential obstacles at moments when new concepts must be anchored to prior ones. At later stages, such as during familiarization with the notions, contextualization, and related processes, this lexical richness can be consolidated toward the level of mathematical formality expected for the students’ grade level.

4.2. Sentiment, Emotions, and Personality

In this subsection, RQ1.2 and RQ2.2 are examined, with an emphasis on how LLM-powered tutors’ responses manifest sentiment, emotions, and personality traits throughout the dialogue. Sentiment analysis of textual tutor responses aims to assess the human-likeness of AI-powered tutors, particularly in comparison with human tutors. Initially, sentiment orientation is considered, which refers to the task of determining the polarity of a piece of text (here, a tutor response), specifically identifying whether it expresses a positive, negative, or neutral opinion.
Figure 6 shows the sentiment extracted from responses in relation to their level of guidance. VADER provides positive, negative, and neutral scores as ratios for the proportions of text that fall in each category (all summing to 1). A compound score is also computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between −1 (most extreme negative) and +1 (most extreme positive). In this way, the compound metric provides a single unidimensional measure of sentiment for a given sentence, while the positive, negative, and neutral scores are useful for analyzing how sentiment is conveyed or embedded in the rhetoric of the sentence. Within a response, a tutor may embed strongly positive or negative sentiment within varying proportions of neutral text while still conveying a similar overall (compound) sentiment.
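The following sketch illustrates how VADER returns these scores for two tutor response fragments quoted later in this subsection; note how the contrastive but lowers the compound value of the second example.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Two response fragments quoted in this subsection; the contrastive "but"
# in the second one mitigates the positive opening clause.
examples = [
    "Great job! Remember to always double-check your answers for accuracy.",
    "You're on the right track, but the mistake lies in your calculation.",
]
for text in examples:
    scores = analyzer.polarity_scores(text)   # keys: neg, neu, pos, compound
    print(f"{scores['compound']:+.3f}  {scores}  <- {text}")
```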
Contrary to what could have been expected, the responses classified as No, i.e., those not providing appropriate guidance, are the ones transmitting a more positive sentiment and, at the same time, are less neutral. Negative sentiment is low in all types of responses, which is natural as LLMs are trained from texts existing on the Web, which would rarely include negativity in this domain. Moreover, generative AI models tend to be compliant and persuasive, frequently reinforcing the user’s perspective rather than challenging its accuracy.
Compound sentiment, as a consequence of this, conveys a more positive sentiment in responses that do not provide good guidance. Observing some of the tutor responses providing guidance (classified as Yes) with a low positive sentiment score, we can see that they include an indication of mistakes; for example: “I see what you did there, but there seems to be a problem” or “You’re on the right track, but the mistake lies in your calculation…”. The conjunction but present in the examples typically introduces a contrast or limitation that mitigates or even contradicts what was stated before. Even when the first part of a sentence is positive, the word but signals to the reader that a limitation or critique follows. In the analysis of sentiment, the initial clause is positive, yet the use of but introduces a negative evaluation that diminishes the overall positivity of the statement. On the other hand, responses that do not provide guidance likely omit pointing out the mistake, thereby preserving only the positive aspect of the statement. Adding to this, some examples that help to clarify this issue are included in the dataset along with notes left by the annotators explaining why they do not consider the response to offer correct guidance. The answer “Great job! Remember to always double-check your answers for accuracy. Would you like to try another problem or review any concepts?” might carry a positive sentiment, but the annotator considers that “The assistant’s role here is to encourage and guide the student, not to provide direct answers”. Likewise, the response “Great job recalling the subtraction process; let’s now focus on understanding the concept behind it”, also a positive-like response, is classified as No as “The assistant’s role here is to acknowledge effort of the student and guide them towards deeper comprehension”. To sum up, it can be inferred that positive sentiment is more related to encouragement than to providing effective guidance. In fact, all responses are in a highly positive tone, which is only diminished when a mistake is recognized by tutors and some advice is given to further work on the problem.
Beyond polarity, Figure 7 shows the presence of different emotions in the text of responses. The first two images correspond to the frequency of positive and negative emotions extracted based on the LIWC lexicon. In line with the previous results, positive emotions appear more frequently in responses not providing guidance than in those that effectively do. While negative emotions rarely appear in texts, they show up slightly more often in responses classified as No. In fact, when particular emotions are observed using the 28 emotions identified by EmoRoBERTa, only some positive emotions like surprise, excitement, realization, admiration, and optimism emerged, while a negative emotion such as disapproval appears equally in all types of responses. Examples of responses with positive emotion but failing to provide adequate guidance are “That’s a great catch! It’s important to always double-check that we’re using the right numbers in our calculations” (high score for surprise) or “You did a great job of recognizing that we need to use multiplication to answer this question! Can you summarize what you need to multiply together?” (high score for recognition). Instead, disapproval is exhibited in an answer such as “I can see where you’re trying to break down the problem into steps, but there seems to be a misunderstanding”, which is expected given the choice of wording. Optimism shows an interesting result as the median is higher in the answers that do not provide adequate guidance. Again, this reinforces the idea that such answers have the goal of cheering up the student, transmitting optimism, rather than pointing out mistakes and helping the student to work through the problem.
Personality is one of the primary factors that influence human behavior as it can moderate how people react, behave, and interact with other individuals. Thus, it becomes relevant to determine the human-likeness of AI tutors in relation to the personality traits they exhibit. The commonly used Big Five personality traits divide personality into five dimensions: agreeableness (refers to being sympathetic, cooperative, and helpful towards others), extraversion (refers to being outgoing, friendly, assertive, and energetic), openness (refers to being curious, intelligent, and imaginative), conscientiousness (refers to being organized, persevering, disciplined, achievement-oriented, and responsible), and neuroticism (refers to being anxious, insecure, moody, and sensitive).
A model trained to assess the Big Five personality traits revealed in a given text/sentence (https://huggingface.co/KevSun/Personality_LM, accessed on 1 December 2025) was used for extracting features modeling this aspect of tutors. The model was fine-tuned using a specially curated dataset tailored for personality traits, enabling it to learn associations between specific textual inputs and distinct personality characteristics [38].
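A sketch of how both pre-trained models can be queried through Hugging Face pipelines is shown below. The emotion checkpoint is assumed to be the publicly available arpanghoshal/EmoRoBERTa, and the exact output label names and their ordering for the personality model should be verified against its model card.

```python
from transformers import pipeline

# Emotion (28 GoEmotions-style labels) and Big Five personality classifiers.
# Checkpoint names and label sets are assumptions to be checked on the Hub.
emotion_clf = pipeline("text-classification", model="arpanghoshal/EmoRoBERTa")
personality_clf = pipeline("text-classification", model="KevSun/Personality_LM")

response = ("That's a great catch! It's important to always double-check "
            "that we're using the right numbers in our calculations.")
emotions = emotion_clf(response, top_k=None)              # all {label, score} pairs
print(sorted(emotions, key=lambda d: -d["score"])[:3])    # top-3 emotions
print(personality_clf(response, top_k=None))              # Big Five trait scores
```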
Figure 8 depicts the personality traits shown by tutors according to the type of responses they provide. Agreeableness and openness seem to be more prevalent in texts providing guidance, while extraversion is higher in those that do not. Again, extraversion can be related to cheering up students more than actually guiding them. Neuroticism is equally present regardless of the type of guidance. Agreeableness, which refers to being cooperative and helpful towards others, is more present in texts providing guidance to some extent. The openness trait, which is related to being curious, intelligent, and imaginative, appears considerably less in those answers that do not provide guidance.
Rather than focusing solely on the relationship between different personality traits and the guidance provided, it is worth exploring the traits exhibited by LLM-based tutors in comparison with those of human tutors. Figure 9 shows the scores of agreeableness and extraversion for the seven LLM-based tutors as well as for the expert and novice tutors. As is clear from the figure, human tutors are more extraverted, particularly novice ones, whereas LLMs have similar levels of extraversion among themselves. On the other hand, human tutors are less pleasing than LLM-based ones, as can be inferred from the levels of agreeableness. Novice tutors exhibit the most salient behavior in both traits. However, the number of answers from novice tutors in the dataset is lower, which prevents a direct comparison.
To answer research questions RQ1 and RQ2 based on the analysis of the sentiment, emotions, and personality-related features, it is possible to say that
RQ1:
Sentiment and emotion analysis offers a perspective on the human-likeness of AI-powered tutors. From the obtained results, it can be deduced that AI tutors tend to be positive in their interventions in every case. However, the need to indicate errors and offer adequate guidance drives them toward less positive writing. Likewise, emotions are generally positive, although they are less intense in responses that provide good guidance. This is also reflected in personality traits: responses that do not provide guidance display lower agreeableness and, more prominently, lower openness.
RQ2:
Personality traits extracted from the responses of LLM-based tutors differ from those obtained from human tutors. LLMs exhibit similar scores across the two mentioned traits, whereas human tutors tend to be less agreeable and more extraverted in their responses. A possible explanation is that LLMs are trained and aligned to be user-pleasing, which may account for the consistency observed in their personality trait profiles. In contrast, the responses of human tutors tend to convey more authentic human-like personality features, distinguishing them from the more uniform LLM-generated responses.
The teaching and learning of mathematics do not refer solely to the construction or reconstruction of knowledge. Emotional states such as frustration, uncertainty, or other negative emotions affect students’ motivation and willingness to learn. This situation becomes more pronounced at the tertiary or higher-education level, particularly among students who do not choose mathematics-related degree programs. Being able to identify these states is key to the actions and decisions of mathematics teachers.

4.3. Coherence and Uptake

In this final subsection, research questions RQ1.3 and RQ2.3 are examined, that is, the degree to which coherence and uptake are present in LLM-powered dialogues, considering their potential contribution to reducing cognitive load and fostering the development of metacognitive skills. Coherence is an important factor mentioned among the pedagogical aspects to be observed in good tutor responses. According to [6], coherence is shown when the response naturally follows up on the previous utterance and context and has no logical conflicts with such context. Aligned with this idea, ref. [3] postulates that a high-quality tutor’s response should be logically consistent with the student’s previous responses.
A first approach to measuring the coherence of a response with respect to the previous dialogue is assessing their similarity since some degree of repetition with respect to the last utterance can be assumed. In this work, the semantic textual similarity between the dialogue and the tutor response is used as a metric to analyze this aspect of responses. Sentence similarity, or semantic textual similarity, is a measure of how similar two pieces of text are, or to what extent they express the same meaning. Sentence similarity models convert input texts into vectors (embeddings) that capture semantic information and calculate how close (similar) they are to each other. SentenceTransformers [39] provides a framework for using state-of-the-art pre-trained semantic similarity models. One of the best-performing models for text similarity is all-mpnet-base-v2, so this model was used to compute the similarity of all dialogue–response pairs. Since the texts are math-related, a model specifically designed for computing similarities of short mathematical texts, Bert-MLM_arXiv-MP-class_zbMath [40], was also considered.
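A minimal sketch of the similarity computation with SentenceTransformers follows, using a made-up dialogue fragment for illustration; the math-specific checkpoint can be swapped in the same way if it is available in SentenceTransformers format.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")   # general-purpose similarity model

# Made-up dialogue fragment and tutor response, for illustration only.
dialogue = ("Tutor: How many stickers does Ana have left after giving 12 away? "
            "Student: I added 42 and 12 and got 54, so she has 54 left.")
response = ("You're close, but giving stickers away means subtracting, not adding; "
            "try 42 minus 12.")

emb_dialogue, emb_response = model.encode([dialogue, response], convert_to_tensor=True)
coherence = util.cos_sim(emb_dialogue, emb_response).item()   # cosine similarity in [-1, 1]
print(f"dialogue-response similarity: {coherence:.3f}")
```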
Beyond similarity, uptake in conversation takes place when a speaker builds upon, acknowledges, repeats, or reformulates a previous contribution. In education, high uptake has been defined as cases where the teacher follows up on the student’s contribution via a question or elaboration [33]. It serves several linguistic and social functions, such as creating coherence between two utterances (which helps to structure the discourse), serving as a mechanism for grounding (i.e., demonstrating understanding of the interlocutor’s contribution by accepting it as part of the common ground), and promoting collaboration. Uptake encompasses more strategies than simple similarity, including asking questions, providing elaborations, repetition, collaborative completion, and acknowledgment. In [33], a model (https://github.com/ddemszky/conversational-uptake, accessed on 1 December 2025) was trained for next-utterance classification by recruiting domain experts (math teachers and raters trained in classroom observation) to annotate teachers’ uptake of student contributions.
Figure 10 shows the scores of text similarity, math similarity, and uptake calculated using the models mentioned above. The three metrics indicate that responses lacking guidance are less similar to the preceding dialogue, a difference that becomes even more pronounced when mathematical content is taken into account in the similarity computation. More saliently, uptake is considerably lower in such responses. Answers that provide guidance, as well as those that do so to some extent, exhibit similarity with the question-answering thread and show strong uptake, following up on the utterance in the previous student turn.
Observing the uptake scores of the seven LLM-based tutors as well as those of human tutors individually, shown in Figure 11, it is possible to see some notable differences between human tutors and most LLMs. Most LLM tutors demonstrate rather constant uptake in their responses, slightly lower for answers not providing guidance in the case of Sonnet and Gemini, but in all cases at a high level. Conversely, human tutors and Phi3 show a significant gap in uptake. In the responses of Expert and Novice tutors, the difference is clear, denoting that uptake is not always present, and, in such cases, the answers are not useful. In this regard, Phi3 behaves similarly to human tutors, only showing uptake in some responses.
Focusing on the presence of cognitive mechanisms operationalized through the LIWC lexicon, it is possible to infer some form of analytical thinking involved in the tutor responses. The cognitive categories that can be extracted from words using this lexicon are insight (think, realize, know, and consider), causation (because, effect, and based), discrepancy (should, would, hope, and lack), tentativeness (maybe, perhaps, and wonder), certainty (always, never, and clearly), and differentiation (despite, although, except, but, and else).
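Since LIWC itself is a licensed lexicon, the sketch below uses only the illustrative category words listed above as a toy stand-in; real scores require the official LIWC dictionaries.

```python
import re

# Toy stand-in for the LIWC cognitive-process categories: the word lists are
# the examples mentioned in the text, not the full (licensed) dictionaries.
COGNITIVE_CATEGORIES = {
    "insight": {"think", "realize", "know", "consider"},
    "causation": {"because", "effect", "based"},
    "discrepancy": {"should", "would", "hope", "lack"},
    "tentativeness": {"maybe", "perhaps", "wonder"},
    "certainty": {"always", "never", "clearly"},
    "differentiation": {"despite", "although", "except", "but", "else"},
}

def cognitive_profile(text: str) -> dict:
    """Relative frequency (per 100 words) of each cognitive category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = max(len(tokens), 1)
    return {
        category: 100 * sum(token in words for token in tokens) / total
        for category, words in COGNITIVE_CATEGORIES.items()
    }

print(cognitive_profile("You should have done her commission subtract her personal need amount."))
```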
Figure 12 depicts the values obtained for five cognitive processing categories of interest in this analysis. With respect to the insight category, it is possible to observe that the mean values are similar for all types of guidance. It could have been expected that answers providing guidance would reinforce this type of cognitive process. However, there are also examples of responses containing such words that do not provide appropriate guidance. Examples are answers such as “Your answer is a little bit off. Please recheck your answer.”, which is classified as not providing guidance, or “To solve a math problem effectively, first understand the question, identify the relevant information, choose the appropriate method, and then carefully work through the steps to find the solution.”, which is classified as guidance to some extent.
The differentiation category captures the use of words expressing contrast or distinction between ideas, which appears more prominently in responses that provide guidance, even when such guidance is only partial. A subtle variation appears in the category discrepancy, which reveals a gap between what is and what should be. An example of positive guidance expressing discrepancy is “You should have done her commission subtract her personal need amount”. Once again, this category appears more prominently in texts that provide some form of guidance.
Lastly, the two remaining LIWC categories concerning cognitive processes are tentativeness and certainty. The former is associated with expressions of uncertainty, whereas the latter reflects confidence, conviction, and assured knowledge. Interestingly, in all cases, tutors respond with the same levels of certitude, which is relevant because LLMs are designed to generate responses that sound confident and convincing even when the underlying knowledge may be incomplete or uncertain. Conversely, responses offering meaningful guidance incorporate a certain level of uncertainty, which may encourage the student to critically evaluate and revise their work.
To answer research questions RQ1 and RQ2 based on the analysis of the similarity, uptake, and cognitive processes appearing in the responses of tutors, it is possible to say that
RQ1:
Coherence, denoted both by the similarity between the response and the ongoing dialogue (textual and math-related) and by the uptake of the response, is a clear differential factor of answers providing guidance. Specifically, uptake appears to be a robust marker of answers that reflect attention to this pedagogical dimension. In addition, the appearance of differentiation and discrepancy, and to some extent tentativeness, in tutor responses is indicative of answers intended to help students.
RQ2:
Human tutors also have distinguishable behavior in terms of uptake with respect to LLM-based tutors as they are more volatile in this aspect, showing greater variability and providing poor guidance when uptake is absent.
Regarding coherence and uptake, the alignment of responses in the dialogue between students and the tutor has significant implications for educational practice. In a mathematics classroom, teachers must continuously incorporate students’ prior knowledge. In fact, didactic theories specific to the field of mathematics, such as the theory of didactic situations, take students’ prior knowledge as the starting point for the new knowledge to be taught. Thus, uptake becomes imperative in the process of building such knowledge.

5. Conclusions

The rapid advancement of LLMs, which enable seamless interaction through natural language, has driven the development of AI-driven tutoring systems designed to increase student engagement and improve learning outcomes. In the field of mathematics education, two main risks associated with LLM-based tutors need to be taken into account when engaging with actual students. The first issue to consider is associated with the probabilistic nature of LLMs and the likelihood of providing incorrect answers from a mathematical point of view. The second issue relates to the pedagogical capabilities of AI tutors powered by standard LLMs. The pedagogical dimension of tutor–student interactions is crucial for fostering meaningful learning experiences, particularly in a subject like mathematics, where students often encounter significant challenges. Despite its importance, there is a gap between AI capabilities and pedagogical needs. As pointed out by [41], this gap is usually exacerbated by the limited collaboration between AI developers and educators, resulting in tools that are technically advanced but educationally misaligned.
This work leverages NLP techniques to characterize AI-powered tutor responses in terms of a specific pedagogical aspect: the guidance provided in math dialogues for solving problems or addressing student confusion. Among these characteristics are the readability and lexical richness of responses, human-like features such as the sentiment, emotions, and personality traits that they convey, and the level of coherence and uptake that they are capable of while interacting with students. Some insights from this study also make it possible to observe differences in the performance of LLM-based and human tutors when responding to the same dialogues.
The contributions of this work include insights into the characteristics of texts generated by LLMs within the context of mathematical dialogues in relation to the principles identified in the literature as relevant for AI tutors exhibiting sound pedagogical practices. The findings of this study highlight the importance of uptake as a means of achieving appropriate guidance so that mathematical knowledge can be gradually constructed in line with teacher support. It was also observed that LLMs often produce responses that do not match the student’s level of understanding, tending to be more complex than students can comfortably read. This underscores the uniquely human ability to tailor explanations to a learner’s needs and emphasizes the importance of replicating this capacity in the development of chatbots. The human-like features in the responses did not seem strongly connected to the level of guidance offered; instead, they were more closely tied to motivational effects, even when the guidance itself was not particularly effective.
This research provides key findings about both the strengths and weaknesses of LLM-based tutors as a starting point for the development of more pedagogically oriented tutors. The results have direct practical implications for the design of AI-powered tutors that, while still based on current state-of-the-art LLMs, integrate pedagogical considerations more thoroughly. The study also has limitations that constrain the generalization of its conclusions. First, the findings depend heavily on the data chosen for experimentation. Second, the study is centered on mathematics dialogues; although only a few recent benchmarks have been developed for this purpose, owing to the difficulty of collecting and annotating dialogue data, future work will aim to validate the identified patterns in other domains and educational fields. In this way, the results can contribute to bridging the gap between LLM-based chatbot development and educational practice.

Author Contributions

Conceptualization, V.P., A.C. and D.G.; methodology, V.P., A.C. and D.G.; software, D.G.; validation, V.P., A.C. and D.G.; formal analysis, V.P., A.C. and D.G.; investigation, V.P., A.C. and D.G.; resources, V.P., A.C. and D.G.; data curation, D.G.; writing—original draft preparation, V.P., A.C. and D.G.; writing—review and editing, V.P., A.C. and D.G.; visualization, V.P., A.C. and D.G.; supervision, V.P., A.C. and D.G.; project administration, V.P., A.C. and D.G.; funding acquisition, V.P., A.C. and D.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) under grant PIP 2023-2025 No. 11220220100429CO.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in BEA 2025 Shared Task at https://sig-edu.org/sharedtask/2025 (accessed on 1 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Parra, V.; Sureda, P.; Corica, A.; Schiaffino, S.; Godoy, D. Can Generative AI Solve Geometry Problems? Strengths and Weaknesses of LLMs for Geometric Reasoning in Spanish. Int. J. Interact. Multimed. Artif. Intell. 2024, 8, 65–74. [Google Scholar] [CrossRef]
  2. Pal Chowdhury, S.; Zouhar, V.; Sachan, M. AutoTutor meets Large Language Models: A Language Model Tutor with Rich Pedagogy and Guardrails. In Proceedings of the 11th ACM Conference on Learning @ Scale (L@S’24), Atlanta, GA, USA, 18–20 July 2024; pp. 5–15. [Google Scholar] [CrossRef]
  3. Maurya, K.K.; Srivatsa, K.A.; Petukhova, K.; Kochmar, E. Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, Albuquerque, NM, USA, 29 April–4 May 2025; pp. 1234–1251. [Google Scholar]
  4. Tack, A.; Piech, C. The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues. arXiv 2022, arXiv:2205.07540. [Google Scholar] [CrossRef]
  5. Tack, A.; Kochmar, E.; Yuan, Z.; Bibauw, S.; Piech, C. The BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), Toronto, ON, Canada, 13 July 2023; pp. 785–795. [Google Scholar]
  6. Macina, J.; Daheim, N.; Chowdhury, S.P.; Sinha, T.; Kapur, M.; Gurevych, I.; Sachan, M. MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 5602–5621. [Google Scholar] [CrossRef]
  7. Wang, R.; Zhang, Q.; Robinson, C.; Loeb, S.; Demszky, D. Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; pp. 2174–2199. [Google Scholar]
  8. Caccavale, F.; Gargalo, C.L.; Gernaey, K.V.; Krühne, U. Towards Education 4.0: The role of Large Language Models as virtual tutors in chemical engineering. Educ. Chem. Eng. 2024, 49, 1–11. [Google Scholar] [CrossRef]
  9. Han, J.; Yoo, H.; Myung, J.; Kim, M.; Lim, H.; Kim, Y.; Lee, T.Y.; Hong, H.; Kim, J.; Ahn, S.Y.; et al. LLM-as-a-tutor in EFL Writing Education: Focusing on Evaluation of Student-LLM Interaction. In Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U), Miami, FL, USA, 16 November 2024; pp. 284–293. [Google Scholar] [CrossRef]
  10. Rajala, J.; Hukkanen, J.; Hartikainen, M.; Niemelä, P. Call me Kiran—ChatGPT as a Tutoring Chatbot in a Computer Science Course. In Proceedings of the 26th International Academic Mindtrek Conference (Mindtrek ’23), Tampere, Finland, 3–6 October 2023; pp. 83–94. [Google Scholar] [CrossRef]
  11. Kahl, S.; Löffler, F.; Maciol, M.; Ridder, F.; Schmitz, M.; Spanagel, J.; Wienkamp, J.; Burgahn, C.; Schilling, M. Evaluating the Impact of Advanced LLM Techniques on AI Lecture Tutors for Robotics Course. In AI in Education and Educational Research; Springer: Cham, Switzerland, 2025; pp. 149–160. [Google Scholar]
  12. Lieb, A.; Goel, T. Student Interaction with NewtBot: An LLM-as-tutor Chatbot for Secondary Physics Education. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA ’24), Honolulu, HI, USA, 11–16 May 2024. [Google Scholar] [CrossRef]
  13. Gupta, A.; Reddig, J.; Calò, T.; Weitekamp, D.; MacLellan, C.J. Beyond Final Answers: Evaluating Large Language Models for Math Tutoring. In Artificial Intelligence in Education; Springer: Cham, Switzerland, 2025; pp. 323–337. [Google Scholar]
  14. Mirzadeh, S.I.; Alizadeh, K.; Shahrokhi, H.; Tuzel, O.; Bengio, S.; Farajtabar, M. GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. In Proceedings of the 13th International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
  15. Jiang, B.; Xie, Y.; Hao, Z.; Wang, X.; Mallick, T.; Su, W.J.; Taylor, C.J.; Roth, D. A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 4722–4756. [Google Scholar] [CrossRef]
  16. Shojaee, P.; Mirzadeh, I.; Alizadeh, K.; Horton, M.; Bengio, S.; Farajtabar, M. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. arXiv 2025, arXiv:2506.06941. [Google Scholar] [CrossRef]
  17. Chudziak, J.A.; Kostka, A. AI-Powered Math Tutoring: Platform for Personalized and Adaptive Education. In Proceedings of the 26th International Conference on Artificial Intelligence in Education (AIED 2025), Palermo, Italy, 22–26 July 2025; pp. 462–469. [Google Scholar]
  18. Kestin, G.; Miller, K.; Klales, A.; Milbourne, T.; Ponti, G. AI tutoring outperforms in-class active learning: An RCT introducing a novel research-based design in an authentic educational setting. Sci. Rep. 2025, 15, 17458. [Google Scholar] [CrossRef] [PubMed]
  19. Chi, M.T.H.; Wylie, R. The ICAP Framework: Linking Cognitive Engagement to Active Learning Outcomes. Educ. Psychol. 2014, 49, 219–243. [Google Scholar] [CrossRef]
  20. Scarlatos, A.; Liu, N.; Lee, J.; Baraniuk, R.; Lan, A. Training LLM-Based Tutors to Improve Student Learning Outcomes in Dialogues. In Proceedings of the Artificial Intelligence in Education, Palermo, Italy, 22–26 July 2025; pp. 251–266. [Google Scholar]
  21. Macina, J.; Daheim, N.; Hakimi, I.; Kapur, M.; Gurevych, I.; Sachan, M. MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors. arXiv 2025, arXiv:2502.18940. [Google Scholar]
  22. Kochmar, E.; Maurya, K.; Petukhova, K.; Srivatsa, K.A.; Tack, A.; Vasselli, J. Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), Vienna, Austria, 31 July–1 August 2025; pp. 1011–1033. [Google Scholar] [CrossRef]
  23. Hikal, B.; Basem, M.; Oshallah, I.; Hamdi, A. MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), Vienna, Austria, 31 July–1 August 2025; pp. 1194–1202. [Google Scholar] [CrossRef]
  24. An, J.; Fu, X.; Liu, B.; Zong, X.; Kong, C.; Liu, S.; Wang, S.; Liu, Z.; Yang, L.; Fan, H.; et al. BLCU-ICALL at BEA 2025 Shared Task: Multi-Strategy Evaluation of AI Tutors. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), Vienna, Austria, 31 July–1 August 2025; pp. 1084–1097. [Google Scholar] [CrossRef]
  25. Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. Available online: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf (accessed on 1 December 2025).
  26. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  27. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774. [Google Scholar]
  28. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
  29. Gemini Team; Georgiev, P.; Lei, V.I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530. [Google Scholar] [CrossRef]
  30. Abdin, M.; Aneja, J.; Awadalla, H.; Awadallah, A.; Awan, A.A.; Bach, N.; Bahree, A.; Bakhtiari, A.; Bao, J.; Behl, H.; et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv 2024, arXiv:2404.14219. [Google Scholar] [CrossRef]
  31. Hutto, C.J.; Gilbert, E. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the 8th International Conference on Weblogs and Social Media (ICWSM-14), Ann Arbor, MI, USA, 1–4 June 2014. [Google Scholar]
  32. Kamath, R.; Ghoshal, A.; Eswaran, S.; Honnavalli, P. An Enhanced Context-based Emotion Detection Model using RoBERTa. In Proceedings of the 2022 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), Bangalore, India, 8–10 July 2022; pp. 1–6. [Google Scholar] [CrossRef]
  33. Demszky, D.; Liu, J.; Mancenido, Z.; Cohen, J.; Hill, H.; Jurafsky, D.; Hashimoto, T. Measuring Conversational Uptake: A Case Study on Student-Teacher Interactions. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 1638–1653. [Google Scholar] [CrossRef]
  34. Pennebaker, J.W.; Francis, M.E.; Booth, R.J. Linguistic Inquiry and Word Count: LIWC 2001; Lawrence Erlbaum Associates: Mahwah, NJ, USA, 2001. [Google Scholar]
  35. Dubay, W. Smart Language: Readers, Readability, and the Grading of Text; Impact Information: Costa Mesa, CA, USA, 2009. [Google Scholar]
  36. Collins-Thompson, K. Computational assessment of text readability: A survey of current and future research. ITL—Int. J. Appl. Linguist. 2014, 165, 97–135. [Google Scholar] [CrossRef]
  37. McCarthy, P.M.; Jarvis, S. vocd: A theoretical and empirical evaluation. Lang. Test. 2007, 24, 459–488. [Google Scholar] [CrossRef]
  38. Wang, R.; Sun, K. Continuous Output Personality Detection Models via Mixed Strategy Training. arXiv 2024, arXiv:2406.16223. [Google Scholar] [CrossRef]
  39. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019. [Google Scholar]
  40. Steinfeldt, C.; Mihaljević, H. Evaluation and Domain Adaptation of Similarity Models for Short Mathematical Texts. In Proceedings of the 17th International Conference on Intelligent Computer Mathematics (CICM 2024), Montreal, QC, Canada, 5–9 August 2024; pp. 241–260. [Google Scholar] [CrossRef]
  41. Sajja, R.; Sermet, Y.; Cwiertny, D.; Demir, I. Integrating AI and Learning Analytics for Data-Driven Pedagogical Decisions and Personalized Interventions in Education. Technol. Knowl. Learn. 2025; online first. [Google Scholar] [CrossRef]
Figure 1. Distribution of guidance classes in the responses of tutors.
Figure 2. Readability metrics for tutor responses in relation to the guidance provided.
Figure 3. Readability scores of individual tutor responses.
Figure 4. Lexical richness metrics for tutor responses in relation to the guidance provided.
Figure 5. Lexical richness of tutors according to Yule’s K.
Figure 6. Sentiment analysis of tutor responses in relation to the guidance provided.
Figure 7. Emotion scores of tutor responses in relation to the guidance provided.
Figure 8. Personality traits of tutor responses in relation to the guidance provided.
Figure 9. Comparison regarding agreeableness and extraversion of LLM-based and human tutors.
Figure 10. Distribution of similarities and uptake scores.
Figure 11. Uptake scores for the individual tutors.
Figure 12. Cognitive processes appearing in tutor responses in relation to the guidance provided.