1. Introduction
The convergence of financial education and artificial intelligence (AI) represents one of the most complex, and at the same time most promising, frontiers in contemporary education. In the context of accelerated digital transformation and growing socioeconomic inequality, the urgency to strengthen financial competencies has intensified, particularly in Latin America, where substantial gaps persist in access to financial services and to high-quality economic education (Dash & Mohanta, 2025).
The COVID-19 pandemic acted as a catalyst for this shift, accelerating the adoption of digital technologies across both financial and educational ecosystems while exposing structural limitations in traditional approaches to financial literacy. This disruption reshaped the research and practice agenda, generating long-term implications for policy design, teacher preparation, and the technological architecture of learning solutions.
This challenge aligns directly with the 2030 Agenda for Sustainable Development, especially SDG 4 (inclusive and quality education) and SDG 8 (decent work and economic growth). Within this framework, financial education is no longer a peripheral skill set; it becomes a core enabler of productive inclusion, sustainable entrepreneurship, and social mobility, reinforcing the link between technology, human capabilities, and development.
In Latin America and the Caribbean, AI's transformative potential to advance productive, inclusive, and sustainable development models has been recognized, alongside warnings that ethical and governance issues must be addressed decisively. Complementarily, the Latin American AI Index (ILIA, 2024), developed in collaboration with CENIA, documents deep inequalities in regional readiness to adopt AI: while Chile, Brazil, and Uruguay lead in research, development, and governance, other countries lag behind, constraining both the scalability and the equity of data-driven educational innovation.
Globally, AI presents a critical paradox: it can democratize access to financial education through adaptive systems, intelligent tutoring, and personalized content, yet it also introduces risks related to equity, transparency, explainability, and digital inclusion (Buczynski et al., 2025). In Latin America, these tensions are amplified by structural barriers (unequal connectivity, heterogeneous institutional capacities, and data asymmetries) that hinder ethical and effective implementation, especially in vulnerable communities.
Gender gaps constitute a particularly sensitive axis within this debate. Gisselbaek et al. (2025) document biases in AI-generated representations of financial leadership, reproducing pre-existing inequalities and shaping expectations, aspirations, and perceived legitimacy. Consequently, any technological intervention in financial education requires inclusive design criteria, bias audits, and accountability mechanisms, not merely as normative commitments but as technical requirements to prevent adverse outcomes.
In academic terms, digital financial literacy has evolved from basic, instrument-focused notions toward more complex models integrating technical, critical, and contextual skills (De La Rosa & Bechler, 2024; H. Zhu et al., 2024). In parallel, research in AI in education has broadened its focus to encompass the ethical, pedagogical, and social implications of automation, particularly in sensitive domains such as finance, where decisions and behaviors can be strongly shaped by interfaces, recommendations, and digitally engineered frictions (Tóth & Blut, 2024).
Over the last five years, the field has also progressed through identifiable phases. Early work emphasized technical applications such as fraud detection, scoring, and the automation of financial processes (Pattnaik et al., 2024; Odufisan et al., 2025). Subsequently, educational approaches gained prominence, aiming to personalize financial learning and adapt it to culturally diverse contexts, with greater attention to instructional design, human–AI interaction, and contextual appropriateness (Akhtar et al., 2024). More recently, scholarship has increasingly highlighted behavioral risks and unintended effects, including the reduction in the “pain of paying” and the encouragement of impulsive consumption, underscoring the urgency of ethical regulation and oversight that is not merely declarative but operational and measurable within deployed systems (De La Rosa & Bechler, 2024).
In higher education, studies have examined organizational and formative effects associated with AI use. Zhou et al. (2025) investigate how AI moderates the relationship between work engagement and teachers’ innovative behavior, offering insights into the environments where financial educators are trained and where technology-mediated teaching strategies are implemented. In vocational education, research likewise emphasizes AI’s potential to enhance equity and accessibility, contributing to SDG 4 and SDG 8 (Prasetya et al., 2025).
Within the economic-financial domain, evidence points to mixed outcomes that demand context-sensitive interpretations. Liu (2025) reports improvements in household asset allocation, employment quality, and economic development; however, Zhang and Piao (2025) caution that AI’s effects on corporate financing are complex and contingent on shifts in labor productivity, calling for more contextual analytical frameworks to avoid unwarranted generalizations. From a sustainability perspective, Tao (2024) suggests that AI can increase green productivity, while the high energy consumption of AI systems raises environmental dilemmas with clear technical dimensions (infrastructure, computational efficiency, energy footprint).
Financial inclusion from a gender perspective adds a further critical social layer. Medina-Vidal et al. (2025) propose credit-risk analysis models that incorporate gender variables, particularly relevant for vulnerable populations, raising questions about how such approaches can be integrated into educational processes without reinforcing stigma and how analytical models can be translated into meaningful and fair learning experiences.
Despite rapid growth, a substantial knowledge gap remains: the literature is fragmented, and there is no integrative synthesis unifying findings on the effectiveness of AI tools in financial education, nor sufficiently robust theoretical frameworks that articulate pedagogical principles with technical capabilities in financial contexts. Moreover, studies examining the ethical and social implications of automation in financially vulnerable settings remain limited, and longitudinal evidence capable of assessing sustained effects over time is scarce. At this point, the methodological evolution of systematic reviews becomes particularly relevant: the field has moved from descriptive trend mapping and thematic aggregation toward computational approaches—such as natural language processing—to analyze scholarly discourse, detect sentiment, and estimate polarity (e.g., optimistic, critical, or ambivalent narratives), thereby strengthening the capacity to interpret not only what is studied but also how evidence is argued and with what interpretive biases.
Against this background, the proposed systematic review critically examines research on the integration of AI into financial education between 2020 and 2025. Although AI applications in finance predate 2020, this time window was selected because it captures a distinct acceleration phase marked by the convergence of three factors: (i) the COVID-19 pandemic, which intensified the digitalization of teaching and learning processes; (ii) the rapid expansion of AI-enabled educational and financial technologies in real-world settings; and (iii) the increasing visibility of ethical, governance, and inclusion concerns associated with AI deployment. Focusing on this period therefore allows the review to analyze the most recent and policy-relevant evidence produced during a major technological and educational transition. The review aims to map applications, identify theoretical foundations, assess reported evidence on educational use, characterize technical and ethical implications, and detect methodological and thematic gaps. In addition, it incorporates natural language processing to analyze sentiment and polarity in the literature, enabling a more nuanced reading of academic positioning and tensions.
3. Materials and Methods
3.1. Study Design and Conceptual Framing
This research was designed as a systematic literature review structured under PRISMA 2020, with the purpose of reconstructing—rigorously, transparently, and reproducibly—the state of the art on the integration of artificial intelligence in financial education. The adoption of a PRISMA-based design was treated not as a stylistic decision but as a methodological requirement, given that the topic remains fragmented across financial education, the social sciences, applied economics, and data science. In such conditions, individual studies are often difficult to compare, cumulative evidence is hard to consolidate, and policy implications risk being driven by anecdotal interpretations rather than systematic patterns. The review therefore functions as an ordering mechanism capable of identifying trends, mapping convergence, exposing evidence gaps, and supporting a forward-looking research agenda grounded in traceable evidence.
The main research question was formulated using explicit delimitation logic. The global environment was defined as the population of interest; AI-based digital tools were considered the main interventions to be evaluated; and improvements in financial education (e.g., financial literacy, capacity, knowledge, or related measurable learning outcomes) were established as the main outcome domain. This conceptual architecture enabled a systematic examination of a complex intersection—financial literacy, digital technology, and economic development—that is particularly relevant in contexts where AI adoption is accelerating faster than institutional and curricular adaptation in education systems.
3.2. Search Strategy and Information Sources
The bibliographic search was conducted systematically across three complementary databases covering the 2020–2025 period. This temporal window was defined to capture the most recent inflection point in the digital-financial ecosystem, when AI—particularly machine learning and generative models—expanded rapidly as both a practical educational tool and an increasingly visible scholarly topic. Scopus was selected as the primary database (n = 189) due to its broad multidisciplinary coverage and consistent indexing of high-impact journals in economics, finance, and the social sciences. ScienceDirect contributed 186 records, strengthening access to Elsevier-published literature at the intersection of financial education, AI, ChatGPT-related discussions, and digital transformation. Taylor & Francis Online added 13 records, reinforcing the interdisciplinary perspective given its editorial strength in economics, finance, and technology studies. Taken together, these sources were combined to maximize thematic coverage, reduce indexing bias, and broaden the diversity of academic viewpoints represented in the initial corpus.
Search strings were adapted to the syntactic rules of each platform while preserving conceptual equivalence across databases. For Scopus, the query was:
TITLE-ABS-KEY (“financial education” OR “financial literacy” OR “financial capability” OR “financial knowledge”) AND (“artificial intelligence” OR “AI” OR “machine learning” OR “chatGPT” OR “generative AI” OR “intelligent systems”) AND (“digital era” OR “digital transformation” OR “digital technologies”)
For ScienceDirect, the query was: TITLE-ABSTR-KEY (“financial education” OR “financial literacy” OR “financial capability” OR “financial knowledge”) AND (“artificial intelligence” OR “AI” OR “machine learning” OR “generative AI” OR “chatGPT”) AND (“digital era” OR “digital transformation” OR “digital technologies”)
For Taylor & Francis Online, the query was: TITLE-ABS-KEY (“financial education” OR “financial literacy” OR “financial capability” OR “financial knowledge”) AND (“artificial intelligence” OR AI OR “machine learning” OR “chatGPT” OR “generative AI”) AND (“digital era” OR “digital transformation” OR “digital economy” OR “digital skills”)
This design ensured that retrieval remained anchored to three invariant conceptual blocks—financial education constructs, AI constructs, and digital transformation context—while allowing platform-specific constraints to be respected.
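The three-block logic described above can be illustrated with a minimal sketch that assembles a Boolean string from term lists. The function names (`or_block`, `build_query`) are hypothetical helpers introduced here for illustration; the term lists are copied from the first search string, and the field code is passed as a parameter because it differs across platforms.

```python
# Conceptual blocks copied from the first search string reported above.
FINANCIAL_EDUCATION = ["financial education", "financial literacy",
                       "financial capability", "financial knowledge"]
AI_TERMS = ["artificial intelligence", "AI", "machine learning",
            "chatGPT", "generative AI", "intelligent systems"]
DIGITAL_CONTEXT = ["digital era", "digital transformation", "digital technologies"]


def or_block(terms):
    """Quote each term, join the terms with OR, and wrap them in parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(field_code, *blocks):
    """Combine the OR-blocks with AND under a platform-specific field code."""
    return f"{field_code} " + " AND ".join(or_block(block) for block in blocks)


# The field code is an input; each platform may require a different one.
query = build_query("TITLE-ABS-KEY", FINANCIAL_EDUCATION, AI_TERMS, DIGITAL_CONTEXT)
print(query)
```

Keeping the blocks as separate lists makes it straightforward to vary one block per platform (for example, dropping a term a database does not index) while the other two stay fixed.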
3.3. Eligibility Criteria and Selection Logic
Eligibility criteria were defined to ensure direct thematic relevance and a minimum threshold of scientific quality. Included studies were limited to peer-reviewed journal articles indexed in recognized databases, available in full text, written in English or Spanish, and explicitly addressing the relationship between financial education and artificial intelligence. To ensure empirical grounding, only studies with empirical designs were included, excluding editorials, opinion pieces, and purely speculative contributions. The geographical scope was global, and the review excluded literature published before 2020, consistent with the 2020–2025 search window, to preserve conceptual contemporaneity with the rapid evolution of AI-based educational tools.
Exclusion criteria were applied to prevent peripheral texts from diluting the interpretation. Duplicate records were removed in Python 3.12 in two passes. The first pass matched records by DOI and kept only unique entries. The second pass matched records on the title concatenated with the first author and journal, so that studies with identical titles were retained as long as they came from different authors and journals. Records were also excluded when methodological rigor was not verifiable or clearly described, when the core disciplinary anchor lay outside economics, finance, and the social sciences, or when articles lacked an explicit link to the selected keyword areas (artificial intelligence, financial education, ChatGPT/generative AI, and digital transformation). Studies without full-text access or a verifiable peer-review status were also excluded to protect the integrity of the empirical base.
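The duplicate-removal logic can be sketched as follows. This is a minimal illustration rather than the authors' script: the record fields (`doi`, `title`, `first_author`, `journal`) and the helper names are assumptions, and the normalization (lowercasing, whitespace collapsing) is one reasonable choice that the text does not specify.

```python
def normalize(value):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(str(value or "").lower().split())


def deduplicate(records):
    """Drop repeated DOIs first, then repeated title + first author + journal keys.

    `records` is a list of dicts; the field names are assumed for illustration.
    """
    seen_dois, seen_keys, unique = set(), set(), []
    for record in records:
        doi = normalize(record.get("doi"))
        if doi and doi in seen_dois:
            continue  # first filter: DOI already seen
        key = (normalize(record.get("title")),
               normalize(record.get("first_author")),
               normalize(record.get("journal")))
        if key in seen_keys:
            continue  # second filter: same title, first author, and journal
        if doi:
            seen_dois.add(doi)
        seen_keys.add(key)
        unique.append(record)
    return unique


# Hypothetical records: one DOI duplicate, one key duplicate, and one
# same-title study from a different author and journal (which is kept).
sample = [
    {"doi": "10.1/a", "title": "AI in Finance", "first_author": "Lee", "journal": "J1"},
    {"doi": "10.1/A ", "title": "AI in finance", "first_author": "Lee", "journal": "J1"},
    {"doi": "", "title": "AI in Finance", "first_author": "Kim", "journal": "J2"},
    {"doi": None, "title": "AI  in Finance", "first_author": "kim", "journal": "j2"},
]
print(len(deduplicate(sample)))
```

Using a composite key for the second pass is what allows two studies with the same title to coexist when author and journal differ.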
3.4. Selection Process Under PRISMA 2020
The selection process followed a documented and traceable sequence, consistent with the PRISMA 2020 flow logic (Page et al., 2021). A total of 388 records were identified through database searches; no additional records were incorporated through manual reference searches or gray literature sources. No further duplicates were removed in the preparation phase prior to selection, because deduplication had already been performed during data import, as specified in Section 3.3.
Screening proceeded through title and abstract review of the 388 records using the preliminary eligibility criteria. In this stage, 42 records were excluded due to insufficient thematic relevance, misalignment with geographic scope, incompatible document type, or mismatch with the defined time frame. After screening, 346 reports were retrieved for full-text assessment.
Full-text eligibility assessment of the 346 reports resulted in multiple exclusion categories aligned with the operational logic of the review. A substantial set of full-text articles (296) was excluded for failing to meet the defined study-type criteria, such as theoretical papers without empirical basis or narrative reviews lacking systematic methodology. Additional exclusions were applied to protect temporal contemporaneity and disciplinary fit, including removals for not meeting the 2020-onward focus, for not centering economics/finance/social sciences as the analytical domain, for incompatible research article types, for lacking explicit linkage to the defined keyword constructs, and for access-related constraints where open access was required for verification and reproducibility. Ultimately, 50 studies were retained for inclusion in the systematic review (Figure 1).
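The flow counts reported in Sections 3.2 and 3.4 can be checked with simple arithmetic. The short sketch below only restates the numbers given in the text; the variable names are illustrative.

```python
# Record counts as reported in Sections 3.2 and 3.4.
identified_by_source = {
    "Scopus": 189,
    "ScienceDirect": 186,
    "Taylor & Francis Online": 13,
}
excluded_at_screening = 42
excluded_at_full_text = 296

identified = sum(identified_by_source.values())        # records identified
assessed_full_text = identified - excluded_at_screening  # reports assessed in full text
included = assessed_full_text - excluded_at_full_text    # studies retained

print(identified, assessed_full_text, included)  # 388 346 50
```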
3.5. Data Extraction and Quality Appraisal
Data extraction was conducted through a standardized coding framework designed to capture both bibliographic and methodological comparability across a heterogeneous corpus. For each included study, structured variables were recorded, including country and time period, research design, sample size, core variables analyzed, methods applied, and the effectiveness measures reported. This structure enabled harmonized synthesis despite substantial variation in research traditions and measurement approaches across the included studies.
Methodological quality appraisal was implemented using criteria adapted to the diversity of study designs encountered. Rather than applying a single standardized instrument that might privilege one design family, the appraisal emphasized thematic relevance, clarity of method description, coherence in data analysis, transparency of model development and validation (when applicable), and alignment between methods and conclusions. While the appraisal approach was intentionally pragmatic, it was also methodologically justified by the dispersion of designs, which can introduce bias when tools tailored for one study type are forced onto dissimilar empirical structures.
3.6. Bibliometric Analysis, Sentiment Modeling, and Integrated Evidence Synthesis
The analytical strategy was implemented in two complementary phases that jointly strengthened interpretation. In the first phase, bibliometric mapping was used to identify the thematic structure, convergence zones, and research trends associated with AI and financial education in the digital era. After corpus cleaning, metadata fields (title, authors, year, country, keywords, and references) were extracted and processed with bibliometric software (VOSviewer 1.6.20) to generate keyword co-occurrence maps and visualize conceptual clusters. This stage served as an orienting framework, allowing the scientific ecosystem to be “read” as a network before interpreting results at the study level and providing cluster-informed categories for organizing narrative synthesis.
In the second phase, a computational pipeline for sentiment and text analytics was implemented to examine not only what is being studied but also how evidence is framed in the language of abstracts. Abstract-level framing was treated as analytically relevant because emerging technology literature frequently mixes empirical claims with rhetorical positioning—optimism, caution, risk emphasis, or ethical concern—that may shape policy interpretation and perceived effectiveness. To support transparency and reproducibility, the pipeline was executed in Python using a structured workflow that began with the ingestion of an Excel corpus of titles and abstracts. Automated column detection was applied to identify title and abstract fields across possible multilingual naming conventions, followed by conservative text cleaning that normalized whitespace artifacts while preserving semantic content.
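A minimal sketch of the column-detection and conservative cleaning steps is given below. The candidate header names are hypothetical, since the list used in the actual pipeline is not reported, and the example works on a plain list of headers rather than on an Excel file so that it stays self-contained.

```python
import re
import unicodedata

# Hypothetical candidate names; the list used in the actual pipeline is not reported.
TITLE_NAMES = {"title", "titulo", "article title"}
ABSTRACT_NAMES = {"abstract", "resumen", "summary"}


def fold(name):
    """Lowercase, strip accents, and trim a header so 'Título' matches 'titulo'."""
    decomposed = unicodedata.normalize("NFKD", str(name))
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.strip().lower()


def detect_column(headers, candidates):
    """Return the first header whose folded form is a known candidate, else None."""
    for header in headers:
        if fold(header) in candidates:
            return header
    return None


def clean_text(text):
    """Conservative cleaning: collapse whitespace runs and leave all other content."""
    return re.sub(r"\s+", " ", str(text or "")).strip()


headers = ["Autores", " Título ", "Resumen"]
print(detect_column(headers, TITLE_NAMES), detect_column(headers, ABSTRACT_NAMES))
print(clean_text("AI \n\t in   financial  education "))
```

Returning the original header (rather than the folded form) lets the caller index the source table without renaming its columns.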
Language detection was performed at the abstract level to characterize corpus multilinguality and to inform interpretation of lexicon-based sentiment tools. The pipeline then computed descriptive text metrics capturing structural and stylistic properties of abstracts, including character count, word count, sentence count, average word length, average sentence length (in words), lexical diversity, and punctuation-based indicators (e.g., exclamation marks, question marks, commas, semicolons, colons, quotation marks, parentheses). These features provided a baseline profile of writing complexity and expressiveness, enabling the exploration of whether rhetorical intensity or textual structure co-varied with sentiment and stance.
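Several of the descriptive metrics listed above can be computed with the standard library alone. The sketch below is illustrative: the sentence splitter is a simple punctuation rule, and lexical diversity is taken to be the type-token ratio, which is an assumption because the text does not define it.

```python
import re


def text_metrics(abstract):
    """Compute basic structural metrics for one abstract."""
    words = re.findall(r"\b\w+\b", abstract)
    # Naive sentence split on whitespace that follows ., !, or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]
    n_words, n_sentences = len(words), len(sentences)
    return {
        "char_count": len(abstract),
        "word_count": n_words,
        "sentence_count": n_sentences,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / n_sentences if n_sentences else 0.0,
        # Assumed definition: unique lowercase words divided by total words.
        "lexical_diversity": len({w.lower() for w in words}) / n_words if n_words else 0.0,
        "question_marks": abstract.count("?"),
        "commas": abstract.count(","),
    }


metrics = text_metrics("AI helps learners. Does AI help, really?")
print(metrics["word_count"], metrics["sentence_count"], metrics["avg_sentence_length"])
```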
To strengthen validity through triangulation, sentiment was estimated through three complementary approaches. First, VADER sentiment scores were computed (negative, neutral, positive proportions and compound score) as a lightweight lexicon-based proxy, acknowledging its stronger calibration in English and therefore treating its results for Spanish texts with caution. Second, TextBlob polarity and subjectivity were computed to produce interpretable polarity and a subjectivity signal, from which an objectivity proxy (1 − subjectivity) was derived to approximate the degree of rhetorical versus factual framing. Third, a multilingual transformer model (cardiffnlp/twitter-xlm-roberta-base-sentiment, based on XLM-RoBERTa) was implemented to mitigate language bias and provide probabilistic sentiment outputs that are more robust across English and Spanish texts. GPU acceleration was used when available to improve computational efficiency, and label mapping was standardized to ensure consistent interpretation of negative/neutral/positive outputs across model configurations.
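Two of the post-processing steps mentioned above, label standardization and the objectivity proxy, can be sketched without loading any model. The raw label variants in `LABEL_MAP` and the list-of-dictionaries input shape are assumptions made for illustration; the labels a given checkpoint emits depend on its configuration.

```python
# Assumed raw-label variants; the labels actually emitted depend on the model configuration.
LABEL_MAP = {
    "label_0": "negative", "label_1": "neutral", "label_2": "positive",
    "negative": "negative", "neutral": "neutral", "positive": "positive",
}


def standardize_scores(raw_scores):
    """Map raw label/score pairs onto a fixed negative/neutral/positive dictionary."""
    standardized = {"negative": 0.0, "neutral": 0.0, "positive": 0.0}
    for item in raw_scores:
        label = LABEL_MAP[item["label"].strip().lower()]
        standardized[label] = float(item["score"])
    return standardized


def objectivity(subjectivity):
    """Objectivity proxy described in the text: 1 - subjectivity."""
    return 1.0 - subjectivity


raw = [{"label": "LABEL_2", "score": 0.7},
       {"label": "LABEL_0", "score": 0.1},
       {"label": "LABEL_1", "score": 0.2}]
print(standardize_scores(raw), objectivity(0.25))
```

Standardizing to one fixed dictionary shape means downstream code never has to know which label convention a particular model used.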
Because academic abstracts may exceed transformer token limits, a sentence-aware chunking procedure was applied. Abstracts were split into sentence units and aggregated into chunks constrained by a maximum token budget, with truncation only when strictly necessary. Each chunk was independently scored by the transformer, and abstract-level probabilities were computed as a token-weighted average across chunks. This aggregation prevented short fragments from dominating document-level classification and ensured that longer abstracts were represented proportionally. Empty abstracts were explicitly flagged to avoid default labeling artifacts. A continuous sentiment indicator, sentiment_index, was computed as the difference between transformer probabilities, P(positive) − P(negative), providing a bounded measure interpretable on an approximate −1 to +1 continuum. In parallel, a simple rhetorical intensity proxy, text_energy, combined punctuation signals with normalized word count to quantify expressive emphasis.
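The chunking and aggregation procedure can be sketched as follows. This is an illustration under stated assumptions: whitespace word counts stand in for the transformer tokenizer, the sentence splitter is a simple punctuation rule, and `score_fn` is a hypothetical callable that returns negative/neutral/positive probabilities for a chunk.

```python
import re


def split_sentences(text):
    """Naive sentence split on whitespace that follows ., !, or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_tokens(text):
    """Whitespace word count; a stand-in for the model tokenizer (assumption)."""
    return len(text.split())


def chunk_sentences(text, max_tokens):
    """Pack whole sentences into chunks of at most max_tokens tokens."""
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        words = sentence.split()
        if len(words) > max_tokens:
            words = words[:max_tokens]  # truncate only an over-long single sentence
        if current and current_len + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.extend(words)
        current_len += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


def aggregate(chunks, score_fn):
    """Token-weighted average of chunk probabilities; None flags an empty abstract."""
    if not chunks:
        return None
    total = sum(count_tokens(chunk) for chunk in chunks)
    result = {"negative": 0.0, "neutral": 0.0, "positive": 0.0}
    for chunk in chunks:
        weight = count_tokens(chunk) / total
        scores = score_fn(chunk)
        for label in ("negative", "neutral", "positive"):
            result[label] += weight * scores[label]
    result["sentiment_index"] = result["positive"] - result["negative"]
    return result
```

Because the weights are proportional to chunk length, a two-word trailing fragment contributes far less to the abstract-level probabilities than a full-budget chunk, which is the behavior the paragraph above describes.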
To complement sentiment with topic-proximal lexical evidence, the pipeline implemented exploratory term extraction using a CountVectorizer over cleaned abstracts, generating unigram and bigram frequencies with bilingual stopword filtering (English/Spanish). This step enabled the identification of dominant expressions and co-occurring phrases that contextualize sentiment results and support interpretive triangulation between what is discussed and how it is framed.
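The term-extraction idea can be illustrated with a standard-library stand-in for CountVectorizer. The sketch below is not the scikit-learn implementation: the tokenizer rule is simplified, and the stopword set is a tiny illustrative sample rather than full English and Spanish lists.

```python
import re
from collections import Counter

# Tiny illustrative stopword sample; a real run would use full English/Spanish lists.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "for",
             "el", "la", "de", "y", "en", "los", "las"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[^\W\d_]+", text.lower()) if t not in STOPWORDS]


def ngram_counts(abstracts, n):
    """Count n-grams across all abstracts (n=1 for unigrams, n=2 for bigrams)."""
    counts = Counter()
    for abstract in abstracts:
        tokens = tokenize(abstract)
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


corpus = ["Financial literacy and AI in education",
          "La educacion financiera y financial literacy"]
print(ngram_counts(corpus, 1).most_common(2))
print(ngram_counts(corpus, 2).most_common(1))
```

Stopwords are removed before n-grams are formed, so a bigram can span a dropped function word; this matters when comparing counts against tools that filter after n-gram construction.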
Finally, the computational outputs were consolidated into reproducible artifacts. Enriched datasets, descriptive summaries, missingness profiles, label distributions, correlation matrices among numeric metrics, top terms, and ranked subsets (e.g., most positive/negative abstracts by sentiment_index) were exported into a multi-sheet Excel file (Resultados_Sentimiento_Abstracts.xlsx). Interactive visual analytics were produced using Plotly 6.6.0 and exported as an HTML dashboard (Dashboard_Sentimiento_Abstracts.html), including sentiment label distributions, sentiment_index histograms, polarity–subjectivity scatterplots colored by transformer classes, cross-method comparisons (e.g., VADER compound across transformer labels), correlation heatmaps, and term-frequency bar charts.
The bibliometric phase and the NLP/text-analytics phase were then integrated through narrative synthesis. Findings were organized into thematic categories derived from bibliometric clusters, and study-level evidence was triangulated with abstract-level sentiment distributions and lexical patterns. This integration enabled a coherent evidence synthesis while acknowledging that methodological heterogeneity across the 50 included studies prevented a quantitative meta-analysis.
3.7. Methodological Limitations
Limitations were classified into search, appraisal, synthesis, and computational constraints. Search limitations included restriction to three databases, language bias toward English and Spanish, lack of systematic manual reference searching, and exclusion of institutional gray literature. Appraisal limitations reflected the absence of a single standardized tool applicable across heterogeneous study designs and the necessarily qualitative nature of cross-design quality assessment. Synthesis limitations stemmed from methodological heterogeneity and the absence of comparable effect-size reporting, which prevented meta-analysis and limited statistical assessment of publication bias.
The sentiment and text-analytics pipeline also introduced specific constraints. Lexicon-based tools (VADER, TextBlob) are less reliable in Spanish and were therefore treated as secondary signals and primarily used for triangulation. Transformer-based sentiment, while multilingual, can be sensitive to domain shift because the model is trained on short-form social media text; chunking and token-weighted aggregation mitigated length constraints, yet academic writing conventions may still affect calibration. Term extraction relied on frequency-based n-grams without lemmatization to maximize transparency and reproducibility, which can fragment semantically equivalent variants. Despite these constraints, the multi-method triangulation approach increased robustness by avoiding single-method dependence and by providing convergent indicators across sentiment, structure, and lexical evidence.
3.8. Ethical Considerations
This systematic review did not involve experimentation with humans or animals and did not require the handling of sensitive personal data; consequently, institutional ethics committee approval was not necessary. Nonetheless, principles of transparency, scientific integrity, and proper source attribution were observed in accordance with recognized standards of good scientific practice (COPE, 2019). All records were handled in compliance with copyright provisions, and citations and references were prepared following the American Psychological Association (APA), 7th edition, guidelines.
4. Results
The bibliometric analysis shows that the literature on artificial intelligence displays a markedly interdisciplinary structure, in which artificial intelligence occupies the central core of the network and articulates multiple domains of application. Its position, size, and density confirm that AI functions as the organizing axis of the field, simultaneously connecting discussions on education, machine learning, finance, health, management, and emerging technologies. Within this framework, financial education does not appear as an isolated core but rather as a thematic line embedded within a broader ecosystem of applied AI research.
Figure 2 presents a keyword co-occurrence network generated with VOSviewer, in which the scientific output is organized into several partially overlapping clusters. The node artificial intelligence acts as the main point of convergence and maintains visible connections with terms such as machine learning, education, students, engineering education, humans, finance, financial management, and ChatGPT, reflecting the cross-cutting nature of the field and its expansion into multiple areas of knowledge.
One of the most visible areas of the network groups terms linked to the educational domain, including education, students, higher education, decision-making, teaching, and curricula, in close relationship with computational concepts such as machine learning, deep learning, and natural language processing. This configuration suggests that an important portion of the literature examines how AI is transforming teaching, learning, and instructional support processes. It also shows that AI-related education is not limited to a single thematic content area but extends across diverse training contexts, including technical, university, and professional education.
Another group of terms connects AI with the financial and technological domains through nodes such as finance, fintech, financial management, financial markets, financial sectors, information management, blockchain, big data, internet of things, and cybersecurity. This pattern supports the interpretation that artificial intelligence is studied not only as a pedagogical tool but also as part of a broader digital transformation infrastructure affecting management, markets, and financial services. From this perspective, financial education should be understood in relation to this wider environment of digitalization, automation, and algorithm-mediated decision-making.
The network also incorporates a clearly human-centered and health-related component, reflected in the presence of terms such as humans, article, clinical practice, medical research, patient education, health care access, health care policy, health care personnel, telemedicine, and personalized medicine. The visibility of these nodes suggests that the educational function of AI is not restricted to the classroom or to financial literacy but also unfolds in clinical, institutional, and professional settings where knowledge transfer, training, and user interaction play a central role. Consequently, the educational dimension of AI appears integrated into a broader applied framework in which learning, communication, and decision support converge.
Likewise, terms associated with generative AI and conversational systems are also present, especially ChatGPT, chatbot, chatbots, and large language models. These nodes indicate that the literature is increasingly incorporating language-based tools for tutoring, guidance, content generation, and user support. However, their position within the network suggests that this remains an emerging line of research that is still articulated with more established areas such as education, machine learning, health, and management, rather than constituting the dominant thematic center on its own.
Figure 3 adds a temporal dimension to the analysis by showing the relative evolution of terms within the network. The color distribution suggests a shift from themes associated with health, pandemic conditions, and institutional adaptation toward more recent interest in generative AI, language models, and applications centered on educational and human interaction. Among the earlier nodes are pandemic, COVID-19, coronavirus disease 2019, telemedicine, and several terms linked to health systems, reflecting a period in which the adoption of AI-based solutions was accelerated by the need for continuity, virtualization, and response to disruptive scenarios.
In an intermediate position are broad and structuring terms such as artificial intelligence, machine learning, education, students, humans, engineering education, finance, and financial management, which function as bridges across different domains. Their centrality indicates that the literature progressively consolidated a shared vocabulary to describe processes of digital transformation in education, health, management, and finance. Rather than a closed field focused on a single theme, the network reveals a common conceptual platform from which diverse sectoral applications are being developed.
The most recent nodes are concentrated around terms such as ChatGPT, large language models, large language model, adult, male, female, controlled study, and financial literacy. This configuration suggests that the most recent literature is moving toward generative AI tools, conversational interfaces, and more specific application settings, including empirical studies with delimited populations. The appearance of financial literacy within this more recent layer indicates that financial education is gaining visibility within the scientific conversation, although it still remains embedded in a network dominated by broader AI, education, and human-centered application terms.
This temporal pattern allows two main trends to be identified. First, the field appears to have moved from a phase marked by the urgency of digitalization and virtualization under crisis conditions toward a stage dominated by discussions on language models, chatbots, and intelligent interaction. Second, financial education emerges as a growing line of interest, but one that is still linked to adjacent areas such as higher education, decision-making, finance, and professional training, rather than as a fully consolidated and autonomous subfield.
Figure 2 and Figure 3 depict a literature that is dynamic, interdisciplinary, and expanding, in which artificial intelligence operates as the central hub connecting multiple research agendas. The convergence of machine learning, education, health, finance, and generative AI suggests that the field is evolving toward increasingly integrated and user-centered applications. Within this structure, financial education appears as an emerging strand with growth potential, especially insofar as conversational tools and language models expand their possibilities for personalization, decision support, and pedagogical mediation.
Building on this, the natural language processing analysis is applied to the scientific studies selected through PRISMA stratification.
Figure 4 displays the distribution of the sentiment_index (polarity) across the analyzed abstracts and, overall, portrays a field that communicates in a predominantly technical and measured tone. Most values cluster around zero, with the central mass concentrated between −0.0155 and 0.0749 (interquartile range), indicating that the majority of texts carry low affective intensity and rely on a descriptive, analytical register rather than an overtly persuasive one. Consistent with this pattern, the automated tone classification is overwhelmingly neutral.
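A short sketch of how such a distribution summary might be computed follows. The scores, the `summarize` helper, and the `neutral_band` cut-off of 0.1 are all assumptions made for illustration; they are not the values or thresholds behind Figure 4.

```python
from statistics import quantiles

# Hypothetical polarity scores in [-1, 1]; not the values behind Figure 4.
scores = [-0.30, -0.05, -0.02, 0.00, 0.01, 0.03, 0.05, 0.08, 0.10, 0.45]


def summarize(scores, neutral_band=0.1):
    """Quartiles, interquartile range, and share of near-zero (neutral) scores."""
    # method="inclusive" treats the data as the full population of abstracts.
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    # neutral_band is an assumed cut-off, not one reported in the study.
    neutral_share = sum(abs(s) <= neutral_band for s in scores) / len(scores)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,
        "neutral_share": neutral_share,
    }


summary = summarize(scores)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

A narrow interquartile range centred near zero, together with a high neutral share, is the numerical counterpart of the "technical and measured tone" described above.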
Table 1 captures a defining feature of the corpus: research on AI and financial education is written in a largely neutral, technical register, yet the direction of sentiment depends on the analytic lens used. On one hand, the sentiment_index shows a slight positive tilt on average, consistent with literature that frequently frames AI as enabling personalization, scalability, and learner support. On the other hand, TextBlob polarity (tb_polarity) trends more negatively, pointing to a more cautious or critical reading. This discrepancy should not be treated as a flaw; it is a meaningful methodological implication. Different sentiment approaches are sensitive to different aspects of academic language. In a technical domain, optimism often appears through terms such as “potential,” “improvement,” and “efficiency,” while caution enters through normative or risk-laden vocabulary (“bias,” “privacy,” “limitations,” “ethical concerns”). Depending on the dictionary, model assumptions, and domain adaptation, the same abstract can be interpreted as mildly optimistic or prudently critical.
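The dependence of polarity on the chosen dictionary can be shown with a deliberately simple sketch. The two lexicons and the sample sentence below are invented for illustration; they are not the dictionaries used by TextBlob or by the sentiment_index, and the scoring rule (mean weight of matched words) is an assumption.

```python
import re

# Toy lexicons, invented for illustration only.
OPPORTUNITY_LEXICON = {
    "potential": 1, "improvement": 1, "efficiency": 1, "personalization": 1,
}
# Same positive terms, plus risk-laden vocabulary scored negatively.
RISK_AWARE_LEXICON = {
    **OPPORTUNITY_LEXICON,
    "bias": -1, "privacy": -1, "limitations": -1, "concerns": -1,
}


def polarity(text, lexicon):
    """Mean weight of the lexicon words found in the text (0.0 if none match)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = [lexicon[token] for token in tokens if token in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


abstract = (
    "AI tutors show potential for personalization and efficiency, "
    "although bias, privacy concerns, and limitations in the data remain."
)

optimistic_reading = polarity(abstract, OPPORTUNITY_LEXICON)
cautious_reading = polarity(abstract, RISK_AWARE_LEXICON)
print(optimistic_reading, cautious_reading)
```

The same sentence scores as positive under the first lexicon and as slightly negative under the second, which is the kind of divergence the text attributes to differences in dictionary and model assumptions.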
This tension is reinforced by the transformer-based class probabilities (trf_neg/trf_neu/trf_pos), which indicate that neutrality dominates. Substantively, that implies the field is not communicating in an emotionally charged way; it is presenting methods, contributions, and problem framings rather than overt advocacy. The persistent presence of a negative component still matters, however: it suggests that governance topics such as privacy, fairness, and bias are not peripheral but embedded as feasibility conditions, even if they rarely translate into strongly affective phrasing.
At the same time, the relatively high subjectivity estimate implies that many abstracts include interpretive or prospective formulations (e.g., “suggests,” “promising,” “could improve”), despite maintaining an overall neutral tone. For this review, the implication is important: the corpus often mixes empirical reporting with expectation-setting and forward-looking claims. This, in turn, strengthens the argument that the field needs more designs that reduce inferential ambiguity (more controlled evaluations, clearer reporting standards, and comparable outcome metrics) so that discourse can move from plausibility and promise toward demonstrated effectiveness in sustained financial outcomes.
The writing-profile indicators (text length and sentence structure) complement the content reading: abstracts are, on average, relatively long and sentence-heavy, which is typical of interdisciplinary work that must simultaneously define concepts, justify methods, and position contributions across education, finance, and computing. Lexical diversity and average word length are consistent with a technically varied vocabulary, suggesting the phenomenon is being narrated as a “systems” problem rather than a narrow classroom issue or a pure fintech application. The implication is that progress requires integrative frameworks that align pedagogy (how learning happens), technology (how models recommend and adapt), and governance (how systems are audited and protected).
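The writing-profile indicators can be approximated with a few lines of standard-library code. This is a sketch under stated assumptions: sentences are split on terminal punctuation, words are runs of letters and apostrophes, and lexical diversity is taken as the type-token ratio. The study's exact definitions are not given, and text_energy is omitted because its formula is not specified in the text.

```python
import re


def writing_profile(text):
    """Rough length and vocabulary indicators for one abstract."""
    # Assumed sentence rule: split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumed word rule: runs of letters and apostrophes, lower-cased.
    words = re.findall(r"[a-z']+", text.lower())
    n_words = len(words)
    return {
        "n_sentences": len(sentences),
        "n_words": n_words,
        "avg_word_length": sum(map(len, words)) / n_words if n_words else 0.0,
        # Type-token ratio: distinct words divided by total words.
        "lexical_diversity": len(set(words)) / n_words if n_words else 0.0,
    }


profile = writing_profile("AI supports learning. AI supports saving.")
print(profile)
```

Because the type-token ratio tends to fall as texts get longer, comparisons of lexical diversity across abstracts of very different lengths should be read with that in mind.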
The combination of “text energy,” variability across polarity measures, and the spread in positive/negative probabilities suggests a heterogeneous discursive landscape: a subset of abstracts adopts clearly opportunity-oriented framing, while another subset emphasizes constraints and risks. This supports a useful synthesis: the field is not polarized by ideology but differentiated by implementation assumptions. The core question is not whether AI “works” in general, but under what technical and pedagogical conditions it yields real, equitable, and sustainable impact. Accordingly, Table 1 reinforces the need to evaluate AI in financial education not only through usability or model performance but also through robustness, explainability, fairness, data protection, and medium- to long-term behavioral outcomes.
Figure 5 works as a “coherence map” between two dimensions that repeatedly intersect throughout this study: on the one hand, sentiment-related signals (polarity/neutrality/negativity) and, on the other, stylistic features of the abstracts (length, density, and lexical complexity). A first clear takeaway is that not all sentiment metrics are capturing the same construct. In particular, the sentiment_index aligns very tightly with the probability of positive tone (trf_pos) and strongly opposes the probability of negative tone (trf_neg). The key implication is that the sentiment_index is largely tracking the same dominant axis as the transformer-based classifier—an overall “positive framing versus negative framing” orientation—rather than more delicate contextual nuances of polarity. By contrast, TextBlob polarity (tb_polarity) shows only weak links to that dominant axis, which is methodologically informative: in technical academic language, some lexicon-based tools may react to different cues (e.g., normative vocabulary, cautions, “limitations”) without necessarily converging with models designed to classify overall tone.
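A correlation map of this kind rests on pairwise correlation coefficients. The sketch below assumes Pearson correlation, which the text does not confirm, and uses invented values for six hypothetical abstracts; the column names follow the text, but the numbers are not those underlying Figure 5.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical per-abstract metrics (six abstracts); not the study's data.
metrics = {
    "sentiment_index": [-0.15, -0.02, 0.15, 0.27, 0.07, -0.07],
    "trf_pos": [0.05, 0.10, 0.20, 0.30, 0.15, 0.08],
    "trf_neg": [0.20, 0.12, 0.05, 0.03, 0.08, 0.15],
    "tb_polarity": [0.02, -0.05, 0.02, 0.00, -0.08, 0.06],
}

names = list(metrics)
matrix = {
    row: {col: pearson(metrics[row], metrics[col]) for col in names}
    for row in names
}

# Print the matrix row by row, rounded to two decimals.
for row in names:
    print(row.ljust(16), " ".join(f"{matrix[row][col]:+.2f}" for col in names))
```

In this toy sample, sentiment_index correlates strongly and positively with trf_pos, strongly and negatively with trf_neg, and only weakly with tb_polarity, which is the qualitative pattern described for Figure 5.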
Where Figure 5 becomes especially valuable for the discussion is in the relationship between tone and style. Length indicators (word and sentence counts) move almost as a block with text_energy, suggesting that much of what looks like “intensity” is driven by how much is said and how extensively claims are elaborated, rather than by affective tone. In that context, a meaningful interpretive pattern emerges: longer abstracts are associated with a higher probability of negativity (trf_neg) and, at the same time, with a lower sentiment_index. The substantive reading is not that “length causes negativity,” but that more developed abstracts tend to include more methodological caution, limitations, risks, and boundary conditions, features that supply stronger negative or less positive cues to the classifier. This matters because it introduces a confounding consideration: when comparing discursive orientation across studies, abstract length can influence the likelihood that cautionary language appears, thereby shifting sentiment signals even if the core contribution is not “negative.”
At the same time, lexical diversity (lexical_diversity) is positively associated with the sentiment_index and with trf_pos and negatively associated with trf_neg. Interpreted substantively, this suggests that when an abstract employs a more varied vocabulary, rather than relying on repeated technical formulae, it tends to construct a more propositional, opportunity-oriented framing: it articulates applications, scope, and contributions with richer language. Conversely, lower lexical diversity may reflect a tighter focus on constraint and risk-related phrasing. This is not a causal claim, but it offers a strong narrative insight for this review: when “technical optimism” is present, it often comes with more linguistically articulated framing, not necessarily with greater length.
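Because abstract length may confound associations between tone and style, one way to probe the pattern is a first-order partial correlation that holds length fixed. The sketch below is an illustrative check and not a procedure reported in this study; the data and the function names `pearson` and `partial_corr` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z from both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical values for six abstracts; not the study's data.
n_words = [120, 150, 180, 210, 240, 270]
lexical_diversity = [0.72, 0.70, 0.66, 0.63, 0.61, 0.58]
sentiment_index = [0.12, 0.10, 0.06, 0.05, 0.01, -0.02]

raw = pearson(lexical_diversity, sentiment_index)
adjusted = partial_corr(lexical_diversity, sentiment_index, n_words)
print(f"raw r = {raw:.3f}, r controlling for length = {adjusted:.3f}")
```

If the adjusted coefficient is much smaller than the raw one, the association between lexical diversity and sentiment is largely carried by length; if it stays close, the link is more plausibly independent of how long the abstract is.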
6. Conclusions
Artificial intelligence is reshaping financial education as a socio-technical capability mediated by algorithms, data, and digitally supported decision environments. Across the reviewed literature, AI-related tools—particularly natural language processing, conversational systems, and recommendation-based learning supports—are increasingly associated with adaptive and personalized pedagogical models. At the same time, these developments must be interpreted in light of the COVID-19 pandemic (2020–2022), which accelerated the digitalization of teaching, learning, and financial interactions and created strong incentives for scalable technology-mediated educational solutions. This pandemic-driven acceleration likely contributed to the thematic growth of AI in financial education during the review period, but it also intensified a central weakness in the field: the speed of adoption has often exceeded the pace of rigorous empirical validation.
The keyword and thematic analyses confirm that AI currently functions as an organizing axis connecting financial literacy, economic behavior, and emerging digital technologies. However, this centrality should not be interpreted as evidence of methodological maturity. The field remains largely exploratory and heterogeneous, with substantial variation in intervention designs, outcome definitions, and evaluation criteria. As a result, claims about sustained impacts on financial behavior remain constrained by the scarcity of longitudinal studies, controlled comparisons, and standardized metrics across contexts. In parallel, the prominence of ethics-related terms indicates that privacy, fairness, transparency, and algorithmic bias are not secondary concerns, but structural conditions for responsible implementation, particularly in settings characterized by institutional asymmetries and uneven digital preparedness.
The sentiment analysis adds a complementary perspective by showing that research on AI and financial education is communicated in a predominantly technical and neutral tone, with a slight positive orientation toward AI as a facilitator of personalization and access. Nevertheless, these sentiment patterns should be interpreted as indicators of discursive framing rather than as direct evidence of pedagogical effectiveness. In other words, the linguistic orientation of abstracts reflects how authors position opportunities, risks, and ethical tensions, but it does not provide causal estimates of learning gains, retention, or durable behavioral change. This distinction is critical for avoiding overinterpretation and reinforces the need to combine computational text analysis with stronger empirical designs that directly test outcomes over time.
A major implication of this review is that future research must move beyond proof-of-concept enthusiasm and address underexplored empirical questions with greater rigor. In particular, the literature still provides limited evidence on whether AI-supported financial education produces long-term changes in saving, borrowing, budgeting, and financial decision-making beyond short-term gains in knowledge or engagement. The field also needs more causal evaluation strategies, including experimental and quasi-experimental designs, to isolate the contribution of AI-enabled features relative to conventional digital instruction. Likewise, bias, explainability, and fairness require deeper examination in vulnerable and resource-constrained populations, where automated systems may unintentionally reproduce exclusion, misclassification, or opaque recommendations.
Another critical gap concerns cultural and linguistic adaptation. Many AI tools are discussed as if they were universally transferable, yet financial behaviors, educational practices, and decision norms are deeply shaped by language, context, and local institutional conditions. For this reason, future studies should evaluate AI-mediated financial education across diverse linguistic and cultural settings rather than extrapolating findings from high-resource environments. Progress also depends on the development of standardized and comparable metrics for financial literacy, learning outcomes, engagement, and behavioral change, since the current lack of measurement consistency limits cumulative knowledge and cross-study synthesis. In addition, cost-effectiveness and inclusion should become central evaluation dimensions, especially in emerging and resource-constrained contexts, where the practical value of AI depends not only on whether it improves learning but also on whether it does so efficiently, equitably, and at scale.
An important methodological limitation of this study is that the natural language processing (NLP) pipeline—particularly the sentiment analysis applied to article abstracts—does not directly estimate the pedagogical effectiveness of AI tools in financial education. Rather, this procedure primarily captures the discursive tone, framing, and rhetorical orientation through which authors report their findings (e.g., emphasis on opportunities, ethical concerns, or promising results), but it does not measure the causal magnitude of educational impact on outcomes such as learning gains, retention, behavioral change, or sustained improvement in financial capabilities. Accordingly, the sentiment results should be interpreted as a complementary indicator of the scientific narrative rather than as direct evidence of effectiveness. Future research should therefore complement this approach with meta-analyses of measurable outcomes, longitudinal designs, and comparable metrics of learning performance and financial behavior to better connect NLP-based analytical sophistication with stronger practical inferences about the real effects of AI in educational settings.