Article

Enhancing Mathematical Knowledge Graphs with Large Language Models

by Antonio Lobo-Santos and Joaquín Borrego-Díaz *
Departamento de Ciencias de la Computación e Inteligencia Artificial, E.T.S. Ingeniería Informática, Universidad de Sevilla, Avda. Reina Mercedes s.n., 41012 Sevilla, Spain
* Author to whom correspondence should be addressed.
Modelling 2025, 6(3), 53; https://doi.org/10.3390/modelling6030053
Submission received: 24 February 2025 / Revised: 29 April 2025 / Accepted: 20 June 2025 / Published: 24 June 2025

Abstract

The rapid growth in scientific knowledge has created a critical need for advanced systems capable of managing mathematical knowledge at scale. This study presents a novel approach that integrates ontology-based knowledge representation with large language models (LLMs) to automate the extraction and organization of, and reasoning over, mathematical knowledge from LaTeX documents. The proposed system enhances Mathematical Knowledge Management (MKM) by enabling structured storage, semantic querying, and logical validation of mathematical statements. The key innovations include a lightweight ontology for modeling hypotheses, conclusions, and proofs, and algorithms for optimizing assumptions and generating pseudo-demonstrations. A user-friendly web interface supports visualization and interaction with the knowledge graph, facilitating tasks such as curriculum validation and intelligent tutoring. The results demonstrate high accuracy in mathematical statement extraction and ontology population, with potential scalability for handling large datasets. This work bridges the gap between symbolic knowledge and data-driven reasoning, offering a robust solution for scalable, interpretable, and precise MKM.

1. Introduction

Advancements in Artificial Intelligence (AI) have transformed numerous fields, including Mathematical Knowledge Management (MKM)—an interdisciplinary field concerned with representing, accessing, and maintaining mathematical knowledge using computational tools, encompassing areas like formal libraries, automated theorem proving, and mathematical search [1,2]. In this context, two key technologies stand out: knowledge graphs (KGs) and large language models (LLMs). KGs provide a structured and semantically rich representation of data, which is essential for managing complex mathematical knowledge. LLMs, in contrast, excel at understanding and generating natural language, enabling the extraction, interpretation, and organization of formal mathematical statements.

1.1. Context and Importance of the Study

The integration of KGs and LLMs has attracted considerable interest as it combines the advantages of structured and unstructured data processing. KGs facilitate advanced reasoning and semantic search by organizing data into meaningful relationships [3,4], while LLMs utilize extensive training on textual data to support a wide range of Natural Language Processing (NLP) tasks. Although both technologies have proven highly effective, their individual limitations become evident in scenarios that require high precision and domain-specific reasoning—such as those encountered in MKM. Recent empirical studies highlight significant failure modes: LLMs can exhibit high rates of hallucination (generating plausible but false information) and logical inconsistency, even when grounded with KG context [5,6].
Storing exact formulations, assumptions, and proofs within a KG makes it possible to support MKM rigorously and helps mitigate these failure modes.

1.2. State of the Research Field and Objectives

Recent studies emphasize the potential of combining structured ontologies with LLM-driven techniques to enhance domain-specific knowledge extraction and verification [7,8]. Nevertheless, challenges persist, particularly in the automated construction of high-quality KGs in domains where formal precision is critical [9,10].
Several influential approaches tackle the integration of KGs and LLMs for mathematical tasks, representing the current state of the art. Graph-Constrained Reasoning (GCR) frameworks aim to ensure faithful KG reasoning by integrating the KG structure directly into the LLM decoding process, thereby reducing hallucinations but often requiring complex KG curation [11]. Retrieval-Augmented Generation (RAG) pipelines, such as those evaluated by LemmaHead, focus on enhancing theorem proving by retrieving relevant context (definitions and lemmas) from authoritative sources, but performance still depends heavily on the quality and structure of the retrieved information [12]. Concurrently, initiatives like the Xena Project are building large-scale knowledge graphs based on formal proof assistant libraries like Lean 4 Mathlib, offering high logical rigor but demanding significant formalization effort [13,14]. Our approach contrasts with these by proposing a more lightweight and adaptable ontology designed for automated extraction from semi-structured LaTeX sources, aiming for a balance between scalability, formal structure, and ease of population, particularly suited for managing knowledge across diverse mathematical documents.
In MKM, ensuring the logical consistency and reproducibility of the underlying knowledge is essential. Our approach leverages LLMs to automate aspects of KG construction while employing a lightweight ontology-based framework to preserve logical integrity. This integration not only facilitates the validation of mathematical curricula but also enables the storage of precise computational methods, supporting their accurate retrieval and application in simulation workflows as needed.

1.3. Purpose, Aims, and Contributions

This study explores the integration of KGs and LLMs to address the key challenges in MKM. The central hypotheses of this work are as follows:
  • The combined use of LLMs and ontology-driven KGs can significantly reduce the manual effort involved in constructing domain-specific knowledge repositories.
  • Ontology-driven KGs improve the reasoning capabilities of LLMs by enforcing logical consistency and domain-specific constraints.
  • The integrated approach provides a solution for validating mathematical curricula, demonstrating feasibility for handling large document sets through automated extraction and offering measurable consistency checks.
Our main contributions are as follows:
  • A methodology that integrates KGs and LLMs for the extraction, validation, and management of mathematical knowledge.
  • The design of a lightweight ontology to represent the structure of mathematical statements and proofs, thereby enabling advanced reasoning tasks.
  • An automated extraction pipeline that processes structured LaTeX documents, allowing large-scale repositories—such as those from arXiv—to be efficiently incorporated into MKM systems.
  • A quantitative evaluation on a synthetic mathematical dataset, demonstrating the effectiveness of the information extraction pipeline (Section 3), adopting evaluation principles aligned with rigorous benchmarking efforts like SBI-RAG for math word problems [15].
The remainder of this paper is structured as follows: Section 2 details the materials and methods, including the KG construction, ontology design, information extraction process, and use case algorithms. Section 3 presents the results of the information extraction evaluation and demonstrates the system’s use cases. Section 4 discusses the findings, technological challenges, limitations, and future directions. Finally, Section 5 concludes the paper by summarizing the key contributions and implications.

2. Materials and Methods

2.1. KG Construction and Maintenance

The construction and maintenance of KGs represent a fundamental challenge in the fields of AI and knowledge representation and reasoning (KRR). As the volume and complexity of data continue to increase, the need for robust and scalable solutions for KG management has become increasingly critical [4].
Recent research has explored various strategies to address the challenges associated with KG generation and upkeep. A key focus has been the development of techniques to ensure the consistency and validity of KG data over time. Hogan et al. [16] propose a framework for managing the evolution of KGs, emphasizing versioning, change detection, and update propagation. Their work underscores the importance of preserving the integrity of KG data as new information is added or existing data is revised.
In parallel, researchers have developed methods for automatically generating and populating KGs from heterogeneous data sources. Xue et al. [4] provide a comprehensive review of the state of the art in KG construction, highlighting the integration of structured and unstructured data and the application of machine learning techniques in this process.
The integration of LLMs with KG construction has emerged as a particularly promising area of research. Pan et al. [9] investigate the synergies between LLMs and KGs, demonstrating how their combination can enhance the accuracy, interpretability, and reasoning capabilities of AI systems. Their findings have inspired the development of novel approaches that leverage LLMs to support KG construction and maintenance.
One promising approach is the use of chain of thought (CoT) prompting, as described by Wei et al. [17]. CoT prompting guides LLMs to generate intermediate reasoning steps, thereby fostering a deeper understanding of the task. By incorporating such reasoning processes, CoT prompting improves the interaction between LLMs and task-specific objectives, such as KG generation. This technique leverages the user’s initial input to elicit more structured and relevant responses from the LLM, ultimately enhancing the accuracy and relevance of the generated KG data.
In addition, several researchers have investigated the integration of active learning and human-in-the-loop architectures to support KG construction and maintenance. Xue et al. [4] emphasize the value of incorporating user feedback and domain expertise to improve the accuracy and completeness of KGs over time.

2.2. Validation, Quality, and Reasoning

The use of ontologies and the SPARQL query language (https://www.w3.org/TR/sparql11-overview/ (accessed on 20 February 2025)) in our work provides a scalable approach to managing and querying the KG, offering efficient mechanisms for leveraging LLMs on large-scale KGs. The ontology-based representation enables efficient storage and retrieval of mathematical knowledge, while SPARQL queries support complex reasoning and inference over the graph.
The formal structure of the ontology—comprising defined classes, properties, and constraints—helps to ensure the consistency and validity of the represented mathematical content, serving as a quality control mechanism in KG construction. This structured approach not only enhances the reliability of the knowledge representation but also enables more accurate and meaningful interactions between the KG and LLMs, ultimately improving the overall system performance in mathematical reasoning and knowledge management.
Additionally, although our current implementation does not explicitly optimize for large-scale deployments, the architecture can inherently support scalability by clearly separating concerns between the LLM-serving layer and the KG-querying layer. For instance, it is possible to integrate an efficient serving infrastructure, such as deploying LLMs behind a scalable inference stack like vLLM, which manages GPU resources through paged key–value caching [18]. Similarly, the KG querying could leverage distributed RDF databases, partitioning triples into shards managed by systems such as TriAD, which parallelizes SPARQL query execution across multiple nodes [19] and employs federation strategies inspired by FedX to minimize intermediate data transfer between shards [20]. Thus, the described architecture is inherently amenable to efficient distributed implementations, even though such optimizations remain beyond the scope of our current work.

2.3. Interfaces

The effective management and utilization of KGs require the development of user-friendly interfaces that support intuitive exploration, querying, and manipulation of the underlying data. As KGs grow in complexity and span diverse domains, the demand for accessible and interactive visualization tools has become increasingly important [21,22].
Previous research has emphasized the importance of designing interfaces that bridge the gap between the formal, structured nature of KGs and the cognitive processes of human users [23,24]. Our work addresses several of these considerations by providing the following:
  • An interactive web-based interface for querying, visualizing, and editing the KG.
  • Mechanisms for detecting and correcting errors in the extracted information, incorporating a basic human-in-the-loop approach.
While advanced features such as comprehensive KG versioning or active learning are beyond the scope of this study, the implemented web interface supports iterative improvements to data quality and lays the groundwork for future enhancements.

2.4. Information Extraction Process

The overall workflow of our information extraction system is illustrated in Figure 1. The system is designed to process LaTeX documents, extract blocks of mathematical statements (e.g., theorems, axioms, and definitions), identify relevant symbols or hypotheses, and incorporate them into a KG and ontology.
Below, we describe each stage of the pipeline in detail.

2.4.1. Stage 1: LaTeX Document Input Processing

The process begins with the input of one or more LaTeX documents. The system then
  • Applies a regular expression (RegEx) approach to identify structural elements, such as chapters, sections, and subsections.
  • Creates a hierarchical JSON representation to capture the document’s organization. For example:
    {
      "chapter1": {
        "sectionA": {
          "subsectionA1": "text of that subsection"
        }
      }
    }
This step ensures a clear representation of the LaTeX document structure prior to the extraction of specific mathematical statements.
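To make this stage concrete, the following Python sketch shows one possible implementation of the structural pass. It assumes standard \chapter/\section/\subsection commands and a well-nested document; the file name and exact patterns are illustrative, not the system’s actual implementation.

    import json
    import re

    # Matches standard LaTeX sectioning commands; assumes a well-nested
    # document that starts at \chapter (an illustrative simplification).
    HEADING = re.compile(r"\\(chapter|section|subsection)\{([^}]*)\}")
    LEVELS = {"chapter": 0, "section": 1, "subsection": 2}

    def build_hierarchy(tex: str) -> dict:
        """Nest the document into {chapter: {section: {subsection: text}}}."""
        root: dict = {}
        stack = [root]                      # stack[i] = container at depth i
        headings = list(HEADING.finditer(tex))
        for i, m in enumerate(headings):
            depth = LEVELS[m.group(1)]
            end = headings[i + 1].start() if i + 1 < len(headings) else len(tex)
            del stack[depth + 1:]           # pop back to the parent level
            if depth < 2:                   # chapters/sections hold children
                child: dict = {}
                stack[depth][m.group(2)] = child
                stack.append(child)
            else:                           # subsections hold the raw text
                stack[depth][m.group(2)] = tex[m.end():end].strip()
        return root

    print(json.dumps(build_hierarchy(open("document.tex").read()), indent=2))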

2.4.2. Stage 2: User Configuration

Users can configure the system by specifying the following:
  • Which LaTeX blocks to extract (e.g., Theorem, Axiom, and Definition).
  • Tags or labels to be assigned to specific chapters or sections for subsequent filtering.
This configuration is saved in a JSON file, such as
    {
      "hierarchy": { … },
      "latex": {
        "theorem": ["Theorem", "Corollary"]
      },
      "tags": {
        "chapter1": ["Infinitesimal Calculus"]
      }
    }

2.4.3. Stage 3: Text Block Extraction

For each lowest-level subdivision (e.g., subsection or paragraph), the system
  • Applies a targeted RegEx to detect the user-specified blocks, such as
    \begin{Theorem}…\end{Theorem}.
  • Returns the extracted blocks as an array of strings:
    ["\\begin{Theorem} … \\end{Theorem}",
     "\\begin{Axiom} … \\end{Axiom}", …]
This stage provides the foundation for subsequent analysis by isolating the mathematical blocks of interest.
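A minimal Python sketch of this stage follows; the environment names come from the Stage 2 configuration, and the non-greedy pattern is our assumption about how the targeted RegEx is built.

    import re

    def extract_blocks(text: str, envs: list[str]) -> list[str]:
        r"""Return every \begin{env}...\end{env} block for the selected environments."""
        blocks: list[str] = []
        for env in envs:
            # DOTALL lets '.' cross line breaks; the non-greedy '.*?' prevents
            # two consecutive blocks of the same environment from merging.
            pattern = r"\\begin\{%s\}.*?\\end\{%s\}" % (env, env)
            blocks += re.findall(pattern, text, flags=re.DOTALL)
        return blocks

    # e.g., extract_blocks(subsection_text, ["Theorem", "Axiom"])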

2.4.4. Stage 4: Mathematical Formula Retrieval

Our approach to retrieving definitions and statements relies on the following components:
  • Notational Assumptions: We assume consistent notation is maintained throughout the document—at least within the same mathematical domain—to ensure clarity and coherence.
  • Semantic Representation: LaTeX expressions are transformed into semantic forms by mapping symbolic expressions to generalized tokens (e.g., set_1, set_2, element_1).
  • Vector Representations: The preprocessed text is encoded into dense vectors using BERT-based models [25], including
    AnReu/math_pretrained_bert [26].
    math-similarity/Bert-MLM_arXiv-MP-class_zbMath [27].
    MathBerta [28].
  • Structure-Based Filtering: While symbolic layout trees or operator trees have been shown to improve retrieval precision [29], our approach prioritizes dense retrieval and does not rely on deep structural matching.
This method is particularly effective when LaTeX blocks contain short domain-specific statements, helping to ensure that retrieved definitions remain contextually relevant.
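As an illustration, the following sketch encodes a symbol-generalized statement with one of the models listed above. The mean-pooling strategy is our assumption, since the text does not fix a pooling scheme.

    import torch
    from transformers import AutoModel, AutoTokenizer

    MODEL = "AnReu/math_pretrained_bert"      # one of the models listed above
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModel.from_pretrained(MODEL)

    def embed(statement: str) -> torch.Tensor:
        """Encode a symbol-generalized statement as a dense vector."""
        inputs = tokenizer(statement, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state       # (1, seq, dim)
        mask = inputs["attention_mask"].unsqueeze(-1)        # skip padding
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling

    vec = embed("Let set_1 be a compact set and function_1 continuous on set_1.")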

2.4.5. Stage 5: Retrieval from a Vector Database

Once the relevant text blocks are converted into vector embeddings, they are
  • Indexed in a vector database, enabling efficient similarity-based searches.
  • Utilized within a Retrieval-Augmented Generation (RAG) framework [30], in which the LLM is dynamically provided with the most relevant definitions or statements based on the user’s query.
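A minimal sketch of this indexing-and-retrieval step is given below, using FAISS as a stand-in vector store (the deployed database may differ). It reuses the embed function sketched in Stage 4 and fixes k = 10 candidates per query, matching the retrieval setting evaluated in Section 3.1.

    import faiss
    import numpy as np

    DIM = 768                          # BERT-base embedding size
    index = faiss.IndexFlatIP(DIM)     # inner product on unit vectors = cosine
    stored: list[str] = []             # position i holds the i-th statement text

    def add_statements(statements: list[str]) -> None:
        vecs = np.vstack([embed(s).numpy() for s in statements]).astype("float32")
        faiss.normalize_L2(vecs)       # normalize so inner product = cosine
        index.add(vecs)
        stored.extend(statements)

    def retrieve(query: str, k: int = 10) -> list[str]:
        """Return the k most similar stored statements for RAG prompting."""
        q = embed(query).numpy().astype("float32")
        faiss.normalize_L2(q)
        _, ids = index.search(q, k)
        return [stored[i] for i in ids[0] if i != -1]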

2.4.6. Stage 6: Chain Process for Statement Analysis

We adopt a chain-of-thought approach [17] (chain) to analyze each extracted block:
  • Detecting the Type of Statement: The system classifies each statement (e.g., theorem, axiom, and definition) based on the categories specified by the user in Stage 2.
  • Symbol Extraction: A JSON-formatted list of symbols is generated, linking each MathematicalObject individual to its corresponding LaTeX representation. For example:
    [
      { "represents": "function", "represented": "f" },
      …
    ]
  • Identifying Hypotheses and Conclusions: For statements with an if–then structure, the system explicitly separates the premise (hypothesis) from the consequence (conclusion).
  • Creating Sub-Statements: In cases involving multiple clauses, the system decomposes compound implications into separate sub-statements, enabling more granular reasoning. This chain-based analysis also checks whether a statement was previously defined. If so, it is included among the retrieved statements from Stage 5; otherwise, it is classified under the UndefinedStatement class to maintain semantic rigor and traceability.
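The sketch below condenses this chain into three prompt calls. Here, llm stands for any text-completion callable (the deployed system serves models such as Llama 3.1), and the prompts are heavily abbreviated relative to the templates in Appendix A.

    import json

    def analyze_block(llm, block: str, categories: list[str]) -> dict:
        """Run the chained analysis steps on one extracted LaTeX block."""
        # Step 1: classify the statement against the user-defined categories.
        kind = llm(f"Classify this statement as one of {categories}:\n{block}").strip()
        # Step 2: extract the symbol table as JSON.
        symbols = json.loads(llm(
            "List each symbol as {\"represents\": ..., \"represented\": ...} "
            f"in a JSON array:\n{block}"))
        # Step 3: separate hypotheses from conclusions.
        parts = json.loads(llm(
            "Return {\"hypotheses\": [...], \"conclusions\": [...]} as JSON "
            f"for this statement:\n{block}"))
        return {"type": kind, "symbols": symbols, **parts}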
It is important to note that our current system primarily focuses on extracting explicitly stated knowledge and relationships present in the LaTeX source. The inference of implicit mathematical relationships (e.g., relationships not directly stated but logically derivable) is considered a direction for future work (see Section 4.4).

2.4.7. Stage 7: Final Ontology Verification

As the final step, newly extracted or updated statements are
  • Verified against the ontology’s constraints to ensure consistency and validity.
  • Stored in both the ontology and the vector database if they satisfy the consistency checks.
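A sketch of this verification step with owlready2 follows; the use of the bundled HermiT reasoner and the commit-on-consistency policy are our assumptions about one reasonable implementation.

    from owlready2 import get_ontology, sync_reasoner

    onto = get_ontology("TFG1.owx").load()

    def verify_and_commit(create_individuals) -> bool:
        """Add candidate statements, run the reasoner, and report consistency.

        `create_individuals` is a callback that instantiates the newly
        extracted statements inside the ontology; rolling back rejected
        additions is omitted here for brevity.
        """
        with onto:
            create_individuals(onto)
            sync_reasoner()             # runs the bundled HermiT reasoner
        # owlready2 maps unsatisfiable classes onto owl:Nothing.
        return not list(onto.world.inconsistent_classes())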

2.5. Front-End System Overview

A web-based user interface has been developed to streamline the workflows described above and to provide a comprehensive environment for MKM. Figure 2 presents the homepage of the interface, named Chatex.
The interface is organized into modular components, each tailored to a specific task in the MKM pipeline. Below, we describe the primary functionalities of these components.

2.5.1. Label Setup and Knowledge Base Population

Two main interface components support the extraction workflow:
  • Label Setup: Users can configure labels for mathematical statements or document sections through a dedicated web interface. An example of this functionality is illustrated in Figure 3, where existing entries can be viewed, and new entries can be added or updated within the database.
    Figure 3. Chatex interface for managing labels in the database, including options to view existing entries and add or update new ones.
  • Populate Knowledge Base: Users can upload LaTeX files, specify which environments to track (e.g., Theorem and Definition), configure label assignments, and monitor the extraction process in real time. Figure 4 shows the interface for uploading documents to the knowledge base. Figure 5 presents the configuration page where users select elements to extract and apply labels. Finally, Figure 6 displays the processing logs generated during information extraction.

2.5.2. Similar Statements and Graph Visualizations

Once statements are extracted and indexed, users can query the system for similar statements, as illustrated in Figure 7. Retrieved results from the vector database are presented within an interactive interface, allowing rapid exploration and classification of analogous or related theorems, definitions, and lemmas.
Additional interface modules are available for advanced functionality, as illustrated in Figure 8. The example shown corresponds to the Maximize Conclusions module. These modules are presented here to demonstrate the interface design; the specific algorithms underlying each use case are detailed in Section 2.8.

2.5.3. Ontology Visualization and Export

After populating the knowledge base, the system offers graph visualization capabilities:
  • Spring-Embedded and Shift Layouts: Spring-embedded layouts employ force-directed algorithms to emphasize clusters and reveal topological features. Shift layouts use geometric positioning to produce crossing-free drawings of planar graphs.
  • Hierarchical Layout: Based on the method of Sugiyama et al. [31], this layout organizes nodes into layers and reduces edge crossings, making it particularly suitable for visualizing logical inference flows.
An example of the ontology visualization interface is shown in Figure 9, displaying an RDF graph where predicates such as assumesThat and impliesThat can be selectively visualized using Spring Layout.
For advanced analysis, visualizations can be exported in GEXF format and imported into Gephi (https://gephi.org/ (accessed on 20 February 2025)), allowing for custom styling, complex filtering, and large-scale graph analytics.
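The sketch below illustrates one possible export path with networkx: implication edges selected from the KG (the selected_triples list is an illustrative stand-in for a SPARQL SELECT result) are laid out with a force-directed algorithm and written to GEXF for Gephi.

    import networkx as nx

    # Illustrative stand-in for triples selected from the KG via SPARQL.
    selected_triples = [
        ("Definition_Compact_Interval", "impliesThat", "Proposition_Bounded_Function"),
    ]

    G = nx.DiGraph()
    for subj, pred, obj in selected_triples:
        G.add_edge(str(subj), str(obj), predicate=str(pred))

    # Force-directed (spring-embedded) layout; seed fixed for reproducibility.
    pos = nx.spring_layout(G, seed=42)
    for node, (x, y) in pos.items():
        # GEXF 'viz' attributes carry node positions into Gephi.
        G.nodes[node]["viz"] = {"position": {"x": float(x), "y": float(y), "z": 0.0}}

    nx.write_gexf(G, "mkm_graph.gexf")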

2.6. On the Design of a Lightweight Ontology for MKM

A central contribution of this work is the development of a lightweight ontology, specifically designed to represent core elements of mathematical discourse in a machine-interpretable form. The ontology was implemented using the Web Ontology Language (OWL) and serves as the backbone of the system’s reasoning and validation capabilities. This section details its design objectives, class structure, logical foundations, and integration into the knowledge extraction workflow.

2.6.1. Scope, Purpose, and Intended Use

The ontology aims to formally model mathematical statements and their interrelations, including hypotheses, conclusions, definitions, proof steps, and mathematical objects. Its primary functions are
  • Enabling the storage and retrieval of formalized knowledge.
  • Supporting logical reasoning and validation of mathematical content.
  • Serving as a semantic interface between LLM-generated content and symbolic knowledge.

2.6.2. Ontology Construction and Versioning

The ontology was developed iteratively, guided by the information extraction pipeline’s needs and the logical dependencies encountered in formal mathematical texts. The process included the following:
  • Conceptual modeling: Identification of key conceptual categories in mathematical texts, such as Theorem, Definition, Axiom, Corollary, and ProofStep.
  • Class definition: These concepts were formalized into OWL classes, organized under a root class MathematicalEntity. Core classes include
    • MathematicalStatement: superclass of Theorem, Definition, Axiom, Lemma, and Corollary.
    • MathematicalDescriptor: includes Symbol, Notation, and MathematicalObject (e.g., Function, Set, and Number).
    • MathematicalStep: includes reasoning methods such as Deduction, Induction, and ReductionToProblem.
  • Property modeling: The following object properties were defined to capture logical structure:
    • assumesThat: links a statement to its hypotheses.
    • impliesThat: links a statement to its conclusions.
    • isProved / proves: bidirectional relation between statements and their proofs.
    • hasSymbol, represents, hasNotation: to connect symbolic representations with abstract entities.
  • Logical constraints: Domain and range restrictions, disjointness axioms, and inverse property assertions were added to ensure semantic consistency.
A simplified view of the inferred class hierarchy is shown in Figure 10. The resulting ontology is organized around the following key classes:
  • MathematicalStatement: Represents formal mathematical statements, including axioms, theorems, definitions, corollaries, and lemmas.
  • MathematicalDescriptor: Encodes components used to describe mathematical content, such as notations, mathematical objects (e.g., functions, sets, and vectors), symbols, and proofs.
  • MathematicalStep: Describes individual steps in a proof, including deduction, induction, recursion, and reductio ad absurdum.
The ontology defines several key properties to capture the logical relationships between mathematical statements and their components:
  • assumesThat and impliesThat: These properties relate mathematical statements to their respective hypotheses and conclusions, enabling the representation of logical implication chains within the ontology.
  • hasSymbol, hasNotation, and represents: These properties associate mathematical statements with the symbols, notations, and mathematical objects they reference, thereby supporting semantic linking and search.
  • isProved and proves: These reciprocal properties establish the relationship between a mathematical statement and its proof. Specifically, isProved connects a statement to its corresponding proof, while proves asserts that a given proof validates a particular statement.
Consistency is ensured through both logical reasoners and manual verification.
Figure 10. Inferred hierarchy (i.e., the class structure derived automatically from logical axioms and subclass relations) of an ontology of mathematical results. The visualization shows classes hierarchically organized under owl:Thing, including main categories such as MathematicalStatement, MathematicalStep, object, and MathematicalDescriptor. It highlights different kinds of mathematical objects (e.g., set, function, and number), reasoning methods (deduction, induction, and ReductionToProblem), and foundational results (axiom, theorem, corollary, and definition). Is-a relations indicate conceptual specialization between entities.
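As an illustration of how this structure can be declared programmatically, the following owlready2 sketch reproduces a small fragment of the classes and properties described above; the IRI is a placeholder, and the full ontology (TFG1.owx, Section 2.6.3) contains many more axioms.

    from owlready2 import ObjectProperty, Thing, get_ontology

    onto = get_ontology("http://example.org/mkm.owl")   # placeholder IRI

    with onto:
        class MathematicalEntity(Thing): pass
        class MathematicalStatement(MathematicalEntity): pass
        class Theorem(MathematicalStatement): pass
        class Definition(MathematicalStatement): pass
        class MathematicalDescriptor(MathematicalEntity): pass
        class Proof(MathematicalDescriptor): pass

        class assumesThat(ObjectProperty):      # statement -> its hypotheses
            domain = [MathematicalStatement]
            range = [MathematicalStatement]

        class impliesThat(ObjectProperty):      # statement -> its conclusions
            domain = [MathematicalStatement]
            range = [MathematicalStatement]

        class isProved(ObjectProperty):         # statement -> its proof
            domain = [MathematicalStatement]
            range = [Proof]

        class proves(ObjectProperty):           # declared inverse of isProved
            inverse_property = isProved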

2.6.3. Ontology Versioning and Availability

The current ontology version is 1.0 and is publicly available for reuse and extension. It can be accessed at https://gitlab.com/universidad4774909/tfg/chatex-webui/-/raw/main/TFG1.owx (accessed on 20 February 2025).

2.6.4. Integration with the Information Extraction Pipeline

The ontology is tightly integrated into the extraction pipeline and plays a role in each of the following stages:
  • Post-processing: Each extracted block is matched against the ontology schema to identify its class.
  • Semantic enrichment: Statements are linked to previously defined objects and notations.
  • Reasoning: Ontology reasoners are used to infer class memberships and check for logical coherence. For example, if a new Theorem is linked to undefined Function symbols, these are flagged and stored under the UndefinedEntity class for later validation.
  • SPARQL queries: These are used to detect implication chains, retrieve related definitions, and assess proof completeness.
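For example, implication chains can be detected with a SPARQL 1.1 property path; the sketch below uses rdflib, with a placeholder base IRI and file name, and an individual name taken from the output example in Section 2.8.1.

    from rdflib import Graph

    g = Graph()
    g.parse("knowledge_graph.rdf")          # placeholder export of the KG

    # All conclusions reachable from a statement through one or more
    # impliesThat edges ('+' is a SPARQL 1.1 property-path operator).
    QUERY = """
    PREFIX mkm: <http://example.org/mkm.owl#>
    SELECT DISTINCT ?conclusion WHERE {
        mkm:Definition_Compact_Interval mkm:impliesThat+ ?conclusion .
    }
    """
    for row in g.query(QUERY):
        print(row.conclusion)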

2.6.5. Ontology Reasoning and Mathematical Information Extraction

The ontology is designed to work in conjunction with a logical reasoner, which ensures the consistency of the knowledge base and facilitates the inference of new relationships between mathematical entities. This reasoning capability enables
  • Detection of logical inconsistencies or errors within stored statements.
  • Deduction of implicit relationships and knowledge from the existing ontology structure.
  • Validation of newly extracted statements by comparing them against established logical rules and semantic links.
To populate the ontology with mathematical knowledge, novel techniques were developed to extract information directly from LaTeX documents. Leveraging LLMs in combination with LangChain processing pipelines, the system can automatically identify and structure core components of mathematical discourse—including hypotheses, conclusions, and proof steps. This structured information is then integrated into the ontology, thereby enriching its content and enabling more sophisticated reasoning over the KG.

2.7. Prompt Templates

Throughout the information extraction process, we employ specialized prompt templates for LLMs to ensure consistency, accuracy, and relevance in the model outputs. Key considerations include
  • Prompt Format: Different LLMs (e.g., llama3.1 or command-r) may require different prompt structures and formatting conventions [32].
  • Few-Shot Examples: Where possible, prompts are augmented with annotated examples that demonstrate how to parse mathematical statements or identify hypotheses.
  • Self-Reflection Approaches: Drawing on recent advances in reasoning strategies [33,34,35], prompts may include instructions that encourage the model to re-check or refine its intermediate outputs.
  • Symbol Delimitation: Explicit guidance is provided to help the model distinguish between symbols and text, facilitating more robust semantic extraction.
By carefully designing these prompt templates and chaining multiple tasks (e.g., identifying statement type and extracting premises), we reduce inconsistencies and ensure high-quality representations of mathematical content in the KG and ontology. A representative prompt template is provided in Appendix A, and the complete set of carefully engineered templates tailored for Llama 3.1 is available at https://gitlab.com/universidad4774909/tfg/chatex-webui (accessed on 20 February 2025), facilitating reproducibility and further exploration.

2.8. Use Case Examples and Algorithmic Implementations

The integrated system supports a variety of use cases in MKM by combining graph-based algorithms with ontology-driven reasoning. This subsection outlines the principal algorithms implemented and their applications.

2.8.1. Optimization of Hypotheses

Objective
Given a set of hypotheses $H = \{h_1, h_2, \ldots, h_n\}$ and conclusions $C = \{c_1, c_2, \ldots, c_m\}$, all categorized under the definition class ($H, C \subseteq \mathit{Definition}$), the algorithm identifies the minimal set of additional hypotheses required to derive all conclusions in $C$.
Algorithmic Approach
Inspired by depth-first search (DFS), the algorithm recursively explores
  • Which definitions or propositions in the ontology imply a given conclusion.
  • The prerequisite assumptions needed for each implication to hold.
When multiple inference paths are available, the algorithm selects the one requiring the fewest additional hypotheses.
Output
The algorithm returns a mapping from each conclusion to the minimal set of required assumptions. For example:
    {
      "Proposition1": [
        "Definition_Anonymous",
        "Definition_Compact_Interval"
      ],
      "Proposition3": [
        "Definition_Anonymous"
      ]
    }
Use Case
In an educational setting, this approach identifies the minimal prior knowledge needed to teach a specific concept. For instance, deriving the definition of a “bounded function” may require understanding of “rational functions,” “compact intervals,” and “denominator properties.” Similarly, the algorithm could analyze the minimal prerequisites for understanding Lebesgue integration, identifying that students should first be familiar with “measure theory,” “series convergence,” and “multivariable calculus,” following the structure found in second-year courses at the University of Seville [36].
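A simplified Python sketch of this search is given below. The implied_by mapping, which lists the alternative premise sets for each statement, abstracts the underlying assumesThat/impliesThat lookups; choosing the smallest premise set per conclusion mirrors the DFS strategy described above.

    def minimal_assumptions(goal: str, known: frozenset, implied_by: dict,
                            seen: frozenset = frozenset()) -> set:
        """Smallest set of extra hypotheses needed to derive `goal`."""
        if goal in known:
            return set()                      # already assumed or derived
        if goal in seen or goal not in implied_by:
            return {goal}                     # underivable: assume it directly
        best = None
        for premises in implied_by[goal]:     # each alternative inference path
            cost: set = set()
            for p in premises:                # DFS into every prerequisite
                cost |= minimal_assumptions(p, known, implied_by, seen | {goal})
            if best is None or len(cost) < len(best):
                best = cost
        return best if best is not None else {goal}

    implied_by = {"Bounded_Function": [("Rational_Function", "Compact_Interval")]}
    print(minimal_assumptions("Bounded_Function", frozenset({"Compact_Interval"}),
                              implied_by))    # -> {'Rational_Function'}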

2.8.2. Maximization of Conclusions

Objective
Given a set of hypotheses $H = \{h_1, h_2, \ldots, h_n\}$, the algorithm computes the maximal set of conclusions $C = \{c_1, c_2, \ldots, c_m\}$ derivable via the ontology’s impliesThat relationships.
Algorithmic Approach
Using a breadth-first search (BFS) strategy, the algorithm
  • Initializes a queue with all hypotheses in H.
  • Iteratively retrieves all conclusions implied by each hypothesis.
  • Adds newly derived conclusions to the queue if they have not yet been processed.
Output
The result is the full set of reachable conclusions. Optionally, the derivation process can be visualized as a graph, with nodes representing statements and edges representing implication relationships.
Use Case
This algorithm assists with curriculum design by identifying the full range of concepts implied by a set of foundational topics. For example, starting with “rational functions” and “compact intervals,” it may derive “bounded functions” and related concepts, making explicit the implications of the selected curriculum base. Additionally, in a real educational scenario such as the third-year Algebraic Structures course at the University of Seville [36], instructors could input foundational topics like “Basic Algebra” and “Discrete Mathematics” to verify and visualize all higher-level algebraic concepts accessible to students after completion of these introductory courses.
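The BFS closure translates directly into a few lines of Python, as sketched below; implies maps each statement to the conclusions linked to it via impliesThat and stands in for the underlying SPARQL lookups, and the concept names are illustrative.

    from collections import deque

    def derivable_conclusions(hypotheses: set[str],
                              implies: dict[str, set[str]]) -> set[str]:
        """Maximal set of conclusions reachable from the given hypotheses."""
        reached: set[str] = set()
        queue = deque(hypotheses)                 # initialize with all of H
        while queue:
            stmt = queue.popleft()
            for concl in implies.get(stmt, ()):   # conclusions implied by stmt
                if concl not in reached:          # process each node only once
                    reached.add(concl)
                    queue.append(concl)
        return reached

    implies = {"Rational_Function": {"Bounded_Function"},
               "Bounded_Function": {"Integrable_Function"}}
    print(derivable_conclusions({"Rational_Function"}, implies))
    # -> {'Bounded_Function', 'Integrable_Function'}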

2.8.3. Pseudo-Demonstrations

Objective
Given a set of hypotheses $H = \{h_1, h_2, \ldots, h_n\}$ and conclusions $C = \{c_1, c_2, \ldots, c_m\}$, the algorithm constructs a minimal hierarchical graph (pseudo-demonstration) that connects $H$ to $C$ via the shortest possible chain of impliesThat relationships.
Algorithmic Approach
A modified Dijkstra’s algorithm [37] is used as follows:
  • Each implication edge is assigned a uniform weight of 1.
  • The algorithm searches for the shortest paths from elements in H to each target conclusion in C.
  • If a conclusion is unreachable, the system flags it as such.
Output
The output is a hierarchical graph that illustrates the minimal implication paths from H to C. Unreachable conclusions are clearly indicated.
Use Case
This tool supports logical reasoning and instructional design by visualizing the step-by-step progression needed to teach or derive a concept. For instance, it might show that the concept of a “bounded function” follows from prior knowledge of “rational functions,” “compact intervals,” and “denominator properties.” Similarly, in theorem-proving exercises, the pseudo-demonstration feature can help educators to visually demonstrate how certain propositions in Complex Analysis—such as Cauchy’s Integral Theorem—logically depend on previously established concepts like holomorphic functions and path integrals. This closely mirrors the pedagogical structure typically followed in Complex Analysis courses. The benefits are twofold: (1) students can verify whether they have already learned all prerequisite statements needed to understand a new concept, and (2) both students and instructors can easily design and evaluate a sequential study plan. In particular, this corresponds to a topological ordering as defined in [38], where each node (i.e., concept or statement) is only introduced after all its prerequisite nodes have been covered.
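A sketch of the modified search follows. With every impliesThat edge weighted 1, Dijkstra’s algorithm degenerates to BFS, but the priority queue is kept so that non-uniform weights could be introduced later; the implies mapping again stands in for the KG lookups.

    import heapq

    def shortest_implication_path(hypotheses: set[str], goal: str,
                                  implies: dict[str, set[str]]):
        """Shortest impliesThat chain from any hypothesis to `goal`,
        or None if the conclusion is unreachable (and should be flagged)."""
        heap = [(0, h, [h]) for h in sorted(hypotheses)]
        heapq.heapify(heap)
        visited: set[str] = set()
        while heap:
            dist, stmt, path = heapq.heappop(heap)
            if stmt == goal:
                return path                   # e.g., ['hypothesis', ..., goal]
            if stmt in visited:
                continue
            visited.add(stmt)
            for nxt in implies.get(stmt, ()):
                heapq.heappush(heap, (dist + 1, nxt, path + [nxt]))
        return None                           # unreachable: flag the conclusion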

2.8.4. Academic Coherence and Course Planning

Objective
To support the design of academically coherent courses by analyzing the logical dependencies among mathematical concepts. Specific goals include
  • Identifying concepts already covered in previous courses.
  • Detecting missing prerequisites necessary to meet learning outcomes.
  • Constructing pseudo-demonstrations that connect prior knowledge to new objectives.
Methodology
The course planning algorithm follows three main stages:
  • Minimal Known Hypotheses: Filter out derivable content already covered in prior coursework to isolate the essential hypotheses.
  • Hypothesis Minimization: Apply the Optimization of Hypotheses algorithm (Section 2.8.1) to determine the minimal set of additional prerequisites needed.
  • Pseudo-demonstration Construction: Use the pseudo-demonstrations algorithm (Section 2.8.3) to build a hierarchical graph linking existing knowledge to new course content.
Use Case
Instructors can use this method to design courses that build systematically on students’ prior knowledge. By identifying gaps, reducing redundancy, and visually mapping dependencies, educators can ensure that course content follows a clear, logically grounded progression. For instance, this method can be practically applied to designing a coherent sequence of second- to third-year courses in the University of Seville mathematics program [36], clearly identifying and visually documenting how core second-year topics—such as “topology” and “Lebesgue integration”—serve as essential foundations for third-year courses like “Geometry and Topology of Surfaces.” For example, the Classification Theorem for Compact Surfaces depends on concepts such as the fundamental group and orientability, introduced in topology, while the Gauss–Bonnet Theorem requires measure-theoretic tools from Lebesgue integration. This approach makes it easy to analyze whether two courses can be taught simultaneously or if one must precede the other due to dependencies in their key statements.

2.9. Model Selection Criteria

Although the selection of LLMs is not the focus of this work, we identified certain capabilities as essential for the development of our framework: (1) mathematical reasoning, assessed using the Omni-math Benchmark [39]; and (2) faithfulness, evaluated via the Faith-Eval framework [40]. While our current implementation leverages these tools to inform LLM use, a comprehensive evaluation tailored to our specific application domain is needed for deployment in a production setting. Future work should include benchmarking these capabilities using domain-specific datasets, potentially integrating tools such as Ragas (https://docs.ragas.io/en/stable/ (accessed on 20 February 2025)) to assess reliability in retrieval-augmented generation pipelines.
It is important to note that our evaluation exclusively utilized open-source LLMs that were accessible and deployable on our local server infrastructure at the time the experiments were conducted.

3. Results

This section presents the evaluation of the proposed framework, focusing on two key aspects: (1) the quantitative effectiveness of the automated information extraction (IE) pipeline, and (2) a qualitative demonstration of the system’s utility through its core use cases.

3.1. Information Extraction Effectiveness

To rigorously assess the performance of the IE process, specifically the extraction of hypotheses and conclusions from mathematical statements, we developed a synthetic benchmark dataset. This dataset was generated using a Python script employing LLMs to create diverse mathematical statements across various fields, followed by manual correction and validation by the authors to ensure logical soundness and correctness. The generation script and resulting dataset are publicly available (https://gitlab.com/academic15/llm-synthetic-data-generation-mathematical-conlcusions-and-hypotheses (accessed on 20 February 2025)). An example entry from this dataset, illustrating the structured format, including definitions, hypotheses, conclusions, and field labels, is shown in Figure 11.
Following the guidelines outlined in [41,42], we designed a Proof of Concept (PoC) benchmark to evaluate IE performance. The benchmark adheres to the following principles:
  • High performance on the benchmark should indicate robust in-domain task performance.
  • Examples must be clearly annotated and unambiguous.
  • Test samples should undergo thorough validation to eliminate erroneous or ambiguous cases.
  • The dataset must provide sufficient statistical power for rigorous evaluation.
  • The benchmark should identify and discourage the development of biased models by exposing potential harmful biases.
The dataset includes data evenly distributed across multiple branches of mathematics. It contains a diverse set of mathematical definitions, hypotheses, and conclusions from domains such as real analysis and group theory. Figure 11 shows a sample entry in JSON format, illustrating a theorem about finite cyclic groups within the domain of group theory.
Figure 12 presents the composition of the synthetic dataset, showing the number of samples distributed across different branches of mathematics.
The evaluation scripts and additional benchmarking tools are publicly available at https://gitlab.com/academic15/custom-benchmarking-llms (accessed on 20 February 2025).
First, we evaluated different embedding models for the task of retrieving relevant definitions, a crucial step in providing context to the LLMs (Stages 4 and 5 of the IE process, Section 2.4.4 and Section 2.4.5). Table 1 compares the performance of various embedding models in retrieving relevant definitions using the PoC dataset. The evaluation is based on accuracy and recall: because the models were configured to return exactly 10 candidate definitions per query for use in downstream LLM prompts, precision is not reported. The results indicate moderate performance, with math-specific pre-trained models such as math_pretrained_bert [26] and Bert-MLM_arXiv [27] showing a slight advantage, underscoring the challenge of semantic retrieval in the mathematical domain.
Next, we evaluate and compare the performance of LLMs categorized by their size into three groups: small (2–3B parameters), medium (7–9B), and large (35–72B). This evaluation focuses on the core IE task: extracting hypotheses and conclusions based on provided context (definitions) and theorem statements. Performance was measured using standard metrics: accuracy, precision, recall, and F1 score, assessing the exact match of extracted hypothesis/conclusion sets against the ground truth.

3.1.1. Small LLMs

This category includes llama3.2:3B [43], phi3.5 [44], and gemma2:2B [45]. Table 2 summarizes their performance. Phi3.5 achieved the highest F1 score in this category, suggesting strong capability among smaller models for this structured extraction task.

3.1.2. Medium LLMs

This category includes mathstral [46], qwen2.5:7B, qwen2-math:7B [47,48], gemma2:9B [45], llama3:8B-instruct, and llama3.1:8B [49]. The results are presented in Table 3. Those models specifically fine-tuned for mathematics (mathstral and qwen2-math:7B) generally performed well, with F1 scores around 0.74. This suggests that domain-specific training enhances performance on this structured extraction task, although the overall scores indicate that the task remains challenging.

3.1.3. Large LLMs

This category includes mixtral:8x7B-instruct-v0.1-fp16 [50], command-r:35B, qwen1.5:72B [51], llama2:70B-chat [52], llama3:70B-instruct, and llama3.1:70B [49]. Table 4 shows their performance. Interestingly, the largest models did not significantly outperform the medium or even the best small models on this specific task, with F1 scores remaining in the 0.73–0.75 range. Qwen1.5:72B showed strong performance, comparable to the best small model. This suggests that, for structured extraction tasks like this, model scale beyond a certain point may yield diminishing returns compared to factors like fine-tuning or architectural choices. The achieved F1 scores, while respectable, indicate that errors still occur, likely due to the challenges in handling complex logical structures, parsing nuances in mathematical language, and potential LLM limitations like hallucination or reasoning gaps, as discussed in Section 4.1.1.
Figure 13 shows the mean execution time required by each large language model (LLM) to compute hypotheses and conclusions across the evaluation set. As expected, larger models incur significantly higher processing times, highlighting a crucial trade-off between model size, potential capability (although not always realized for this task), and practical deployment constraints.
Overall, the IE evaluation demonstrates the feasibility of using LLMs for extracting structured mathematical information, achieving F1 scores of up to approximately 0.75 on our benchmark. While this level of performance is promising for automating KG population, it also indicates the need for robust validation mechanisms (such as the ontology constraints and the human-in-the-loop verification discussed earlier) to handle residual errors, especially given the high precision required in mathematical domains.

3.2. Use Case Demonstration

A complete walkthrough of the information extraction pipeline is available via a demonstration video: https://youtu.be/85kChrAtK1k (accessed on 20 February 2025). The use cases discussed in Section 2.8 are illustrated in an additional demo: https://youtu.be/tq-W4QAR1_s (accessed on 20 February 2025).
This demonstration highlights the following key steps:
  • Label creation for filtering extracted statements.
  • Ontology population from a well-structured LaTeX document.
  • Visualization of the resulting KG using spring-embedded and planar layouts.
  • Application of reasoning algorithms and visualization using the Sugiyama hierarchical layout.
  • Ontology inspection and validation via the Protégé platform.
In summary, the use case demonstrations confirmed that the integrated system, combining LLM-based extraction with ontology-driven reasoning and visualization, provides a functional and potentially valuable tool for MKM tasks, such as curriculum analysis, knowledge exploration, and consistency checking.

4. Discussion

This work presents a comprehensive exploration of integrating LLMs with an ontology-based approach to construct KGs tailored for MKM. Throughout the project, several key challenges and opportunities were identified.

4.1. Technological Challenges and Limitations

This subsection outlines the primary technological challenges encountered during development, highlighting their implications and potential limitations.

4.1.1. Limitations of LLMs

Our experiments with diverse LLMs revealed several persistent limitations:
  • Hallucinations: Fabrications of intermediate results, definitions, and proofs persisted despite meticulous prompt engineering. This behavior aligns with recent findings [5], which report pervasive hallucination across model sizes and architectures. Nevertheless, the integration of ontologies can impose domain-specific constraints, helping to delineate output boundaries in line with the Knowledge-Controlled Generation paradigm [53]. Moreover, the structured nature of the KG and subsequent validation steps offer potential mechanisms for identifying and mitigating some inconsistencies introduced by hallucinated content, although such challenges are not entirely eliminated.
  • Symbolic Reasoning Deficiencies: Consistent with prior studies [10,54,55], the LLMs exhibited significant limitations in symbolic and multi-step mathematical reasoning. Their performance was often fragile, with minor semantically irrelevant changes (e.g., altering numerical values or introducing extraneous information) substantially impacting the outcomes, suggesting reliance on superficial pattern matching rather than robust logical inference [10]. Furthermore, the models frequently failed at compositional reasoning, struggling to spontaneously combine known concepts to solve novel problems involving logical traps [54]. Manual analyses confirm that these failures often stem from flawed logical chains, unwarranted assumptions, and difficulties in translating physical intuition into formalized steps [55]. Notably, although expanded context windows permit the inclusion of additional relevant statements, doing so did not consistently enhance reasoning performance, underscoring a fundamental limitation in flexible knowledge integration rather than a mere context size constraint.
  • Parameter and Language Constraints: Models with fewer than 5 billion parameters demonstrated limited capacity for parsing structured mathematical input and exhibited strong reliance on English-language prompts. These constraints reduced the performance robustness across diverse input scenarios. Nonetheless, recent advances in small language models suggest potential for future improvements [56].
  • Dependency on Structured Input: Our system requires well-formatted LaTeX documents, limiting applicability to sources with clean standardized structures. Broader issues of unstructured document parsing, including neglect of tables, diagrams, and semantic metadata, remain challenging [57].

4.1.2. Challenges in Document Conversion and Standardization

Converting mathematical content from diverse sources, especially PDF documents, presented several difficulties:
  • Conversion Quality: Many available tools and APIs generate cluttered LaTeX outputs that include unnecessary style elements, making it harder to identify structural components like theorems and definitions [58].
  • OCR Limitations: Although advanced OCR models (e.g., [59]) were evaluated, they failed to provide the required precision and structure necessary for reliable extraction of complex mathematical notations.
  • Heterogeneity of LaTeX Formats: The lack of a standardized LaTeX format significantly complicates automated extraction. The “ideal” LaTeX template we developed served as a controlled format to mitigate this issue, but widespread adoption of such a format remains a challenge.

4.1.3. Ontology Design and Prompt Engineering

A lightweight ontology, developed using OWL, was central to the structured representation of mathematical knowledge. It supports
  • Enhanced consistency checking for newly extracted statements.
  • Clear mapping of semantic relationships among theorems, definitions, and numerical methods.
  • Integration with reasoning engines for validation, inference, and logical navigation through mathematical content.
In addition, prompt engineering played a crucial role in guiding LLM behavior. Techniques such as few-shot prompting, chain-of-thought reasoning, and structured symbol delimitation were used to mitigate errors and improve the semantic precision of extracted content [33,60].

4.1.4. Scope of Knowledge Extraction

Our current approach primarily extracts explicitly stated knowledge and relies heavily on the information present within the source LaTeX documents. Systematically identifying potentially missing entities or relationships (e.g., concepts used but not formally defined within the processed text, and implicit logical dependencies) remains a significant challenge beyond simple label resolution. While parsing and validation steps catch some structural inconsistencies, comprehensive detection of semantic knowledge gaps is non-trivial. We currently assume that input documents largely follow a hierarchical dependency structure, aiding in the identification of some structural omissions, but acknowledge this limitation in capturing potentially implicit or missing knowledge, which is a key area for future work (see Section 4.4).

4.2. Cognitive Computing

The integration of symbolic (ontology-based) and subsymbolic (LLM-based) methods reflects a cognitive computing paradigm with clear implications for MKM. By emulating human cognitive processes—combining formal logical reasoning with intuitive language-based inference—this hybrid approach enhances
  • The interpretability of complex mathematical models.
  • The robustness and transparency of decision-making in knowledge-intensive environments.

4.3. Expanding the Methodology to Other Domains

Although this work focuses on MKM, the proposed methodology—integrating LLMs with ontology-guided KG construction—is adaptable to other domains.
In medicine, Arsenyan et al. [61] used LLMs to extract structured information from clinical notes, generating high-quality KGs for applications such as diagnostics and drug discovery. In law, Feng et al. [62] constructed KGs by aligning legal relations with domain ontologies and Wikidata, ensuring semantic consistency and interoperability.
In light of these findings, our methodology can be adapted to other domains by modifying three core components:
  • Domain-specific ontology design: Replace the mathematical ontology with a structured vocabulary of entities and relations relevant to the target domain (e.g., biomedical conditions and legal concepts).
  • Prompt specialization: Tailor prompt templates to reflect domain-relevant discourse structures and knowledge types using strategies like chain-of-thought prompting.
  • Validation frameworks: Incorporate domain-appropriate quality checks, such as expert review for clinical safety or consistency audits for legal correctness.
These adaptations enable the construction of precise, interpretable, and domain-aware KGs suitable for high-stakes applications beyond MKM.

4.4. Future Directions

Several promising avenues for future research have been identified:
  • Ontology Expansion: Expand the ontology to include mathematical and engineering concepts that support the direct execution of simulations or numerical methods when necessary.
  • Improved Document Conversion: Develop AI-driven OCR pipelines, inspired by systems like DocLing [63], to reliably convert mathematical PDFs into structured LaTeX. This advancement would broaden the pipeline’s applicability to legacy and unstructured sources [57].
  • Integration with Theorem Provers: Incorporate formal proof assistants (e.g., Lean and Coq) to validate and verify the correctness of extracted mathematical statements, strengthening the logical soundness of the knowledge base. Recent studies have shown impressive improvements in model performance when these tools are incorporated into the LLM reasoning process [64,65].
  • Advanced Prompt Engineering and Reasoning Model Integration: Develop and integrate prompt templates and reasoning strategies that leverage cutting-edge reasoning models [66], which perform multi-step internal planning to excel at advanced mathematics and science tasks and to reduce LLM hallucinations [67]. This can enhance semantic fidelity and yield richer symbolic representations.
  • Bias Mitigation: Investigate and address the biases present in LLM outputs to ensure neutrality, reliability, and fairness in extracted content, particularly when deployed in educational or decision-support systems.
  • Dynamic KG Maintenance: Addressing the dynamic nature of mathematical knowledge (e.g., revised definitions and evolving curricula) is crucial. Future work will explore a pipeline involving (1) versioned storage of triples with validity intervals; (2) change-triggered re-indexing based on fine-grained dependency tracking; (3) automated outdated-fact detection, potentially adapting techniques like Deep Outdated Fact Detection [68], which leverage structural and textual cues; and (4) minimal-change revision mechanisms, including archiving or consistent replacement of flagged triples to maintain KG integrity.
  • KG Completion and Implicit Reasoning: A significant direction is enhancing KG completeness and inferring implicit knowledge beyond explicitly stated facts. Future work includes exploring text-enhanced KG completion methods. For instance, SimKGC [69] leverages contrastive learning on entity descriptions and is a natural baseline for our text-rich LaTeX sources. More recent neighborhood- and relation-aware models—KGC-ERC [70] and RAA-KGC [71]—set a new state of the art and should also be considered. To adapt these approaches to the mathematical domain, they should be combined with our mathematical-formula-retrieval pipeline (Section 2.4.4) so that symbols and equations provide additional domain-specific context for entities and relations.

5. Conclusions

This study has introduced and evaluated a novel framework integrating LLMs with an ontology-based approach for constructing and managing KGs specifically tailored for MKM. By automating the extraction, semantic organization, and logical verification of mathematical knowledge from structured LaTeX documents, our system offers a promising solution to the challenges of scalability, consistency, and precision in handling complex mathematical content.

5.1. Key Contributions

The primary contributions of this work can be summarized as follows:
  • Ontology-Driven Knowledge Modeling: We developed a lightweight yet expressive ontology capable of representing core mathematical structures, including hypotheses, theorems, and proofs. This structured representation underpins the system’s ability to perform advanced reasoning, validate curriculum coherence, and retrieve relevant methods accurately [1,2]; an illustrative sketch follows this list.
  • Automated Information Extraction Pipeline: A robust pipeline was implemented, leveraging state-of-the-art LLMs and vector retrieval models to accurately extract and structure formal statements from LaTeX source documents. This automation enables the ingestion and management of knowledge from large-scale repositories such as arXiv [9,10].
  • Hybrid Cognitive Computing Framework: The integration of probabilistic LLM-based generation with symbolic ontology-based reasoning embodies a hybrid AI approach. This synergy bridges the gap between data-driven pattern recognition and formal logic, resulting in a knowledge representation framework that is potentially more robust, explainable, and suitable for knowledge-intensive applications in education and research [72].
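To make the first contribution concrete, the snippet below sketches how a theorem, its hypothesis, and its conclusion could be expressed as RDF triples with rdflib, using the assumesThat and impliesThat predicates visible in the ontology visualization interface (Figure 9). The namespace IRI, entity names, and labels are hypothetical placeholders; the authoritative class and property definitions are those in the published ontology file.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

MKM = Namespace("http://example.org/mkm#")  # placeholder IRI, not the published one

g = Graph()
g.bind("mkm", MKM)

# A theorem node linked to one hypothesis and one conclusion.
g.add((MKM.LagrangeTheorem, RDF.type, MKM.Theorem))
g.add((MKM.H1, RDF.type, MKM.Hypothesis))
g.add((MKM.H1, RDFS.label, Literal("G is a finite group and H is a subgroup of G")))
g.add((MKM.C1, RDF.type, MKM.Conclusion))
g.add((MKM.C1, RDFS.label, Literal("the order of H divides the order of G")))
g.add((MKM.LagrangeTheorem, MKM.assumesThat, MKM.H1))
g.add((MKM.LagrangeTheorem, MKM.impliesThat, MKM.C1))

# Semantic query: which conclusions follow from hypothesis H1?
query = """
SELECT ?label WHERE {
    ?thm mkm:assumesThat mkm:H1 ;
         mkm:impliesThat ?concl .
    ?concl rdfs:label ?label .
}
"""
for row in g.query(query, initNs={"mkm": MKM, "rdfs": RDFS}):
    print(row.label)
```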

5.2. Implications for MKM

The practical implications of this framework are multifaceted, offering tangible benefits for MKM:
  • Curriculum Validation and Model Verification: The system provides tools to check that educational materials introduce concepts in a valid prerequisite order and remain logically consistent; a sketch of such a check follows this list.
  • Intelligent Tutoring Systems: By enabling the generation of pseudo-demonstrations and the verification of prerequisite knowledge, the framework can serve as a valuable component in developing sophisticated AI-driven educational platforms.
  • Storage and Retrieval of Precise Methods: The KG acts as a repository for rigorously defined computational methods, such as precise algorithms implementing mathematical definitions or verifying properties, explicitly linked to the corresponding formal statements. This structured storage allows for the validation of computational implementations against their specifications, facilitates reliable retrieval, and promotes reproducibility and methodological rigor.
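As an illustration of the curriculum validation mentioned above, the sketch below orders concepts with Kahn’s topological sorting algorithm [38], reporting a cycle when the prerequisite structure is inconsistent. The function name and the mini-curriculum are hypothetical; the actual system operates on the dependency relations stored in the KG.

```python
from collections import deque

def validate_curriculum(prereqs: dict[str, list[str]]) -> list[str]:
    """Kahn-style topological sort: returns a teaching order in which every
    concept appears after all of its prerequisites, or raises an error if
    the dependency graph contains a cycle (an inconsistent curriculum)."""
    # Compute in-degrees; concepts mentioned only as prerequisites count too.
    indegree = {c: 0 for c in prereqs}
    for deps in prereqs.values():
        for d in deps:
            indegree.setdefault(d, 0)
    for concept, deps in prereqs.items():
        indegree[concept] = len(deps)
    # Reverse adjacency: prerequisite -> concepts that depend on it.
    dependents: dict[str, list[str]] = {c: [] for c in indegree}
    for concept, deps in prereqs.items():
        for d in deps:
            dependents[d].append(concept)
    ready = deque(c for c, deg in indegree.items() if deg == 0)
    order: list[str] = []
    while ready:
        c = ready.popleft()
        order.append(c)
        for dep in dependents[c]:
            indegree[dep] -= 1
            if indegree[dep] == 0:
                ready.append(dep)
    if len(order) != len(indegree):
        raise ValueError("Cycle detected: curriculum prerequisites are inconsistent")
    return order

# Hypothetical mini-curriculum: each concept lists what must be taught first.
print(validate_curriculum({
    "limits": [],
    "derivatives": ["limits"],
    "integrals": ["limits", "derivatives"],
}))  # -> ['limits', 'derivatives', 'integrals']
```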

5.3. Acknowledged Limitations and Path Forward

While the results demonstrate the potential of our approach, realizing its full capabilities requires addressing certain limitations inherent in the current implementation and the underlying technologies. As detailed in the Discussion (Section 4.1.1), these include the system’s current dependence on well-structured LaTeX input, the persistent challenges associated with LLM performance in deep symbolic reasoning and potential for hallucination (despite mitigation strategies), and the necessarily bounded scope of the current ontology.
Our future research agenda, outlined in Section 4.4, is designed to directly address these points and enhance the system’s capabilities. The key directions include significantly expanding the ontology’s coverage into more specialized mathematical and engineering domains, developing more robust document conversion techniques (particularly for PDF sources), integrating formal verification tools like automated theorem provers to bolster logical soundness, and further refining LLM interaction strategies through advanced prompting and reasoning model integration to improve semantic fidelity and mitigate biases.

5.4. Final Remarks

In summary, this research demonstrates the significant potential of synergizing LLMs and ontology-driven KGs for advancing MKM. The focus on formal representation, automated and scalable information extraction, and logical coherence provides a powerful foundation for managing mathematical knowledge effectively, applicable in educational and research contexts. As symbolic and subsymbolic AI paradigms continue to evolve, this hybrid approach offers a promising trajectory for building more intelligent, interpretable, and domain-aware knowledge management systems.

Author Contributions

Conceptualization, A.L.-S. and J.B.-D.; methodology, A.L.-S. and J.B.-D.; software, A.L.-S.; validation, A.L.-S. and J.B.-D.; formal analysis, A.L.-S. and J.B.-D.; investigation, A.L.-S. and J.B.-D.; resources, A.L.-S. and J.B.-D.; data curation, A.L.-S.; writing—original draft preparation, A.L.-S. and J.B.-D.; writing—review and editing, A.L.-S. and J.B.-D.; visualization, A.L.-S.; supervision, J.B.-D.; project administration, A.L.-S. and J.B.-D.; funding acquisition, J.B.-D. All authors have read and agreed to the published version of the manuscript.

Funding

Grant PID2023-147198NB-I00 funded by MICIU/AEI/10.13039/501100011033 (Agencia Estatal de Investigación), Spain, and by FEDER, UE.

Data Availability Statement

The source code for the Chatex WebUI, the information extraction pipeline scripts, prompt templates, the ontology file, and example data used for evaluation are publicly available from GitLab at https://gitlab.com/universidad4774909/tfg/chatex-webui (accessed on 20 February 2025). The synthetic dataset generation script and benchmark details can be found at https://gitlab.com/academic15/llm-synthetic-data-generation-mathematical-conlcusions-and-hypotheses (accessed on 20 February 2025) and https://gitlab.com/academic15/custom-benchmarking-llms (accessed on 20 February 2025), respectively.

Acknowledgments

We sincerely thank the reviewers for their valuable suggestions, which have helped us to improve the clarity, completeness, and overall quality of the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Prompt Template: Hypothesis Extraction

Listing A1. Prompt template used for extracting hypotheses from mathematical theorems.

References

  1. Kohlhase, M. OMDoc—An Open Markup Format for Mathematical Documents, Version 1.2; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4180, pp. XIX, 432.
  2. Elizarov, A.; Kirillovich, A.; Lipachev, E.; Nevzorova, O. Digital Ecosystem OntoMath: Mathematical Knowledge Analytics and Management. In Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2016; Communications in Computer and Information Science; Kalinichenko, L., Kuznetsov, S., Manolopoulos, Y., Eds.; Springer: Cham, Switzerland, 2017; Volume 706, pp. 34–45.
  3. Weikum, G.; Dong, X.L.; Razniewski, S.; Suchanek, F.M. Machine knowledge: Creation and curation of comprehensive knowledge bases. Found. Trends Databases 2021, 10, 108–490.
  4. Xue, B.; Zou, L. Knowledge Graph Quality Management: A Comprehensive Survey. IEEE Trans. Knowl. Data Eng. 2023, 35, 4969–4988.
  5. Tonmoy, S.M.T.I.; Zaman, S.M.M.; Jain, V.; Rani, A.; Rawte, V.; Chadha, A.; Das, A. A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models. arXiv 2024, arXiv:2401.01313.
  6. Huang, L.; Yu, W.; Zhang, W.; Tian, Y.; Qiu, S.; Liu, C.; Niu, D.; Yue, D.; Wu, J.R.; Wang, J. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv 2024, arXiv:2311.05232.
  7. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2024, arXiv:2312.10997.
  8. Guo, T.; Yang, Q.; Wang, C.; Liu, Y.; Li, P.; Tang, J.; Li, D.; Wen, Y. KnowledgeNavigator: Leveraging large language models for enhanced reasoning over knowledge graph. Complex Intell. Syst. 2024, 10, 7063–7076.
  9. Pan, S.; Luo, L.; Wang, Y.; Chen, C.; Wang, J.; Wu, X. Unifying Large Language Models and Knowledge Graphs: A Roadmap. arXiv 2023, arXiv:2306.08302.
  10. Mirzadeh, I.; Alizadeh, K.; Shahrokhi, H.; Tuzel, O.; Bengio, S.; Farajtabar, M. GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. arXiv 2024, arXiv:2410.05229.
  11. Luo, L.; Zhao, Z.; Haffari, G.; Gong, C.; Pan, S. Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models. arXiv 2024, arXiv:2404.12954.
  12. Rahman, A.M.M.; Yang, T.; Yang, M.; Zhao, H.; Yang, T. LemmaHead: RAG Assisted Proof Generation Using Large Language Models. arXiv 2025, arXiv:2501.15797.
  13. Xena Project. Lean in 2024. Blog Post. 2024. Available online: https://xenaproject.wordpress.com/2024/01/20/lean-in-2024/ (accessed on 20 February 2025).
  14. Buzzard, K. The Xena Project. Online Talk/Blog. 2021. Ongoing Project. Available online: https://xenaproject.wordpress.com/ (accessed on 20 February 2025).
  15. Dixit, P.; Oates, T. SBI-RAG: Enhancing Math Word Problem Solving for Students Through Schema-Based Instruction and Retrieval-Augmented Generation. arXiv 2024, arXiv:2410.13293.
  16. Hogan, A.; Blomqvist, E.; Cochez, M.; d’Amato, C.; de Melo, G.; Gutierrez, C.; Kirrane, S.; Gayo, J.E.L.; Navigli, R.; Neumaier, S.; et al. Knowledge graphs. ACM Comput. Surv. 2021, 54, 1–37.
  17. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2023, arXiv:2201.11903.
  18. Kwon, W.; Li, Z.; Zhuang, S.; Sheng, Y.; Zheng, L.; Yu, C.H.; Gonzalez, J.E.; Zhang, H.; Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv 2023, arXiv:2309.06180.
  19. Gurajada, S.; Seufert, S.; Miliaraki, I.; Theobald, M. TriAD: A distributed shared-nothing RDF engine based on asynchronous message passing. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22–27 June 2014; pp. 289–300.
  20. Schwarte, A.; Haase, P.; Hose, K.; Schenkel, R.; Schmidt, M. FedX: A Federation Layer for Distributed Query Processing on Linked Open Data. In The Semantic Web: Research and Applications, Proceedings of the 8th Extended Semantic Web Conference (ESWC 2011), Heraklion, Greece, 29 May–2 June 2011; LNCS 6644; Springer: Berlin/Heidelberg, Germany, 2011; pp. 481–486.
  21. Dadzie, A.S.; Rowe, M. Approaches to Visualising Linked Data: A Survey. Semant. Web 2011, 2, 89–124.
  22. Katifori, A.; Halatsis, C.; Lepouras, G.; Vassilakis, C.; Giannopoulou, E. Ontology visualization methods—A survey. ACM Comput. Surv. 2007, 39, 10-es.
  23. Heer, J.; Shneiderman, B. Interactive dynamics for visual analysis. Commun. ACM 2012, 55, 45–54.
  24. Shneiderman, B. The eyes have it: A task by data type taxonomy for information visualizations. In Proceedings of the 1996 IEEE Symposium on Visual Languages, Boulder, CO, USA, 3–6 September 1996; pp. 336–343.
  25. Wang, J.; Huang, J.X.; Tu, X.; Wang, J.; Huang, A.J.; Laskar, M.T.R.; Bhuiyan, A. Utilizing BERT for Information Retrieval: Survey, Applications, Resources, and Challenges. arXiv 2024, arXiv:2403.00784.
  26. Reusch, A.; Thiele, M.; Lehner, W. Transformer-Encoder and Decoder Models for Questions on Math. In Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2022, Bologna, Italy, 5–8 September 2022.
  27. Kohlhase, A.; Kovács, L. (Eds.) Intelligent Computer Mathematics. In Proceedings of the 17th International Conference, CICM 2024, Montréal, QC, Canada, 5–9 August 2024; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2024; Volume 14960.
  28. Novotný, V.; Štefánik, M. Combining Sparse and Dense Information Retrieval. In Working Notes of CLEF 2022, Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2022, Bologna, Italy, 5–8 September 2022; Faggioli, G., Ferro, N., Hanbury, A., Potthast, M., Eds.; CEUR-WS: Bologna, Italy, 2022; pp. 104–118.
  29. Zanibbi, R.; Mansouri, B.; Agarwal, A. Mathematical Information Retrieval: Search and Question Answering. arXiv 2024, arXiv:2408.11646.
  30. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Tau Yih, W.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2021, arXiv:2005.11401.
  31. Sugiyama, K.; Tagawa, S.; Toda, M. Methods for Visual Understanding of Hierarchical System Structures. IEEE Trans. Syst. Man Cybern. 1981, 11, 109–125.
  32. Suzgun, M.; Kalai, A.T. Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding. arXiv 2024, arXiv:2401.12954.
  33. Qi, Z.; Ma, M.; Xu, J.; Zhang, L.L.; Yang, F.; Yang, M. Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers. arXiv 2024, arXiv:2408.06195.
  34. Chen, W.; Ma, X.; Wang, X.; Cohen, W.W. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv 2022, arXiv:2211.12588.
  35. Khot, T.; Trivedi, H.; Finlayson, M.; Fu, Y.; Richardson, K.; Clark, P.; Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks. arXiv 2022, arXiv:2210.02406.
  36. Universidad de Sevilla, Facultad de Matemáticas. Plan de Estudios del Grado en Matemáticas. 2009. Available online: https://matematicas.us.es/titulaciones/grado-en-matematicas/presentacion/plan-de-estudios-del-grado-en-matematicas (accessed on 20 February 2025).
  37. Dijkstra, E.W. A note on two problems in connexion with graphs. Numer. Math. 1959, 1, 269–271.
  38. Kahn, A.B. Topological sorting of large networks. Commun. ACM 1962, 5, 558–562.
  39. Gao, B.; Song, F.; Yang, Z.; Cai, Z.; Miao, Y.; Dong, Q.; Li, L.; Ma, C.; Chen, L.; Xu, R.; et al. Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models. arXiv 2024, arXiv:2410.07985.
  40. Ming, Y.; Purushwalkam, S.; Pandit, S.; Ke, Z.; Nguyen, X.P.; Xiong, C.; Joty, S. FaithEval: Can Your Language Model Stay Faithful to Context, Even If “The Moon is Made of Marshmallows”. arXiv 2024, arXiv:2410.03727.
  41. Card, D.; Henderson, P.; Khandelwal, U.; Jia, R.; Mahowald, K.; Jurafsky, D. With Little Power Comes Great Responsibility. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Virtual, 16–20 November 2020; Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 9263–9274.
  42. Bowman, S.R.; Dahl, G. What Will it Take to Fix Benchmarking in Natural Language Understanding? In Human Language Technologies, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, Online, 6–11 June 2021; Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4843–4855.
  43. Meta. Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date. Available online: https://ai.meta.com/blog/meta-llama-3/ (accessed on 20 February 2025).
  44. Abdin, M.; Aneja, J.; Awadalla, H.; Awadallah, A.; Awan, A.A.; Bach, N.; Bahree, A.; Bakhtiari, A.; Bao, J.; Behl, H.; et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv 2024, arXiv:2404.14219.
  45. Gemma Team; Riviere, M.; Pathak, S.; Sessa, P.G.; Hardin, C.; Bhupatiraju, S.; Hussenot, L.; Mesnard, T.; Shahriari, B.; Ramé, A.; et al. Gemma 2: Improving Open Language Models at a Practical Size. arXiv 2024, arXiv:2408.00118.
  46. Mistral AI. Mathstral: Accelerating Mathematical Discovery with AI. Available online: https://mistral.ai/news/mathstral/ (accessed on 31 November 2024).
  47. Yang, A.; Yang, B.; Hui, B.; Zheng, B.; Yu, B.; Zhou, C.; Li, C.; Li, C.; Liu, D.; Huang, F.; et al. Qwen2 technical report. arXiv 2024, arXiv:2407.10671.
  48. Qwen Team. Qwen2.5: A Party of Foundation Models. arXiv 2024, arXiv:2412.15115.
  49. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783.
  50. Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Hanna, E.B.; Bressand, F.; et al. Mixtral of Experts. arXiv 2024, arXiv:2401.04088.
  51. Qwen Team. Introducing Qwen1.5. Qwen Blog, 4 February 2024. Available online: https://qwenlm.github.io/blog/qwen1.5/ (accessed on 20 February 2025).
  52. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288.
  53. Agrawal, G.; Kumarage, T.; Alghamdi, Z.; Liu, H. Can Knowledge Graphs Reduce Hallucinations in LLMs? A Survey. arXiv 2023, arXiv:2311.07914.
  54. Zhao, J.; Tong, J.; Mou, Y.; Zhang, M.; Zhang, Q.; Huang, X. Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning. arXiv 2024, arXiv:2405.06680.
  55. Boye, J.; Moell, B. Large Language Models and Mathematical Reasoning Failures. arXiv 2025, arXiv:2502.11574.
  56. Subramanian, S.; Elango, V.; Gungor, M. Small Language Models (SLMs) Can Still Pack a Punch: A Survey. arXiv 2025, arXiv:2501.05465.
  57. Zhang, Q.; Wang, B.; Huang, V.S.J.; Zhang, J.; Wang, Z.; Liang, H.; He, C.; Zhang, W. Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction. arXiv 2025, arXiv:2410.21169.
  58. Mathpix. Mathpix PDF to LaTeX. Available online: https://mathpix.com/pdf-to-latex (accessed on 20 February 2025).
  59. HuggingFace. TrOCR. Available online: https://huggingface.co/docs/transformers/model_doc/trocr (accessed on 20 February 2025).
  60. Hao, S.; Gu, Y.; Ma, H.; Hong, J.J.; Wang, Z.; Wang, D.Z.; Hu, Z. Reasoning with Language Model is Planning with World Model. arXiv 2023, arXiv:2305.14992.
  61. Arsenyan, V.; Bughdaryan, S.; Shaya, F.; Small, K.W.; Shahnazaryan, D. Large Language Models for Biomedical Knowledge Graph Construction: Information extraction from EMR notes. In Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, Bangkok, Thailand, 16 August 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 295–317.
  62. Feng, X.; Wu, X.; Meng, H. Ontology-grounded Automatic Knowledge Graph Construction by LLM under Wikidata Schema. arXiv 2024, arXiv:2412.20942.
  63. Auer, C.; Lysak, M.; Nassar, A.; Dolfi, M.; Livathinos, N.; Vagenas, P.; Ramis, C.B.; Omenetti, M.; Lindlbauer, F.; Dinkla, K.; et al. Docling Technical Report. arXiv 2024, arXiv:2408.09869.
  64. Yang, K.; Swope, A.M.; Gu, A.; Chalamala, R.; Song, P.; Yu, S.; Godil, S.; Prenger, R.; Anandkumar, A. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. arXiv 2023, arXiv:2306.15626.
  65. Wang, R.; Zhang, J.; Jia, Y.; Pan, R.; Diao, S.; Pi, R.; Zhang, T. TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts. arXiv 2024, arXiv:2407.03203.
  66. Ballon, M.; Algaba, A.; Ginis, V. The Relationship Between Reasoning and Performance in Large Language Models–o3 (mini) Thinks Harder, Not Longer. arXiv 2025, arXiv:2502.15631.
  67. Snell, C.; Lee, J.; Xu, K.; Kumar, A. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv 2024, arXiv:2408.03314.
  68. Tu, H.; Yu, S.; Saikrishna, V.; Xia, F.; Verspoor, K. Deep Outdated Fact Detection in Knowledge Graphs. In Proceedings of the 2023 IEEE International Conference on Data Mining Workshops (ICDMW), Shanghai, China, 4 December 2023; pp. 1443–1452.
  69. Wang, L.; Zhao, W.; Wei, Z.; Liu, J. SimKGC: Simple Contrastive Knowledge Graph Completion with Pre-trained Language Models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1, pp. 4281–4294.
  70. Chen, J.; Zhang, K.; Gan, A.; Tong, S.; Shen, S.; Liu, Q. Enhancing Knowledge Graph Completion with Entity Neighborhood and Relation Context. arXiv 2025, arXiv:2503.23205.
  71. Yuan, D.; Zhou, S.; Chen, X.; Wang, D.; Liang, K.; Liu, X.; Huang, J. Knowledge Graph Completion with Relation-Aware Anchor Enhancement. arXiv 2025, arXiv:2504.06129.
  72. Satpute, A.; Giessing, N.; Greiner-Petter, A.; Schubotz, M.; Teschke, O.; Aizawa, A.; Gipp, B. Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange. arXiv 2024, arXiv:2404.00344.
Figure 1. Diagram illustrating the different phases of the workflow of the proposed information extraction system.
Figure 2. Homepage of Chatex, the system for extracting information from LaTeX documents. The page provides initial instructions for uploading LaTeX files and populating the knowledge base.
Figure 4. Page for uploading LaTeX files to populate the Chatex knowledge base. Users can drag and drop files or browse to select them manually.
Figure 5. Interface for configuring information extraction in Chatex, including the selection of LaTeX elements to process and the assignment of labels by chapter.
Figure 6. Log page displaying the parsed LaTeX content, including processed definitions, associated symbols, and extraction metadata for the selected chapters.
Figure 7. Interface for querying and classifying statements similar to a given mathematical expression. Users can specify the number of results, adjust the minimum confidence threshold, and assign retrieved statements to different categories, such as assumptions or implications.
Figure 8. Interface for advanced reasoning tasks, showing an example of the Maximize Conclusions functionality. Users can explore assumptions and derived implications, with the underlying algorithms described in Section 2.8.
Figure 9. Ontology visualization interface showing a spring-embedded representation of an RDF graph. Users can selectively display predicates (e.g., assumesThat and impliesThat) and choose between different layout algorithms to explore the structure of the knowledge base.
Figure 11. Example JSON entry from the synthetic dataset, representing a theorem about finite cyclic groups. Each entry specifies associated definitions, hypotheses, and conclusions, along with a field label for the mathematical domain.
Figure 12. Distribution of samples in the synthetic dataset across different mathematical fields. The dataset maintains an approximately uniform distribution to ensure balanced representation across areas such as real analysis, number theory, topology, and group theory.
Figure 13. Mean execution time (in seconds) per model for computing hypotheses and conclusions. Larger models generally exhibit longer processing times.
Table 1. Comparison of embedding models for definition retrieval.
Model | Accuracy | Recall | Execution Time (s)
Bert-MLM_arXiv-MP-class_zbMath | 0.6281 | 0.6281 | 26.8666
math_pretrained_bert | 0.6377 | 0.6377 | 28.3329
Bert-MLM_arXiv | 0.6706 | 0.6706 | 29.6985
Table 2. Metrics for small models.
Model | Accuracy | Precision | Recall | F1 Score
gemma2:2B | 0.7415 | 0.7311 | 0.9767 | 0.7415
llama3.2:3B | 0.7281 | 0.7174 | 0.9651 | 0.7281
phi3.5:latest | 0.7466 | 0.7436 | 0.9690 | 0.7466
Table 3. Metrics for medium models.
Model | Accuracy | Precision | Recall | F1 Score
mathstral:latest | 0.7384 | 0.7308 | 0.9690 | 0.7384
qwen2.5:7B | 0.7362 | 0.7291 | 0.9690 | 0.7362
qwen2-math:7B | 0.7380 | 0.7238 | 0.9767 | 0.7380
gemma2:9B | 0.7328 | 0.7257 | 0.9690 | 0.7328
llama3:8B-instruct | 0.7336 | 0.7218 | 0.9690 | 0.7336
llama3.1:8B | 0.7374 | 0.7234 | 0.9767 | 0.7374
Table 4. Metrics for large models.
Model | Accuracy | Precision | Recall | F1 Score
mixtral:8x7B-instruct-v0.1-fp16 | 0.7317 | 0.7176 | 0.9767 | 0.7317
command-r:35B | 0.7352 | 0.7211 | 0.9767 | 0.7352
llama2:70B-chat | 0.7319 | 0.7203 | 0.9690 | 0.7319
llama3:70B-instruct | 0.7247 | 0.7145 | 0.9535 | 0.7480
llama3.1:70B | 0.7258 | 0.7180 | 0.9535 | 0.7491
qwen1.5:72B | 0.7402 | 0.7261 | 0.9767 | 0.7402
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
