DL-ReasonSuite: A Benchmark for Evaluating Description Logic Reasoning in Large Language Models

Oluçoğlu, Müge; Bursa, Okan

doi:10.3390/app16041821

by Müge Oluçoğlu * and Okan Bursa

Computer Engineering Department, Faculty of Engineering and Architecture, Izmir Bakircay University, 35665 Izmir, Turkey

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(4), 1821; https://doi.org/10.3390/app16041821

Submission received: 17 January 2026 / Revised: 2 February 2026 / Accepted: 5 February 2026 / Published: 12 February 2026

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Large language models (LLMs) have shown remarkable progress in general reasoning and understanding, but their ability to perform formal logical reasoning remains under-explored. In this paper, we introduce DL-ReasonSuite, a novel benchmark designed to rigorously evaluate LLMs on reasoning tasks grounded in Description Logic (DL). DL-ReasonSuite comprises 4740 tasks spanning seven distinct task types organized into three reasoning tracks: (1) DL-Core, covering fundamental ontology reasoning tasks (consistency checking, subsumption, and instance checking); (2) DL-Query, focusing on answering entailment-aware SPARQL queries; and (3) DL-Bridge, bridging natural language and formal logic (bidirectional NL ↔ OWL translation and tool-augmented entailment resolution). We detail the methodology for designing and implementing this benchmark, including task construction, automatic evaluation metrics, and validation using reliable OWL reasoners. We then present an empirical evaluation of five leading reasoning LLMs, taken as state-of-the-art models, on the full suite of tasks: Kimi k1.5, Llama-Nemotron Ultra, DeepSeek-R1, Phi-4 Reasoning Plus, and Phi-4 Reasoning. Our results reveal significant variability in LLM performance on formal reasoning. While the best model, Phi-4 Reasoning Plus, achieves an overall accuracy of 85% and excels especially in tool-augmented tasks, other models struggle notably with complex DL query reasoning and precise OWL translation. We analyze the strengths and weaknesses of each model across different DL metrics and task categories, providing insights into current limitations of LLM reasoning, such as handling SPARQL queries and maintaining logical consistency, and into the benefits of neuro-symbolic techniques. DL-ReasonSuite is a comprehensive framework for assessing and advancing LLMs’ Description Logic reasoning capabilities, aiming to bridge the gap between natural language understanding and formal knowledge representation.

Keywords:

large language models; reasoning; description logic; neuro-symbolic reasoning; ontology reasoning; OWL 2; SPARQL; benchmark suite; knowledge graph; reasoning metrics

1. Introduction

Reasoning is a core component of human intelligence and a longstanding challenge in artificial intelligence. Recent advances in large language models (LLMs) have demonstrated surprising capabilities for reasoning in natural language contexts, especially as model scale increases. For instance, state-of-the-art LLMs like GPT-4 have been noted as “advanced” at many reasoning tasks and they exhibit emergent reasoning behaviors when provided with Chain-of-Thought prompts. Despite such progress, it remains unclear to what extent these models truly understand and reason versus reciting learned patterns. In fact, studies have shown that LLMs can fail on logical puzzles or planning problems that are trivial for humans, raising concerns that they may be “stochastic parrots”, producing fluent answers without genuine reasoning [1]. A key open question is whether LLMs are actually performing logical reasoning or merely memorizing solutions seen during training. As a result, there is growing consensus that more rigorous and unbiased evaluation of LLMs’ reasoning ability is needed [2]. To accurately assess reasoning, new benchmarks must ensure models cannot rely on memorized answers and must truly employ reasoning skills to solve novel problems [3].

In this context, formal knowledge representation frameworks play a critical role in defining what constitutes correct reasoning. In the fields of the Semantic Web and knowledge representation, Description Logics (DLs) constitute a family of formal logics based on concepts and relations and form the theoretical foundation of ontologies [4]. In particular, the Web Ontology Language (OWL) standard (https://www.w3.org/TR/owl-ref/) (accessed on 17 January 2026) is grounded in DL principles and provides a formal basis for knowledge sharing in semantic web environments. The second version of OWL, namely OWL-2 (https://www.w3.org/TR/owl2-overview/) (accessed on 17 January 2026), significantly increases the expressivity of the language compared to its predecessor and is based on the highly expressive Description Logic SROIQ [5,6]. This increased expressivity enables the modeling of complex ontological constructs—such as role hierarchies, qualified cardinality restrictions, and advanced property characteristics—and supports automated reasoning tasks including consistency checking, classification, and hierarchy inference through dedicated DL-based reasoning engines.

A variety of automated reasoning systems have been developed to perform logical inference over DL-based ontologies. Widely adopted reasoners such as Pellet and HermiT support key reasoning services for OWL DL ontologies, including constraint satisfaction, classification and query answering [7,8]. These tools enable the derivation of implicit knowledge from explicitly stated axioms for a domain and play a critical role in semantic applications to deduce meaningful results from the domain knowledge. Over time, extensive efforts have been devoted to evaluating the performance and scalability of such DL reasoners. Notably, the OWL Reasoner Evaluation (ORE) 2015 competition provided a comprehensive comparative assessment of state-of-the-art OWL reasoners across a variety of reasoning tasks and datasets, highlighting their strengths and limitations under different conditions [9].

To facilitate systematic evaluation, several benchmark datasets have been proposed for knowledge-based systems. Among the earliest and most influential is the Lehigh University Benchmark (LUBM), which employs a synthetic university-domain ontology and a fixed set of queries to measure the performance of knowledge-base systems over OWL [10]. Subsequent work extended LUBM into the University Ontology Benchmark (UOBM), incorporating a broader range of OWL language constructs to better reflect realistic reasoning scenarios [11]. More recently, OWL2Bench was introduced as a customizable benchmark designed specifically to evaluate OWL 2 reasoners under diverse and configurable settings [12]. In addition to synthetic benchmarks, large-scale collections of real-world ontologies have also been compiled to support empirical analysis. For example, the Manchester OWL Corpus provides a curated collection of OWL DL ontologies drawn from multiple domains, which makes it possible to evaluate reasoners on realistic and heterogeneous knowledge bases [13].

These ontologies were built on logical language families, in our case DLs, which form the logical foundation of the Web Ontology Language (OWL) and have historically maintained a balance between expressive power and computational decidability. However, limitations encountered in current research have shifted attention towards the distinctive nature of real-world data, such as widespread inconsistencies and gaps, and towards the need for explainable results in high-risk areas like clinical decision support and manufacturing process planning [14]. Even though OWL-2 defines profiles based on different DL families (OWL Profiles, https://www.w3.org/TR/2012/REC-owl2-profiles-20121211/) (accessed on 17 January 2026) to mitigate these risks and capture the intended semantics of real-world data, mapping real-world entities to logical constructs could still lead to satisfiability problems being missed in the resulting models. The literature reviewed for this study is characterized by intensive investigation into the development of neuro-symbolic architectures, the formalization of non-monotonic semantics for complex terminologies, and the complexity of minimal model reasoning [14,15].

DLs such as $\mathcal{ALC}(\mathcal{D})$, which allow referencing qualitative and quantitative values through a concrete domain $\mathcal{D}$, have been subjected to refined complexity analyses [16]. This research has shown that deciding the consistency of an $\mathcal{ALC}(\mathcal{D})$ ontology is ExpTime-complete when the concrete domain $\mathcal{D}$ is $\omega$-admissible and its constraint satisfaction problem is decidable in exponential time. This finding is critical for the integration of spatial and temporal reasoning into standard DL frameworks, as it provides a predictable complexity bound for developers of automated reasoning systems.

In parallel with these studies, the $\mathcal{EL}$ logic family, known for its use in large-scale biomedical ontologies such as SNOMED CT, has been extended to handle different levels of abstraction [17]. The inclusion of abstraction and refinement operators allows researchers to reason at different levels of detail. For example, an aircraft can be viewed as a single entity at one level and as a collection of parts at a more detailed level. While the inclusion of these operators in highly expressive logics like $\mathcal{ALC}$ often results in 2ExpTime-complete reasoning, recent work has identified polynomial-time fragments within the $\mathcal{EL}$ family using set-based ensemble semantics. A related line of work studies counting queries: the answer to a counting query is an integer or infinity, and its spectrum is the set of answers over all models of a knowledge base [18]. The authors determined that the spectra for $\mathcal{ALCIF}$ ontologies have simple shapes, typically subsets of the natural numbers closed under addition. They proved that computing an efficient representation of these spectra is $\mathrm{FP}^{\mathrm{NP}[\log]}$-complete [18].

Unlike traditional open-world semantics, minimal model reasoning assumes that a statement should be considered false if it is not supported by the knowledge base. This principle is central to non-monotonic formalisms such as Answer Set Programming (ASP) and circumscription [19]. Concept satisfiability in minimal models is undecidable, even for a lightweight logic like $\mathcal{EL}$ [20]. To regain decidability, researchers have proposed strong and weak acyclicity conditions on TBoxes that reduce the combined complexity to NExpTime-complete or $\mathrm{NExp}^{\mathrm{NP}}$-complete, respectively [21]. This means that the desire for more “intuitive” closed-world reasoning in modern DL research often comes with a significant jump in computational cost. In contrast, by proving that standard reasoning problems remain ExpTime-complete, the authors of [22] demonstrated that non-monotonic stable model semantics are computationally no more expensive than classical descriptive semantics.

The research in [22] has been instrumental in defining a stable model semantics for default negation and DL terminologies that naturally supports both assumptions (open-world and closed-world). In Quantified Equilibrium Logic (QEL), the need for Skolemization, a significant bottleneck in previous attempts to unify rules and ontologies, has been eliminated [14]. Their work on $\mathcal{ALCI}$ terminologies has proven that standard reasoning tasks such as concept satisfiability and subsumption remain ExpTime-complete under stable model semantics. This result makes a significant contribution to the literature: it shows that the benefits of non-monotonic reasoning can be achieved for $\mathcal{ALCI}$ without exceeding the worst-case complexity of classical descriptive semantics.

Defeasible reasoning, which involves reasoning with statements that are “generally” true but allow for exceptions, has made significant progress with the introduction of System W and Lexicographic Closure (LC) for the $\mathcal{DALC}$ logic. The primary motivation of this work is to address the drowning problem found in Rational Closure (RC), previously the standard approach for defeasible reasoning in Description Logics. Unlike RC, which relies on a single ordering of interpretations, System W performs non-monotonic reasoning by comparing interpretations based on the sets of defeasible axioms they violate, and it prioritizes more specific information. This improvement allows System W to yield decidedly more informative and intuitively justified conclusions while retaining the desirable inferential properties of RC. Furthermore, the resulting inference relation satisfies the System P postulates, including Cautious Monotonicity and Left Logical Equivalence, and provides a faithful generalization of its propositional counterpart [23].

The study in [24] provided a central overview of techniques for explaining both why something is logically inferred (positive) and why it is not (negative). Explaining positive conclusions typically involves justifications, proofs, and Craig interpolation [25]. Justifications identify a minimal subset of axioms that supports the conclusion. Proofs establish a sequence of inference steps that a human user can follow. Craig interpolation uses interpolants to bridge the gap between the terminology of the axioms and the terminology of the conclusion. For negative conclusions, abductive reasoning identifies what information is missing in the knowledge base for the inference to be valid [24]. This is important for debugging incomplete ontologies and communicating the behavior of reasoning systems to non-expert users.

Contrastive explanations for ABox reasoning answer the question of why one individual possesses a property while another does not [26]. By focusing on the relevant commonalities and differences between individuals, contrastive explanations draw attention to specific factors that lead to different classification outcomes [24].

The integration of sub-symbolic methods with symbolic reasoning has emerged as a dominant strategy for achieving scalability and robustness. Traditional symbolic reasoning tools are extremely susceptible to noise: a single logical contradiction can render the entire knowledge base unusable for reasoning [14]. The Embedding-Based Reasoner (EBR) approximates symbolic reasoning using knowledge graph embeddings and is designed to handle large-scale, incomplete, and inconsistent knowledge bases. Experimental results have shown that EBR-based approaches perform better than traditional symbolic approaches on datasets with missing assertions, which opens the way for reasoners that work with a small number of training samples and properties [27].

As a result of evolving technologies and research, the role of large language models (LLMs) in this ecosystem is also changing. Rather than treating LLMs as independent systems prone to logical errors, recent work has explored their use in generating Chain-of-Thought statements that are verified by external symbolic solvers. The Logic-RAG framework uses Description Logics as a fundamental mechanism to ensure that generated outputs remain consistent with a verified knowledge graph [28].

Most existing reasoning benchmarks for LLMs focus on natural language logic puzzles, mathematical word problems, or commonsense reasoning. However, beyond coverage and task diversity, an additional and more fundamental limitation emerges when these benchmarks are considered in the context of formal logical reasoning. LLMs approximate deterministic outcomes through statistical prediction, whereas logical reasoning requires a level of certainty that statistical guessing cannot guarantee. At the same time, many real-world knowledge-driven applications require reasoning with formal logic representations. This need has motivated extensive benchmarking efforts within the Semantic Web community, beginning with the Lehigh University Benchmark (LUBM) (https://swat.cse.lehigh.edu/projects/lubm/) (accessed on 17 January 2026), which provides a scalable ontology and a set of ABox queries for evaluating reasoning capability and scalability. Subsequent benchmarks such as UOBM (https://www.cs.ox.ac.uk/isg/tools/UOBMGenerator/) (accessed on 17 January 2026), OntoBench [29], and the OWL Reasoner Evaluation (ORE) framework [9] extended this line of work by covering additional OWL constructs, real-world ontologies, and broader reasoning scenarios. DL-ReasonSuite is fundamentally different from existing OWL reasoner benchmarks such as OWL2Bench in both its evaluation objective and task design. Traditional OWL benchmarks are primarily concerned with assessing the performance, scalability, and completeness of symbolic reasoners operating over large ontologies, with success measured in terms of runtime, memory usage, and throughput [30]. In contrast, DL-ReasonSuite is explicitly designed to evaluate large language models as reasoning agents, focusing on correctness, semantic faithfulness, and entailment preservation under strict Description Logic semantics.
Rather than measuring how efficiently a reasoner processes an ontology, the benchmark asks whether an LLM can correctly perform core DL reasoning tasks, translate between natural language and formal OWL representations without semantic loss and reliably interact with external symbolic reasoners when internal reasoning is insufficient. This shift in evaluation target constitutes the central novelty of DL-ReasonSuite.

More recent benchmarks such as OWL2Bench support the evaluation of TBox and ABox reasoning tasks with controlled vocabulary complexity. Nevertheless, these benchmarks primarily target symbolic reasoners and emphasize efficiency metrics such as runtime and memory usage. LLMs, by contrast, require a different approach because they operate under context-length constraints and produce probabilistic outputs. This difference makes it hard for LLM benchmarks to evaluate the reasoning complexity of DL tasks quantitatively. Our work aims to close this gap by repurposing the spirit of these benchmarks: it focuses on accuracy and correctness of reasoning rather than throughput, and it designs tasks that can be evaluated within a single LLM prompt.

While benchmarks such as LUBM, UOBM, and ORE have played a central role in evaluating Description Logic reasoners, their evaluation objectives differ fundamentally from those of DL-ReasonSuite. LUBM provides 14 fixed test queries over a synthetic university ontology, while UOBM extends LUBM by increasing OWL expressivity, with its commonly used workload consisting of 15 standard queries. ORE, in turn, focuses on comparative evaluation of reasoners rather than tasks, featuring 14 competing reasoners across 6 tracks covering consistency, classification, and realization under the OWL 2 DL and EL profiles. Although ORE comes close to covering DL tasks, its evaluation does not address the consistency and reasoning of LLMs. In contrast, DL-ReasonSuite is explicitly constructed as an LLM-oriented benchmark that targets the reasoning behavior of language models rather than the efficiency of symbolic engines. DL-ReasonSuite comprises 4740 tasks organized into seven task types across three reasoning tracks, which include not only core DL inference but also entailment-aware querying and natural language-to-formal representation translation. Moreover, DL-ReasonSuite does not discard other benchmarks: it adapts structural patterns from LUBM. These adapted patterns are re-instantiated using a large and diverse set of ontology symbols and task templates to ensure compatibility with single-prompt LLM evaluation and to assess full reasoning pipelines rather than isolated query answering.

As these developments unfold, the role of large language models (LLMs) within this ecosystem is rapidly evolving. Large-scale pretrained models have demonstrated emergent reasoning capabilities, particularly under few-shot learning settings, suggesting that certain forms of reasoning may arise implicitly from scale [31]. Subsequent work showed that explicitly prompting models to generate intermediate reasoning steps—known as Chain-of-Thought prompting—substantially improves performance on complex multi-step reasoning tasks [32].

Beyond purely internal reasoning, recent approaches have emphasized the integration of reasoning and acting. The ReAct framework enables language models to interleave logical deliberation with tool use and external information access, thereby extending their effective reasoning horizon beyond the context window [33]. In parallel, the release of open and efficient foundation models such as the LLaMA family has significantly lowered the barrier for controlled experimentation and benchmarking of reasoning-oriented LLMs [34]. Despite these developments in LLM reasoning, it has become increasingly clear that standalone LLMs struggle with strictly formal and rule-governed reasoning tasks. These tasks require the deliberate execution of particular logical constraints in a specific domain. To address this limitation, neuro-symbolic approaches have emerged that combine the flexibility of neural language models with the precision of symbolic reasoning systems. Logic-LM introduces a framework in which an LLM translates natural language problems into formal logical representations that are subsequently verified by external symbolic solvers. This transformation enables faithful logical reasoning [35]. Similar to Logic-LM, the LINC approach employs LLMs as translators between natural language and first-order logic, leveraging automated theorem provers to validate entailments and enforce formal correctness over a controlled vocabulary [36]. These approaches show that, when appropriately constrained and augmented with logical structures, LLMs can effectively function as neuro-symbolic reasoners, combining neural generalization with symbolic rigor [30].

LLM-oriented approaches are not the only ones concerned with transforming natural language constructs into logical representations. In parallel with ontology-focused benchmarks, the natural language processing community has proposed datasets aimed at evaluating logical reasoning in textual settings. The LogiQA benchmark assesses logical reasoning in machine reading comprehension by requiring models to derive conclusions from structured premises [37]. ProofWriter further extends this line of work by evaluating a model’s ability to generate logical implications, construct proofs, and produce abductive explanations from natural language inputs [38]. While these benchmarks provide valuable insights into linguistic and informal reasoning, they do not directly evaluate reasoning over the axioms of a complex formal logic language, such as Description Logic axioms or OWL ontologies.

Parallel to ontology-focused benchmarks, the NLP community has proposed datasets such as LogiQA [37], ReClor [39], LogicBench [40], and DivLogicEval [41] to assess logical reasoning in natural language. These works are valuable, but they predominantly test informal reasoning, often conflate logical inference with language understanding or world knowledge, and are not connected to other reasoning tasks already evaluated with LLMs.

However, beyond coverage and task diversity, a more fundamental limitation arises when existing benchmarks are evaluated from the perspective of formal Description Logic reasoning. The defined logical language aims to establish a deterministic reasoning framework grounded in Description Logic under a closed-world assumption. For such a framework to be meaningfully integrated with knowledge graph and context graph structures, large language models must operate not merely as text generators but as systems capable of performing inference at both the TBox and ABox levels, effectively approximating the behavior of a DL reasoner. Existing reasoning benchmarks do not impose this requirement, as they are not designed to support strict and formally defined logical language families with explicit semantic constraints. However, as the integration of knowledge graphs and context graphs becomes increasingly prevalent, systems operating in these settings will inevitably be required to support such formal reasoning capabilities. Our proposed benchmark makes this requirement explicit and visible, serving as a precursor to this transition. The resulting evaluation aims to promote more controlled and verifiable reasoning behavior, particularly by reducing hallucinations in smaller models and enforcing consistency in decision-making processes.

In this context, DL-ReasonSuite is proposed as a comprehensive benchmark for evaluating LLMs on formal Description Logic reasoning tasks. Inspired by established OWL reasoner benchmarks but adapted to the constraints and characteristics of language models, DL-ReasonSuite emphasizes accuracy across a diverse set of reasoning tasks, including core DL inference, ontology querying, and natural language translation. By evaluating both standalone and tool-augmented LLMs, the benchmark aims to provide a detailed and systematic picture of the current strengths and limitations of LLMs in formal logical reasoning.

The contributions of this work are threefold. First, we present the design of DL-ReasonSuite, detailing the methodology for constructing a balanced and comprehensive set of Description Logic reasoning tasks with automated scoring based on symbolic tools. Second, we provide an extensive evaluation of state-of-the-art reasoning-oriented LLMs on the benchmark, analyzing their performance across different task categories and evaluation settings. Third, we discuss the implications of the results, highlighting which aspects of formal DL reasoning are within reach of current LLMs and which remain challenging. By introducing DL-ReasonSuite, this work aims to support the development of language models that can more reliably reason over structured knowledge and contribute to the broader integration of neural and symbolic approaches.

2. Materials and Methods

2.1. Benchmark Overview and Design

DL-ReasonSuite is structured into 7 task types grouped under 3 high-level reasoning tracks. Each track corresponds to a different aspect of Description Logic reasoning competence, and the task types within each track target specific skills or problem formats. The entire suite consists of 4740 individual tasks, with a balanced representation of tasks across types (on average 677 tasks per type). All tasks are designed to be solvable with explicit information provided in the prompt, i.e., the necessary axioms or data are given to the model so that reliance on parametric memory or external knowledge is minimized (except in the designated tool-augmented cases). Below, we describe each track and its constituent task types.

2.1.1. DL-Core Track

Core ontology reasoning tasks carried out on a static set of DL axioms. This track includes three fundamental reasoning types common in OWL reasoning:

Consistency Checking: Determine whether a given ontology, viewed as a set of DL axioms, is logically consistent or contains a contradiction. For example, the task might present a set of class axioms and assertions and ask whether any logical conflict exists; an inconsistent ontology yields “No” for consistency. This tests the model’s ability to detect subtle logical contradictions in a DL model.
Subsumption (Classification) Inference: Determine if one concept (class) is a subclass of another under the provided axioms. The model is given TBox axioms (class definitions, hierarchies, and restrictions) and must infer whether Class A ⊑ Class B holds (or not). This mirrors the ontology classification task that reasoners perform (computing subclass hierarchies).
Instance Checking (Realization): Determine if a particular individual is an instance of a given concept, given an ontology. The prompt provides TBox axioms and some ABox assertions (facts about individuals), and asks whether “Individual x is an instance of Class Y” is entailed. This is analogous to the realization task (assigning individuals to classes) in DL reasoning.

Each DL-Core task is typically formulated as a yes/no question (or a choice between “consistent” vs. “inconsistent” for the first type). The ontology snippets used in these tasks are relatively small (on the order of 5–20 axioms) so that an LLM can theoretically parse and reason over them within its context window. We ensured that a variety of DL constructs is covered across the tasks (e.g., existential and universal restrictions, disjointness, and domain/range axioms) to avoid overspecialization. The ground truth (correct answer) for each task is determined by running a DL reasoner on the given axioms prior to posing the question to models.
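The three DL-Core question types can be illustrated with a minimal sketch. It assumes a deliberately tiny fragment (atomic SubClassOf axioms, pairwise disjointness, and class assertions only) and is not the benchmark's ground-truth procedure, which relies on a full DL reasoner. All function names and the example ontology are hypothetical.

```python
def superclasses(cls, subclass_of):
    """Return cls plus every class reachable through told SubClassOf axioms."""
    seen, stack = {cls}, [cls]
    while stack:
        current = stack.pop()
        for sub, sup in subclass_of:
            if sub == current and sup not in seen:
                seen.add(sup)
                stack.append(sup)
    return seen


def is_subsumed(sub, sup, subclass_of):
    """Subsumption: is `sub` a subclass of `sup` under the axioms?"""
    return sup in superclasses(sub, subclass_of)


def inferred_types(individual, assertions, subclass_of):
    """All classes the individual belongs to, asserted or inherited."""
    types = set()
    for ind, cls in assertions:
        if ind == individual:
            types |= superclasses(cls, subclass_of)
    return types


def is_instance(individual, cls, assertions, subclass_of):
    """Instance checking: is `individual` entailed to be a member of `cls`?"""
    return cls in inferred_types(individual, assertions, subclass_of)


def is_consistent(assertions, subclass_of, disjoint):
    """Consistency for this fragment: no individual falls into two disjoint classes."""
    for ind in {ind for ind, _ in assertions}:
        types = inferred_types(ind, assertions, subclass_of)
        if any(a in types and b in types for a, b in disjoint):
            return False
    return True


# Hypothetical mini ontology (assumed for illustration only).
subclass_of = {("Professor", "Faculty"), ("Faculty", "Person")}
disjoint = {("Person", "Course")}
assertions = {("alice", "Professor")}

print(is_subsumed("Professor", "Person", subclass_of))
print(is_instance("alice", "Person", assertions, subclass_of))
print(is_consistent(assertions, subclass_of, disjoint))
print(is_consistent(assertions | {("alice", "Course")}, subclass_of, disjoint))
```

In this toy setting the first three checks succeed, and asserting that alice is also a Course makes the ontology inconsistent because Person and Course are disjoint. Constructs such as existential or universal restrictions are outside this fragment and need a real reasoner.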

2.1.2. DL-Query Track

Knowledge base querying tasks that require logical entailment. The DL-Query track is also informed by prior benchmarks and neuro-symbolic approaches targeting logical reasoning in query-based and textual settings. LogiQA introduced a challenge dataset for evaluating logical reasoning in machine reading comprehension, highlighting the difficulty of deriving correct conclusions from structured premises expressed in natural language [37]. More recently, LINC demonstrated that integrating large language models with first-order logic provers enables more faithful logical reasoning, motivating DL-Query tasks that require precise entailment-aware query answering rather than surface-level pattern matching [36]. These prior efforts indicate that logical reasoning competence is most clearly exposed in query-based settings, where correct answers must be derived through inference rather than retrieved directly from explicitly stated facts. In Description Logic-based knowledge systems, such inference-oriented reasoning is naturally expressed through entailment-aware querying. To simulate this behavior, tasks in this track involve SPARQL queries or query-like questions that the model must answer by reasoning over a given knowledge base. We focus specifically on entailment-aware SPARQL queries, meaning that the queries may involve triples that are not explicitly present but can be inferred from the ontology. For example, a task might provide a small ontology (TBox and ABox, e.g., a mini university dataset with classes such as Professor and Course and relationships between them) and then pose a SPARQL ASK or SELECT query that requires inference. As a sample, given axioms stating that every faculty member is a person and that the individual Alice is a faculty member, a query might ask whether Alice is a Person (which requires the subsumption inference faculty → person).
The model’s job is to return the correct boolean result for ASK queries or the correct list of entities for SELECT queries, based solely on the provided data plus OWL reasoning. We generated queries of varying complexity (from single triple pattern queries to ones with multiple joins and optional clauses) and ensured that at least one non-trivial inference (class hierarchy, property transitivity, etc.) is needed to get the right answer. The ground truth answers were computed using a standard SPARQL engine combined with an OWL reasoner (i.e., materializing the inferred triples or using a reasoner-backed query engine). This track tests the model’s ability to understand a formal query language and perform multi-step logical retrieval, effectively simulating a knowledge base query scenario.
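The difference between plain and entailment-aware querying can be sketched as follows. The example assumes the materialization route mentioned above, restricted to two RDFS-style rules for subclass transitivity and type inheritance. It uses plain Python tuples instead of a real SPARQL engine, and the names `materialize`, `ask`, and `select_instances` are hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def materialize(triples):
    """Forward-chain two RDFS-style rules until no new triple is derived."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in closure:
            if p != SUBCLASS_OF:
                continue
            for s2, p2, o2 in closure:
                # (A subClassOf B), (B subClassOf C) => (A subClassOf C)
                if p2 == SUBCLASS_OF and s2 == o:
                    new.add((s, SUBCLASS_OF, o2))
                # (x type A), (A subClassOf B) => (x type B)
                if p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))
        if not new <= closure:
            closure |= new
            changed = True
    return closure


def ask(triples, pattern):
    """Analogue of a single-triple SPARQL ASK query over a set of triples."""
    return pattern in triples


def select_instances(triples, cls):
    """Analogue of SELECT ?x WHERE { ?x rdf:type <cls> }, sorted for stable output."""
    return sorted(s for s, p, o in triples if p == RDF_TYPE and o == cls)


# Hypothetical knowledge base mirroring the Alice example.
kb = {
    (":Faculty", SUBCLASS_OF, ":Person"),
    (":Alice", RDF_TYPE, ":Faculty"),
    (":Bob", RDF_TYPE, ":Person"),
}
question = (":Alice", RDF_TYPE, ":Person")

print(ask(kb, question))               # explicit triples only
print(ask(materialize(kb), question))  # after materializing inferences
print(select_instances(materialize(kb), ":Person"))
```

Over the explicit triples alone the ASK pattern is not matched; after materialization it is, and the SELECT analogue returns both :Alice and :Bob. A model answering a DL-Query task has to perform the equivalent of the materialization step internally.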

2.1.3. DL-Bridge Track

Bridging between natural language and OWL formalism, and leveraging tools for inference. This track contains the remaining task types that involve translation or hybrid reasoning:

NL → OWL Translation: Convert a natural language statement or question into a formal Description Logic expression or OWL axiom. For instance, the prompt might say: “Every employee who works under a manager also reports to that manager.” and expect the model to output an OWL axiom (in a specified syntax, e.g., Manchester or Turtle) capturing this rule (which could be something like an OWL property axiom or a class inclusion). Another variant is translating an English question into a formal query (e.g., a SPARQL query or an OWL class expression that represents the query intent). These tasks evaluate the LLM’s ability to map natural language semantics to formal logic constructs, a key step in building natural language interfaces for ontologies.
OWL → NL Translation (Verbalization): The inverse of the above: given an OWL axiom, class expression, or an entailment in formal notation, produce an equivalent or explanatory natural language sentence. For example, provide an axiom like EquivalentClasses(Person ObjectUnionOf(Man Woman)) and ask the model to verbalize it (e.g., “A person is either a man or a woman.”). This tests the model’s comprehension of formal syntax and its ability to communicate it in understandable language. Such verbalization is useful for explaining logical content to non-experts.
Tool-Augmented Entailment: This is a unique task type where the model is explicitly allowed (and encouraged) to use an external DL reasoner tool to help answer a query or verify an entailment. We designed these tasks to be too complex for the LLM to easily solve alone within its context. For example, we might provide a larger ontology (near the limit of the model’s input size) and ask a question that requires a multi-hop inference or a combination of rules that would be very challenging for the model to do reliably. The model is instructed (via a prompt format or system message) on how to call a reasoning tool (for instance, by outputting a specific command or query that the evaluation framework will execute using a reasoner like HermiT or Openllet). The task is considered successful if the model produces the correct final answer with the help of the tool. This setup evaluates the model’s ability to decide when to use a symbolic tool and to interpret its output. It is akin to testing an LLM’s skill in orchestrating a proof with assistance, reflecting real-world scenarios where a reasoning system might delegate heavy logical lifting to a dedicated module. In our experiments, only some models (notably Phi-4 Reasoning Plus) had been augmented/trained to use tools, making this a particularly discriminative test of advanced reasoning integration.

All tasks in DL-ReasonSuite were validated to ensure clarity and correctness. For each task, we have a gold-standard answer or solution. In the case of tasks with binary or categorical answers (yes/no, or selecting the correct formal axiom among options), the gold answer is unambiguous. For generative tasks (translations), we define correctness in terms of logical equivalence or semantic adequacy rather than exact string match: an output OWL axiom is considered correct if it is logically equivalent to the reference axiom, and an output natural language description is considered acceptable if it accurately captures the meaning, as assessed by logic experts or by heuristic matching of key elements. We developed automated scoring scripts using the OWL API and reasoners for formal outputs and custom regex/keyword matching for natural language outputs, combined with manual review of a sample for calibration. This ensures that the evaluation is as objective as possible, leveraging tools to measure logical correctness where applicable. Figure 1 illustrates the weighting scheme used in the evaluation framework, showing the relative contribution of each reasoning category to the final overall score. More information regarding the DL tracks can be found in the Supplementary Materials.

2.2. Task Generation and Dataset Preparation

The 4740 tasks in DL-ReasonSuite were generated through a combination of synthetic data generation, adaptation of existing benchmark queries and manual crafting for critical cases. Our guiding principle in generation was to cover a broad range of DL concepts and difficulties while keeping each individual task self-contained for an LLM. Key aspects of the generation process include:

Ontology Source and Variety: We drew from multiple domains to make sure the benchmark does not overfit to a particular context. For example, some tasks (especially in DL-Core and DL-Query) use a university domain ontology inspired by LUBM (with classes like Student, Professor, Course, etc.), while others use a family genealogy domain, a medical ontology fragment (patients, symptoms, diseases) or a generic “animals taxonomy” ontology. By varying domains, we require models to truly rely on the given axioms rather than any single domain-specific prior. Many ontologies were generated or adapted via OntoBench-style configuration: we programmatically created class hierarchies of controlled depth, randomly assigned constraints (e.g., disjointness, domain/range for properties) and instantiated individuals with relationships to support instance queries. Each generated ontology was checked with a reasoner to identify implicit truths to ask about (for subsumption, instance, query tasks) or to verify consistency status.
Controlled Complexity: We varied the complexity across tasks to assess how model performance scales. Simpler tasks might involve a taxonomy of 5–6 classes with one straightforward inheritance to check, whereas the hardest core tasks might involve combinations of constructs (for instance: an ontology with 2–3 inclusion axioms, plus a few individual facts that together cause a subtle inconsistency). For query tasks, some queries were simple (single triple patterns) and others complex (involving multiple triple patterns, filters, or requiring 2–3 inference steps). By tagging tasks with a rough difficulty level, we ensured that the overall benchmark includes a spectrum from easy to very challenging problems. This also allows analyzing performance drop-off as complexity increases which relates to known LLM reasoning limits.
It is important to note that DL-ReasonSuite does not impose a strictly incremental or continuous notion of task difficulty. Instead, task complexity is introduced through discrete combinations of Description Logic constructs, such as the presence of negation, disjointness, multi-hop subsumption chains, or constraint interactions.
Template-Based Synthesis: Several tasks were generated using templates to ensure systematic coverage. For example, for consistency tasks, we had templates that introduce a contradiction (like a class that is declared disjoint with another and an individual asserted to be an instance of both) versus templates that resemble contradictions but actually are not (to test if the model can avoid false positives). For subsumption, we used patterns like random DAGs (directed acyclic graphs) of subclass relations and picked random pairs of classes that are or are not in a subclass relationship. Instance checking tasks often combined a hierarchy with some instance data; we would randomly choose an instance–class pair that is entailed and another that is not, to form two separate questions. These templates were instantiated with different ontology symbols (names) across domains to produce many unique items.
Borrowing from Existing Benchmarks: We also incorporated or adapted some queries and scenarios from existing benchmarks to leverage their well-thought-out design. For DL-Query tasks, we reviewed the 14 queries of LUBM and selected a few representative patterns (e.g., “retrieve all graduate students of a given department and their advisors”, which, in LUBM, requires reasoning about class hierarchy and property connections). We simplified or scaled down such queries to fit into an LLM prompt (for instance, instead of thousands of data instances as LUBM would use to test database performance, we include maybe a dozen instances such that the answer is a short list). Similarly, some NL → OWL translation items were inspired by sentences from textbooks or tutorials on OWL (“If something is a mother, then it must be a female parent” as a natural statement to translate into axioms about Mother, Female, Parent). Adapting known benchmark questions ensured that our tasks are meaningful and non-trivial.
Human Verification: Although much of the dataset was auto-generated, we manually verified a subset of tasks from each category to correct any ambiguous wording or unintended tricky corner cases. In particular, for NL↔OWL translation tasks, we had ontology experts confirm that the natural language and OWL statements truly correspond, because subtle linguistic ambiguity could otherwise confuse the model or the evaluator. We also checked all tasks with expected negative answers (e.g., a query that should return “no results” or a subsumption that is not true) to ensure that there was not a hidden entailment we missed—essentially validating that the answer key is correct. The use of multiple reasoners, checking consistency with both HermiT and Pellet, added confidence that the gold labels are reliable across different inference engines.
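The template-based synthesis described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' generator: the function names, the edge-probability parameter, and the item dictionary layout are hypothetical, and gold labels are derived here by construction, whereas the benchmark verifies them with OWL reasoners.

```python
import random


def random_subclass_dag(class_names, edge_prob, rng):
    """Sample subclass edges (sub, sup); sup always precedes sub, so no cycles."""
    edges = set()
    for i, sub in enumerate(class_names):
        for sup in class_names[:i]:
            if rng.random() < edge_prob:
                edges.add((sub, sup))
    return edges


def entailed_subsumptions(edges):
    """Transitive closure of the told subclass edges."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def make_subsumption_items(class_names, edges, rng):
    """Build one entailed ('Yes') and one non-entailed ('No') question, if available."""
    closure = entailed_subsumptions(edges)
    pairs = [(a, b) for a in class_names for b in class_names if a != b]
    positives = sorted(closure)
    negatives = sorted(p for p in pairs if p not in closure)
    items = []
    for candidates, gold in ((positives, "Yes"), (negatives, "No")):
        if candidates:
            sub, sup = rng.choice(candidates)
            items.append({"question": f"Is {sub} a subclass of {sup}?", "gold": gold})
    return items


def make_consistency_item(individual, cls_a, cls_b, contradictory):
    """Disjointness template: a real clash, or a look-alike without one."""
    axioms = [
        f"DisjointClasses({cls_a} {cls_b})",
        f"ClassAssertion({cls_a} {individual})",
    ]
    if contradictory:
        # Same individual asserted into both disjoint classes.
        axioms.append(f"ClassAssertion({cls_b} {individual})")
        gold = "Inconsistent"
    else:
        # A different individual in the second class, so no contradiction arises.
        axioms.append(f"ClassAssertion({cls_b} {individual}_other)")
        gold = "Consistent"
    return {"axioms": axioms, "gold": gold}


rng = random.Random(7)
names = ["Animal", "Mammal", "Bird", "Dog", "Cat", "Sparrow"]
dag = random_subclass_dag(names, 0.4, rng)
for item in make_subsumption_items(names, dag, rng):
    print(item)
print(make_consistency_item("rex", "Dog", "Cat", contradictory=True))
```

Renaming the classes per domain (university, genealogy, medical) yields many distinct items from the same template, which is the coverage mechanism the text describes.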

In terms of benchmark implementation, each task is formatted in a prompt file with a standardized structure. For example, a subsumption task prompt might list the ontology axioms in a readable form (using a Controlled Natural Language or a simple description), followed by the query “Is ClassA a subclass of ClassB? Respond with Yes or No.”. For SPARQL query tasks, we present the data as a list of RDF triples or a short Turtle snippet plus the SPARQL query and ask for the answer. For translation tasks, we explicitly instruct the model on the output format (e.g., “Provide the OWL axiom in Manchester Syntax.”). This careful prompt formatting is done to minimize misunderstandings and focus evaluation on whether the model can solve the reasoning problem when it fully comprehends the task, rather than it being tripped up by presentation issues.

Finally, to facilitate evaluation, we developed an automatic grading pipeline (Algorithm A1). After an LLM produces an answer, the pipeline parses the answer and compares it to the gold solution using appropriate measures. For yes/no and multiple-choice tasks, this is straightforward string matching against the correct option. For list outputs (e.g., a list of entity names in a query result), we compare sets of items ignoring order, and we account for minor format variations (like quoting of IRIs) so that a correct answer is not unfairly marked wrong. For OWL axiom generation, we input the model’s axiom and the reference into a reasoner to check for logical equivalence or entailment; this accounts for cases where the model’s phrasing is different but semantically correct. Each model’s performance is then aggregated per task type and overall. We report standard metrics like accuracy (percentage of tasks answered correctly) for each task category, as well as macro-averages across categories. In some tasks where partial credit is meaningful (e.g., if a model retrieved 4 out of 5 correct answers in a query), we also compute precision and recall, but our primary reported metric for simplicity is accuracy (full correctness on a task). The next section presents the results of running five prominent LLMs through this evaluation setup.
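As an illustration of the list-output comparison described above, the following sketch grades a query answer as an unordered set after light normalization and also reports precision and recall. It is not Algorithm A1 itself; the function names, the normalization rules (stripping angle brackets and quotes), and the convention for empty predictions are assumptions made for this example.

```python
def normalize_entity(token):
    """Strip whitespace, IRI angle brackets, and surrounding quotes."""
    token = token.strip()
    if token.startswith("<") and token.endswith(">"):
        token = token[1:-1]
    return token.strip("\"'")


def grade_yes_no(answer, gold):
    """Case-insensitive string match after trimming whitespace and a final period."""
    return answer.strip().rstrip(".").lower() == gold.strip().lower()


def grade_list_answer(predicted, gold):
    """Strict correctness (set equality) plus relaxed precision and recall."""
    pred = {normalize_entity(t) for t in predicted if t.strip()}
    ref = {normalize_entity(t) for t in gold}
    true_pos = len(pred & ref)
    # Assumed convention: an empty prediction has precision 1.0 only if gold is empty too.
    precision = true_pos / len(pred) if pred else (1.0 if not ref else 0.0)
    recall = true_pos / len(ref) if ref else 1.0
    return {"correct": pred == ref, "precision": precision, "recall": recall}


gold = ["ex:Alice", "ex:Bob", "ex:Carol", "ex:Dave", "ex:Eve"]
answer = ['"ex:Bob"', "<ex:Alice>", "ex:Dave", " ex:Carol "]
print(grade_list_answer(answer, gold))
print(grade_yes_no(" yes. ", "Yes"))
```

In the usage example the model returns four of the five gold entities: the strict metric counts the task as wrong, while the relaxed view records full precision and a recall of 0.8.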

2.3. Tool Invocation and Experimental Settings

All models were evaluated under two controlled conditions: (i) standalone mode, where no external tools were available, and (ii) tool-augmented mode, where an external DL reasoner could be invoked. In the standalone setting, the model received only the task prompt and was required to produce the final answer directly. In the tool-augmented setting, the model was given access to a single reasoning tool (an OWL/DL reasoner) through a uniform function-call interface. Specifically, the model generated a structured tool request containing (a) the ontology axioms (TBox) and, when applicable, assertional facts (ABox), (b) the target reasoning service (e.g., consistency checking, subsumption, instance checking, or entailment/query answering), and (c) the query or axiom to be verified. The tool returned a deterministic result which the model then used to generate the final response. To ensure comparability across models, the same task instances, prompt templates, and output format constraints were used in both conditions; the only difference was the availability of the reasoning tool. For models that do not natively support tool invocation, tool access was disabled and they were evaluated strictly in standalone mode. For models with tool support, we report results for both modes using identical decoding settings and the same tool backend. We additionally specify the tool backend version, timeout limits, maximum number of tool calls per instance and any retry policy to ensure full reproducibility of the experimental setup.
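The three-part tool request described above, (a) axioms, (b) reasoning service, and (c) query, could be represented along the following lines. The paper does not publish the actual schema, so the class name `ToolRequest`, its field names, the service labels, and the `stub_backend` function are all hypothetical; the stub merely stands in for a real OWL reasoner and only recognizes told subclass axioms.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumed labels for the reasoning services listed in the text.
ALLOWED_SERVICES = {"consistency", "subsumption", "instance", "entailment"}


@dataclass
class ToolRequest:
    tbox: List[str]                                  # (a) ontology axioms
    service: str                                     # (b) target reasoning service
    query: str                                       # (c) query or axiom to verify
    abox: List[str] = field(default_factory=list)    # (a) assertional facts, optional


def dispatch(request: ToolRequest, backend: Callable[[ToolRequest], str]) -> str:
    """Validate the request, then hand it to the reasoner backend."""
    if request.service not in ALLOWED_SERVICES:
        raise ValueError(f"unknown reasoning service: {request.service}")
    return backend(request)


def stub_backend(request: ToolRequest) -> str:
    """Placeholder for a real reasoner: answers only told subclass axioms."""
    if request.service == "subsumption":
        return "Yes" if request.query in request.tbox else "No"
    return "Unsupported"


request = ToolRequest(
    tbox=["SubClassOf(Faculty Person)"],
    service="subsumption",
    query="SubClassOf(Faculty Person)",
)
print(dispatch(request, stub_backend))
```

Keeping the backend behind a single callable is one way to hold the interface uniform across models while swapping the underlying reasoner.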

3. Results

We evaluated five contemporary large language models on DL-ReasonSuite: Kimi k1.5, Llama-Nemotron Ultra, DeepSeek-R1, Phi-4 Reasoning Plus, and Phi-4 Reasoning. These models, all released in 2025, were chosen for their strong reasoning or tool-use capabilities as reported by their developers. Phi-4 Reasoning and Phi-4 Reasoning Plus are variants of a fourth-generation “Phi” model, with the Plus version having extended ability to use external tools (e.g., plugins for reasoning). DeepSeek-R1 is a reasoning-specialized model reputed for human-like logical deduction. Llama-Nemotron Ultra represents an open-source foundation model enhanced for logical tasks and Kimi k1.5 is a smaller-scale model fine-tuned for knowledge and reasoning (included to test a lower parameter baseline). Each model was prompted with the identical tasks (with appropriate few-shot examples in prompts when necessary for tool-usage formatting) under the same conditions, and their answers were collected for scoring. Table 1 summarizes the overall performance of each model, along with a breakdown by sub-reasoning track (DL-Core, DL-Query, DL-Bridge).

Figure 2 provides a visual summary of the overall weighted rankings reported in Table 1. Looking first at the overall accuracy, we observe a wide range of performance. The top-performing model, Phi-4 Reasoning Plus, attained 85% overall correctness, a considerable margin above the others. The base Phi-4 Reasoning model is second with 76%, followed by DeepSeek-R1 (72%), Llama-Nemotron Ultra (68%), and finally Kimi k1.5 (61%). These findings align with anticipated trends according to model size and degree of specialization. The smaller Kimi model performs significantly worse than the Phi-4 Reasoning family, especially the Plus variant. However, no model solves every task, which demonstrates that DL-ReasonSuite is difficult enough to discriminate among state-of-the-art LLMs. Moreover, the results show that existing models are still far from saturating the benchmark: even the strongest reasoning LLM, with an overall success rate close to 85%, fails roughly 15% of the formal reasoning problems. Existing Chain-of-Thought (CoT) approaches, which encourage the explicit generation of intermediate reasoning steps in large language models, still leave considerable room for improving reasoning performance. Refs. [32,42] showed that significant accuracy gains were achieved in multi-step reasoning tasks when models were directed to express intermediate inference steps. Similarly, DL-ReasonSuite results show that models exhibiting explicit or implicit CoT-like reasoning behaviors produce more consistent and accurate results, especially in complex DL-Query and DL-Bridge tasks.

3.1. Performance by Track

The breakdown across the three reasoning tracks provides further insights:

DL-Core (Consistency, Subsumption, Instance): DL-Core evaluates the models’ ability to perform the core functions of logical reasoning. All models achieve higher accuracy on this track than on the other tracks. Even the weakest model (Kimi) correctly answered about 76% of core reasoning questions, while Phi-4 Reasoning Plus reached 92% accuracy. The relatively strong results for this track suggest that modern LLMs can handle basic ontology reasoning tasks in small contexts to a significant extent. Many DL-Core tasks essentially reduce to identifying if certain relationships logically follow from others (which can sometimes be achieved by pattern recognition on the given axioms). The Phi-4 Reasoning models and DeepSeek performed exceptionally well, often correctly identifying subtle inconsistencies or entailments. For example, all models easily handled straightforward taxonomic inferences such as “if A ⊆ B and B ⊆ C, does A ⊆ C hold?”, correctly answering “Yes”. Although all models handled such simple entailments, errors in this track tended to occur at the more complex end of the spectrum, particularly in cases involving combinations of constructs such as negation or disjointness. One common mistake for the lower-performing models was missing a hidden contradiction. For instance, in a task where an individual was asserted to belong to two classes that were declared disjoint, some models (notably Kimi and Llama) incorrectly judged the ontology as consistent (which it is not), presumably because they did not fully process the implication of the disjointness axiom. Phi-4 Reasoning Plus and Phi-4 Reasoning (and to a lesser extent DeepSeek) rarely fell for such traps, indicating better logical rigor or training on such patterns.
Another error pattern was in instance checking tasks with longer inference chains (e.g., inferring an instance of a grandparent class via two subclass steps)—Kimi k1.5 sometimes said “No” (not an instance) when the correct answer was “Yes” because it failed to transitively climb the class hierarchy. Overall, Phi-4 Reasoning Plus was the leader in DL-Core tasks (92%), though notably the gap between it and the base Phi-4 Reasoning (90%) is small here. This implies that for straightforward inferences, the internal reasoning of the base model was nearly sufficient and tool use was not often needed.
DL-Query (Entailment-aware SPARQL): In contrast to DL-Core, DL-Query with entailment-aware SPARQL proved difficult for all models. Phi-4 Reasoning Plus had the best accuracy at 80%, while Kimi achieved only 62%. These outcomes are not surprising, since LLMs can struggle to interpret and execute structured, non-trivial queries. SPARQL syntax can be complex for an LLM to interpret, which leads to misapplication of the ontology facts during reasoning. For example, some models would miss filter conditions or misidentify which triple patterns could be satisfied by the given data. This mistake arose in particular in one query task that asked for instances of a class satisfying a certain property condition: some models ignored the constraint and listed all individuals that belonged to the class. This suggests that some models can detect subsumption but fail to enforce constraint satisfaction for the given query, possibly because they approach the constraint through retrieval and miss the strict logical conditions. We also observed that DeepSeek-R1 and the base Phi-4 Reasoning model tended to produce plausible-sounding answers that were not exactly the correct complete set, because the constraint was not fully satisfied. Such models may compensate for part of this complexity through general understanding, but this creates an edge case for grading: without actually satisfying the constraint, the model sometimes mimics overall understanding, returning some correct entities and omitting others.
Our strict scoring therefore marks these answers as incorrect, which heavily penalizes models that approximate the reasoning and achieve only partial recall. We also examined a relaxed metric; under this evaluation, the models often had high precision but varying recall, meaning that they rarely introduced entities that were not answers but sometimes failed to output all valid answers. Phi-4 Reasoning Plus demonstrated 10% better accuracy than its base counterpart under the strict scoring. In a few tasks, the Plus model strategically used the tool to fetch the answer, whereas the base model tried to reason it out and made a mistake. For example, in a query requiring two hops of property traversal, Phi-4 Reasoning Plus internally formulated a simpler sub-query to the reasoner and got the correct result, as seen from its logged Chain-of-Thought, whereas the base Phi-4 Reasoning model tried to reason over it and missed one element of the constraint between the two hops. Overall, Phi-4 Reasoning Plus was the top performer, and DeepSeek at 75% was not far behind Phi-4 Reasoning (78%). These outcomes suggest that DeepSeek’s specialized reinforcement training might have helped in logical query comprehension. Llama and Kimi lagged far behind in this track and often misunderstood the SPARQL structure. While Llama’s answers tended to be generic, some of Kimi’s answers were essentially guesses or irrelevant to the reasoning task.
DL-Bridge (NL → OWL Translation and Tool-Augmented): While DL-Query results were varied, the DL-Bridge track showed the widest separation between clusters of models. The Phi-4 Reasoning Plus model achieved 86% accuracy, vastly higher than the others, thanks largely to its success in the tool-augmented entailment tasks. In contrast, models without tool-use ability suffered on those particular tasks, dragging their average down (Phi-4 Reasoning base 61%, Llama 53%, DeepSeek 55%, Kimi 45%). The performance differences observed in DL-Bridge tasks also demonstrate that how Chain-of-Thought outputs are evaluated is critical. Refs. [35,36] showed that sampling multiple Chain-of-Thought instances instead of a single reasoning chain and selecting the consistent result using a Self-Consistency approach significantly improves reasoning accuracy. This finding helps explain why models relying on a single reasoning trace produce more fragile results, especially in DL-Bridge tasks involving uncertainty, such as NL → OWL translation and tool-assisted inference. To elaborate, the tool-augmented tasks comprised about one-third of the DL-Bridge items. Phi-4 Reasoning Plus, being designed to use external reasoning, solved nearly all of those: it would correctly output a reasoning command (we observed it querying the reasoner for, say, “is X entailed by the ontology?”) and then give the right answer. The other models either skipped the tool usage (some did not even attempt the special command) or tried to reason it out themselves and failed. For example, one tool-augmented query asked if a certain complex class expression was satisfiable given a set of axioms. Only Phi-4 Reasoning Plus managed to get the answer (by delegating to the reasoner); the others gave either an incorrect guess or an “I am not sure” type answer.
Excluding the tool tasks, the gap narrows somewhat in the pure translation tasks: in NL → OWL and OWL → NL, the Phi-4 Reasoning models were still the best, but Llama-Nemotron Ultra was not far behind, showing strength in language understanding. Specifically, for OWL → NL verbalization, all models did reasonably well in simpler cases like stating subclass axioms in English. However, for hard examples, such as verbalizing restrictions that define a restricted subset or complex class expressions with a complete set of restrictions on object properties, only Phi-4 Reasoning Plus responded with clear sentences faithful to the knowledge base. Other models such as Llama-Nemotron Ultra and the base Phi-4 Reasoning produced some awkward or unclear responses with partially incorrect statements, such as omitting an “only” restriction in a sentence. On NL → OWL translation, Phi-4 Reasoning Plus was again the most reliable and produced correct OWL syntax and semantics 90% of the time. The base Phi-4 Reasoning and DeepSeek had some translation failures, with syntactic errors in OWL such as missing parentheses or closing keywords, in addition to logical misinterpretations of complex, fully specified sentences. In our grading, minor syntactic errors receive a partial penalty and can still pass; if the meaning is not clear, the grading drops the syntactic penalty and instead checks whether the translation is correct to determine translation accuracy. While other models made occasional errors in syntax and logical interpretation, Kimi k1.5 struggled the most in translation tasks: it either produced an incorrect OWL axiom or simply restated the sentence in pseudo-English as an intermediate interpretation of its semantics. This suggests that Kimi k1.5 likely lacks training on formal language representation.
These results indicate that the separation between the Phi-based models and the others widens on more complex translations and tasks.

As shown in Figure 3, model performance varies significantly across the three DL reasoning tracks, with DL-Query consistently posing the greatest challenge. Taken together with the DL-Query and DL-Bridge results, this indicates that all models achieve very high accuracy on subsumption and on instance checking between parent and child classes. Beyond these baselines, only the Phi-4 Reasoning Plus model succeeds on tool-augmented entailment tasks. We also note that within each track, there were certain category leaders besides the overall winner, which become visible when we drill down to individual task types. On NL → OWL translation tasks alone, the base Phi-4 Reasoning model was nearly as good as Phi-4 Reasoning Plus, with both achieving 85% exact translation accuracy. On OWL → NL verbalization, Phi-4 Reasoning Plus and the base model were virtually tied at the top with 88% correctness, closely followed by Llama-Nemotron Ultra with 80% accuracy. Despite its lower scores elsewhere, Llama-Nemotron Ultra comes close to the leaders when the task is essentially language generation, such as explaining axioms in English. This suggests that Llama-based models can be reasonably precise in tasks that rely heavily on language fluency and an understanding of simple logic. However, on structured query answering and consistency checking, the Phi models clearly led, suggesting that the combination of scale and fine-tuning on reasoning data gave these models more precise results.

In terms of error analysis, we identified a few notable patterns across models that stem from the nature of LLMs:

Lower-performing models such as Kimi, and to some extent Llama, fell back on plausible guesses based on general knowledge of the subject, even though the reasoning tasks require strictly following the given facts. For example, a query about family relations over a family tree ontology was treated as a common-sense problem and answered incorrectly by Kimi k1.5. The task required a factual check of the provided data, but the model assumed two people were siblings because people with the same parents are typically siblings, overlooking the detail that in the data they were not both listed as children of the same parent. This behavior illustrates that smaller LLMs might revert to learned heuristics if they cannot fully perform the logical deduction, whereas larger models adhered more closely to the evidence.
The tool-augmented approach of Phi-4 Reasoning Plus not only improved correctness but also changed the model’s behavior: in the few cases where Phi-4 Reasoning Plus answered incorrectly, it was often because it did not invoke the tool when it should have (perhaps the prompt did not sufficiently trigger tool use every time). We saw one case where it tried to reason out a complex consistency question itself and gave a wrong answer, whereas if it had made the tool query (as it did in similar tasks), it would have answered correctly. This suggests that meta-reasoning, that is, judging when to delegate to the tool, is a crucial factor: even a slight misjudgment during entailment can lead to failure.
DeepSeek-R1, which was noted to mimic human-like reasoning, sometimes wrote out a long Chain-of-Thought as its answer. While our scoring looked only at final answers, the presence of these reasoning traces (when we examined model logs) was interesting. In some instances, DeepSeek’s explicit reasoning led it to the correct conclusion (like it would enumerate possibilities and rule them out, then answer correctly), but in other instances, the Chain-of-Thought contained a logical slip. For example, in a consistency check, DeepSeek reasoned that “Class X is disjoint with Y and individual a is X and Y, thus this is inconsistent, so answer: inconsistent”—which was correct. Yet in another, it reasoned through a complex property restriction incorrectly and arrived at a wrong consistency verdict. The fact that it tries to reason stepwise is promising, but it also shows that if one step is wrong, the final answer is wrong and unlike a symbolic reasoner, DeepSeek-R1 does not guarantee correctness of each step.
The models’ performance being relatively independent of task complexity up to a threshold (then dropping off) is consistent with recent findings on reasoning models. In our benchmark, simpler instances of each task type were almost always solved by the top models, but beyond a certain complexity (varying per task), performance degraded considerably. For example, most models had no trouble translating a single sentence like “All X are Y” to an OWL subclass axiom, but when given a sentence with multiple clauses and qualifiers, errors became common. Similarly, a SPARQL query with one join was handled well, but for queries with three joins and a filter, the accuracy of most models dropped. This reinforces the notion that there is a complexity boundary for current LLM reasoning: state-of-the-art models handle shallow logic well but can break down on deeper inference that would be routine for a formal reasoner.

Overall, the results provide a comprehensive assessment of the state of LLMs on Description Logic tasks. Figure 4 breaks down DL reasoning performance by individual task types, revealing systematic weaknesses in SPARQL querying and relative strengths in consistency and subsumption tasks. The clear dominance of the Phi-4 Reasoning models with tool augmentation indicates that with sufficient scale and the ability to call external reasoning procedures, state-of-the-art LLMs can reach high success rates even on fairly advanced logical problems. However, the gap between the best model and the rest shows that LLMs differ widely in reasoning ability: specialized training and system design change overall reasoning performance considerably. Even the best model is not immune to misunderstandings and errors in certain task categories, especially SPARQL query answering and complex translations. This leaves significant room for improvement, roughly 15% of the cases, before full semantic reasoning is achieved. Figure 5 highlights SPARQL querying as the most challenging DL task, with all models exhibiting substantially lower accuracy compared to other DL reasoning metrics. In the next section, we discuss the implications of these findings, comparing back to traditional reasoners and considering how we might further close the performance gaps.

3.2. Difficulty-Stratified Performance Analysis

To address task difficulty explicitly, we perform a layered analysis based on observed empirical hardness. Specifically, we partition items into four difficulty tiers (Easy, Medium, Hard, and Very challenging) by computing the per-item mean score across all evaluated models and binning items into quartiles, with the hardest quartile labeled as Very challenging. Table 2 reports overall performance across difficulty tiers, including the number of tasks per tier and mean accuracy with 95% bootstrap confidence intervals.
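
The tiering procedure described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `assign_tiers` and `bootstrap_ci`, the toy scores, the tie-breaking rule, and the choice of a percentile bootstrap with 2000 resamples are all assumptions introduced here.

```python
import random
from statistics import mean

TIERS = ["Easy", "Medium", "Hard", "Very challenging"]

def assign_tiers(scores_by_item):
    """Map item id -> tier using quartiles of the per-item mean score across models."""
    item_means = {item: mean(scores) for item, scores in scores_by_item.items()}
    # Highest mean score first, so the first quartile holds the easiest items.
    # Ties are broken by item id (an assumption; the paper does not specify this).
    ranked = sorted(item_means, key=lambda item: (-item_means[item], item))
    n = len(ranked)
    return {item: TIERS[min(3, rank * 4 // n)] for rank, item in enumerate(ranked)}

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Toy data (hypothetical): eight items, each scored by three models in [0, 1].
scores_by_item = {
    "t1": [1.0, 1.0, 1.0], "t2": [1.0, 1.0, 0.9],
    "t3": [0.9, 0.8, 1.0], "t4": [0.8, 0.7, 0.9],
    "t5": [0.6, 0.7, 0.5], "t6": [0.5, 0.4, 0.6],
    "t7": [0.2, 0.3, 0.1], "t8": [0.0, 0.1, 0.0],
}
tiers = assign_tiers(scores_by_item)

# Scores of the first model on the hardest tier, with a 95% interval.
hardest = [scores_by_item[i][0] for i, t in tiers.items() if t == "Very challenging"]
print(mean(hardest), bootstrap_ci(hardest))
```

Because the tiers are defined from the evaluated models' own scores, they describe observed hardness for this model set and would shift if other models were added.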

Across all models, performance degrades consistently as difficulty increases, confirming that the benchmark exhibits a meaningful and non-trivial difficulty stratification. While near-ceiling performance is observed on Easy items, accuracy drops sharply in the Very challenging tier, with reductions exceeding 40 percentage points for several models. This trend holds across all reasoning tracks, indicating that increased structural and semantic complexity systematically stresses model robustness and that the effect is not an isolated task artifact. A detailed breakdown by subtrack (DL-Core, DL-Query, DL-Bridge-NL, and DL-Bridge-Entail) is provided in Tables A1, A2, and A3.
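
As a quick arithmetic check of the size of these drops, the sketch below subtracts the Very challenging accuracy from the Easy accuracy using the point estimates listed in Table A1 (whose caption associates them with task counts and DL-Core). The dictionary layout and the helper `tier_drop` are illustrative assumptions; only the numbers come from the table.

```python
# Mean accuracy (%) per tier, point estimates copied from Table A1.
table_a1 = {
    "Kimi":     {"Easy": 100.0, "Medium": 83.5, "Hard": 78.3, "Very challenging": 56.4},
    "Llama-N":  {"Easy": 100.0, "Medium": 90.4, "Hard": 83.5, "Very challenging": 64.1},
    "DeepSeek": {"Easy": 100.0, "Medium": 87.2, "Hard": 78.7, "Very challenging": 56.5},
    "Phi+":     {"Easy": 100.0, "Medium": 80.3, "Hard": 72.2, "Very challenging": 48.3},
    "Phi":      {"Easy": 100.0, "Medium": 74.0, "Hard": 66.1, "Very challenging": 42.3},
}

def tier_drop(accuracy, start="Easy", end="Very challenging"):
    """Drop in percentage points between two tiers, rounded to one decimal."""
    return round(accuracy[start] - accuracy[end], 1)

drops = {model: tier_drop(acc) for model, acc in table_a1.items()}
over_40 = sorted(model for model, drop in drops.items() if drop > 40)

print(drops)    # {'Kimi': 43.6, 'Llama-N': 35.9, 'DeepSeek': 43.5, 'Phi+': 51.7, 'Phi': 57.7}
print(over_40)  # ['DeepSeek', 'Kimi', 'Phi', 'Phi+']
```

On these values, four of the five models lose more than 40 percentage points between the easiest and hardest tiers, and Llama-N shows the smallest drop.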

Figure 6 visualizes the systematic degradation in model performance as task difficulty increases under the empirical hardness definition. Across all subtracks, near-ceiling accuracy on Easy items gives way to sharp declines in the Hard and Very challenging tiers. The effect is most pronounced for DL-Bridge-NL and DL-Bridge-Entail, suggesting that language-to-axiom synthesis and entailment-sensitive decisions contribute disproportionately to the hardest instances. Notably, while all models degrade, Llama-Nemotron Ultra consistently retains higher accuracy in the most challenging tier, indicating greater robustness under increased structural and semantic complexity.

4. Discussion

The experimental results from DL-ReasonSuite offer a comprehensive analysis of the formal and DL reasoning capabilities of LLMs and reveal their associated limitations. Figure 7 shows how model rankings change when DL reasoning performance is incorporated. These results underscore the value of evaluating state-of-the-art LLMs on formal reasoning.

The evaluation results indicate that current reasoning-oriented LLMs achieve satisfactory success on straightforward and moderately complex logical structures. However, they still face substantial challenges in scenarios that demand formal accuracy and guaranteed inference, both of which are essential in this domain. The results show that LLMs can learn core DL inference patterns, such as subclass relations, disjointness constraints, and basic consistency checks, to a certain degree of accuracy. This is noteworthy because these logical structures are not explicitly and systematically represented in natural language. This capability can largely be explained by targeted fine-tuning, logical instructions, and chain-of-thought-style reasoning patterns that the models learn to mimic. In this context, LLMs can be said to approximately simulate the behavior of a symbolic reasoner under limited conditions and on easy tasks.

Findings from the ontology consistency checks demonstrate that robust models such as Phi-4 Reasoning Plus can identify implicit contradictions by evaluating multiple axioms together along a correct reasoning path. This capability is notable because logical consistency is not an explicit goal of standard large language model training. Figure 8 contrasts DL reasoning performance with math reasoning, illustrating that strong mathematical ability does not necessarily translate into competence in formal ontology reasoning. Furthermore, the high accuracy rates observed in translation from natural language to OWL statements reveal that LLMs have notable potential as helpful tools in ontology engineering. For specific requirements, the models generated axioms that were either correct or needed only minor syntactic corrections. This result shows how the gap between natural language and logical structures can be bridged with the help of the reasoning capabilities of LLMs.
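
To make concrete what an implicit contradiction across multiple axioms looks like, the following toy sketch detects a clash that only appears once subclass axioms and a disjointness axiom are combined. It is a deliberately small illustration, not an OWL reasoner: it supports only atomic SubClassOf axioms, pairwise disjointness, and class assertions, and the ontology, class names, and function names are hypothetical.

```python
def superclasses(cls, subclass_of):
    """All classes reachable from `cls` via SubClassOf, including `cls` itself."""
    seen, stack = {cls}, [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def find_clashes(subclass_of, disjoint_pairs, assertions):
    """Return (individual, class_a, class_b) triples that violate a disjointness axiom."""
    clashes = []
    for individual, asserted in assertions.items():
        inferred = set()
        for cls in asserted:
            inferred |= superclasses(cls, subclass_of)
        for a, b in disjoint_pairs:
            if a in inferred and b in inferred:
                clashes.append((individual, a, b))
    return clashes

# Hypothetical toy ontology: no single axiom is contradictory on its own.
subclass_of = {"Penguin": ["Bird"], "Bird": ["Animal"], "Oak": ["Plant"]}
disjoint_pairs = [("Animal", "Plant")]
assertions = {"tweety": ["Penguin", "Oak"], "rex": ["Bird"]}

clashes = find_clashes(subclass_of, disjoint_pairs, assertions)
print(clashes)  # [('tweety', 'Animal', 'Plant')]
```

The inconsistency for `tweety` follows only after chaining Penguin ⊑ Bird ⊑ Animal and Oak ⊑ Plant with the disjointness of Animal and Plant, which is the kind of multi-axiom evaluation the consistency tasks require. A full OWL 2 DL reasoner such as HermiT or Pellet handles far richer constructs than this sketch.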

On the other hand, the results also show that LLM-based approaches suffer from structural weaknesses. The most fundamental limitation is the inability to guarantee accuracy and consistency, even for a simple logical representation. Determinism is crucial in logic-based systems, and entailment remains at the core of the reasoning process: a sound and complete reasoner returns error-free results for every query within its logic. LLMs, by contrast, are inherently probabilistic and prone to error. Our experiments show that even the most robust models exhibit significant error rates on basic tasks, and these errors systematically intensify as inference chains lengthen or as interacting constraints increase during reasoning.

The way LLMs are trained also plays a role, beyond the models' working memory and logical depth. LLMs lack the systematic search, backtracking with fully deterministic computation, and proof-generating mechanisms found in symbolic reasoning engines. Instead, their reasoning relies heavily on superficial regularities learned during training and on the context of the given input or query. As a result, although the models frequently produce linguistically fluent answers, their outputs can be logically flawed when a query deviates from familiar patterns or requires precise multi-step inference. Even minor errors of this kind pose significant risks in domains and applications that require formal information processing and explainability.

Interpreting and parsing formal query languages also presents a significant challenge. In languages that require strict syntax and procedural adherence, such as SPARQL, model performance consistently falls short of that on natural language-based tasks because of misunderstood or missing logical structures. Models often attempt to process these queries by internally converting them into pseudo-natural-language representations. This additional layer of abstraction is not sufficient to represent a full logical language, which must be handled as a satisfaction problem over a domain knowledge model. For this reason, the benchmark deliberately includes cases that require multi-step joins and inference over complex structures. These results point to the need for hybrid architectures in which the query interpretation and query execution roles are separated but work in synergy.
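
The difference between plain pattern matching and entailment-aware query answering can be illustrated with a small sketch. It is an assumption-laden toy: triples are plain Python tuples, only `rdfs:subClassOf` type propagation is materialized, the data and prefixes are hypothetical, and the join is hand-coded instead of being executed by a SPARQL engine.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def materialize_types(triples):
    """Add rdf:type triples implied by rdfs:subClassOf until a fixpoint is reached."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_edges = [(s, o) for s, p, o in closed if p == SUBCLASS]
        for s, p, o in list(closed):
            if p != RDF_TYPE:
                continue
            for child, parent in subclass_edges:
                if o == child and (s, RDF_TYPE, parent) not in closed:
                    closed.add((s, RDF_TYPE, parent))
                    changed = True
    return closed

def professors_with_courses(triples):
    """Hand-coded join for: SELECT ?x ?c WHERE { ?x rdf:type :Professor . ?x :teaches ?c }"""
    typed = {s for s, p, o in triples if p == RDF_TYPE and o == ":Professor"}
    return sorted((s, o) for s, p, o in triples if p == ":teaches" and s in typed)

# Hypothetical data: :alice is only asserted to be a :FullProfessor.
data = {
    (":FullProfessor", SUBCLASS, ":Professor"),
    (":alice", RDF_TYPE, ":FullProfessor"),
    (":alice", ":teaches", ":Logic101"),
    (":bob", RDF_TYPE, ":Professor"),
    (":bob", ":teaches", ":DB201"),
}

plain = professors_with_courses(data)
entailed = professors_with_courses(materialize_types(data))
print(plain)     # [(':bob', ':DB201')]
print(entailed)  # [(':alice', ':Logic101'), (':bob', ':DB201')]
```

A model that treats the query as surface pattern matching returns only the asserted answer, whereas the entailment-aware answer also includes `:alice`. Each additional join or filter multiplies the places where such an inferred binding can be missed.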

This synergy is critical in the construction of next-generation LLM-based systems. In such systems, LLMs are reinforced with an external, deterministic reasoner to form a complete model for tasks arising from real-world problems. Because the integration requires the model to decide under what circumstances to engage the tool, it underscores the importance of meta-reasoning capability. The resulting neuro-symbolic architectures are ones in which the LLM handles the unstructured, linguistic components while the formal and computational workload is delegated to symbolic modules.

Our evaluation focuses on task construction and metric reporting rather than an exhaustive analysis of prompt sensitivity. We acknowledge that structured reasoning benchmarks for LLMs can be sensitive to evaluation protocol choices, particularly prompt phrasing, few-shot demonstration selection, decoding stochasticity, and output parsing heuristics. These factors can materially affect performance, especially for tasks involving structured outputs such as OWL axioms and SPARQL query results. Consequently, while we report strong aggregate performance for several models, the reported results should be interpreted as conditional on the specific prompting and parsing setup adopted in this study. Future work will explore systematic robustness analyses across alternative prompting strategies and decoding configurations.

5. Conclusions

In this work, we introduced DL-ReasonSuite, a comprehensive benchmark for evaluating the formal reasoning capabilities of large language models in the context of Description Logic. We constructed 4740 tasks covering ontology consistency checking, taxonomic inference, instance classification, SPARQL query answering with entailment, natural language ↔ OWL translation, and tool-assisted reasoning. These diverse tests are designed to probe and advance the robust logical reasoning of LLMs. The benchmark is carefully designed to represent diverse and challenging Description Logic reasoning and relies on automatic reasoning tools for objective scoring. We have thereby established a rigorous evaluation framework grounded in the different levels of OWL/DL semantics.

We applied the benchmark to five state-of-the-art reasoning-focused models to assess the capabilities of these LLMs on some of the most difficult reasoning performed in knowledge-based systems with Description Logic. First, we observed that LLMs have made notable strides in logical reasoning. The best reasoning models successfully handled the majority of DL tasks, such as subclass relation extraction, ontology consistency checking, and even some complex query inferences. Second, these models have clear limitations on complex logical inferences, and their errors tend to cluster around tasks requiring multi-step or high-precision reasoning, such as intricate SPARQL queries or translations involving nested logical structure. None of the five models achieved near-perfect accuracy across the different categories. These results highlight the gap between neural reasoning and the gold standard set by symbolic reasoners in the literature. Third, augmenting LLMs with symbolic tools proves highly effective. The top-performing model, Phi-4 Reasoning Plus, was combined with an OWL reasoner plugin to solve the hardest problems, and this collaboration boosted its accuracy above that of its peers. These findings indicate that reliable logical reasoning in next-generation large language models requires explicit integration with symbolic mechanisms. Approaches such as Logic-LM demonstrate that coupling LLMs with symbolic solvers leads to more faithful and verifiable logical inferences with better explainability, particularly where inference demands strict adherence to formal semantics over the domain of the knowledge base [35]. More broadly, the results characterize LLMs as neuro-symbolic reasoners and suggest that future progress in formal reasoning will depend on principled hybrid architectures rather than purely neural or purely symbolic solutions [30]. This underscores the promise of neuro-symbolic hybrid approaches in which LLMs and formal algorithms cooperate to achieve better results than either alone.

Our experiments with state-of-the-art LLMs on reasoning tasks revealed both strengths and weaknesses. Large LLMs can generalize many DL reasoning patterns internally, but they can also exhibit brittle behavior, such as overlooking constraints or defaulting to the most plausible or generally correct answer. In particular, edge cases involving negation or restrictions were often resolved incorrectly when they were used to probe implicit reasoning capacity. Formal benchmarks such as DL-ReasonSuite therefore play a crucial role in stress-testing the logical coherence and rigor of LLMs, especially where models are expected to show human-like behavior on such extreme reasoning tasks. These tasks complement existing evaluations by focusing on exacting criteria of correctness that leave little room for error, which also exposes inherent problems of LLMs such as hallucinations. More general or smaller LLMs in particular were confused by the complexity of DL even for small representations, which jeopardizes the tool-use decisions of these models in real-life situations. The benchmark results indicate that incorporating structured reasoning training, or developing better ways for models to internally simulate logic, is crucial for developing more advanced LLMs. At the same time, users tend to hand over domain practices to these models, so more caution is needed when deploying LLMs in knowledge-sensitive domains. Such uses need to be carefully checked with semantically adapted verification mechanisms to ensure reliability on critical tasks.

In addition to overall accuracy results, the difficulty-stratified analysis provides a clearer picture of where current LLM-based reasoners begin to fail under increasing logical complexity for DL tasks. When tasks are grouped by empirically observed hardness, all evaluated models exhibit near-ceiling performance on easy instances but suffer sharp and systematic degradation on harder items with accuracy drops exceeding 40 percentage points in the most challenging tier. This behavior is consistent across all reasoning tracks and is particularly pronounced in DL-Bridge tasks where language-to-axiom translation and entailment-sensitive reasoning impose stricter semantic and structural demands. These results suggest that current performance limitations are not driven by isolated task artifacts but rather reflect a fundamental robustness boundary in handling deeply compositional and constraint-heavy reasoning problems.

Given these results for state-of-the-art reasoning models, we believe this benchmark will spur further research into making LLMs more logically grounded. We expect LLM infrastructure to evolve toward more semantically advanced model architectures, and training regimes to be enriched with logical training and tighter integration with existing reasoning systems. As AI models continue to evolve into more semantic reasoning models, we envision that excellence on formal reasoning benchmarks will become an integral part of the criteria for truly advanced AI. The goal is AI that can move fluidly between natural language and formal knowledge, maintaining the precision and consistency of a symbolic reasoner while retaining the flexibility and understanding of an LLM. DL-ReasonSuite contributes toward that vision by providing a yardstick to measure progress. As future work, we would like to extend the benchmark, improve model performance, and deepen the analysis of reasoning in large language models. This path leads toward autonomous agents, which were the main goal of the Semantic Web in the first place. Such agents could benefit a range of applications, including intelligent knowledge management systems that produce robust and sound decisions and semantically annotated data. The collaboration between LLMs and symbolic models is also ready to integrate with existing Semantic Web resources, such as linked data and semantic web services, and with complex decision-support systems in which the synergy of human-like language skills and uncompromising logical correctness is highly prized.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16041821/s1.

Author Contributions

Conceptualization, M.O. and O.B.; methodology, M.O. and O.B.; software, O.B.; validation, M.O. and O.B.; formal analysis, M.O.; investigation, M.O.; resources, M.O. and O.B.; data curation, O.B.; writing—original draft preparation, M.O. and O.B.; writing—review and editing, M.O. and O.B.; visualization, O.B.; supervision, O.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available datasets were used in this study. ORE 2015 is available at https://doi.org/10.5281/zenodo.18578 (accessed on 17 January 2026), MOWLCorp at https://doi.org/10.5281/zenodo.10851 (accessed on 17 January 2026), LUBM (Univ-Bench OWL) at https://swat.cse.lehigh.edu/onto/univ-bench.owl (accessed on 17 January 2026) (project page: https://swat.cse.lehigh.edu/projects/lubm/) (accessed on 17 January 2026), UOBM generator at https://www.cs.ox.ac.uk/isg/tools/UOBMGenerator/ (accessed on 17 January 2026), and OWL2Bench at https://doi.org/10.5281/zenodo.4764368 (accessed on 17 January 2026) (code: https://github.com/kracr/owl2bench (accessed on 17 January 2026)). All code, derived task instances, and evaluation outputs for this paper are available at https://github.com/okanss/DL-ReasonSuite (accessed on 17 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

General
AI	Artificial Intelligence
LLM	Large Language Model
DL	Description Logic
LLM Related
NLP	Natural Language Processing
RAG	Retrieval-Augmented Generation
CoT	Chain-of-Thought
ToT	Tree-of-Thought
ReAct	Reasoning and Acting
SFT	Supervised Fine-Tuning
RLHF	Reinforcement Learning from Human Feedback
DPO	Direct Preference Optimization
LoRA	Low-Rank Adaptation
PEFT	Parameter-Efficient Fine-Tuning
ICL	In-Context Learning
KG	Knowledge Graph
NER	Named Entity Recognition
QA	Question Answering
NLI	Natural Language Inference
OOD	Out-of-Distribution
AUC	Area Under the Curve
F1	F1-Score
ECE	Expected Calibration Error
Description Logics/OWL/reasoning
OWL	Web Ontology Language
RDF	Resource Description Framework
RDFS	RDF Schema
SPARQL	SPARQL Protocol and RDF Query Language
IRI	Internationalized Resource Identifier
TBox	Terminological Box
ABox	Assertional Box
RBox	Role Box
KB	Knowledge Base
SAT	Satisfiable/Satisfiability
UNSAT	Unsatisfiable/Unsatisfiability
CWA	Closed World Assumption
OWA	Open World Assumption
UNA	Unique Name Assumption
FOL	First-Order Logic
FOPL	First-Order Predicate Logic
DL-Lite	Description Logic Lite family
EL	DL fragment $EL$
ALC	Attributive Language with Complements $ALC$
SROIQ	DL underpinning OWL 2 DL
HermiT	OWL 2 DL Reasoner (tool name)
Pellet	OWL Reasoner (tool name)
FaCT++	OWL Reasoner (tool name)
LUBM	Lehigh University Benchmark
UOBM	University Ontology Benchmark
ORE	Ontology Reasoner Evaluation (competition/corpus)
MOWL	Manchester OWL Corpus (often “MOWLCorp”)

Appendix A

Algorithm A1 Overall Model Scoring Algorithm

Require:
    model_data = {metrics, metadata}
    schema = {categories}
Ensure:
    overall_score ∈ [0, 100]
    category_scores
    normalized_metrics

Step 1: Normalize Metrics
Initialize normalized_metrics ← ∅
for all (k, raw_value) ∈ model_data.metrics do
    if raw_value = None then
        normalized_metrics[k] ← None
    else if k = "codeforces" then
        score_type ← model_data.metadata.get("codeforces_score_type", "percentile")
        normalized_metrics[k] ← normalize_codeforces(raw_value, score_type)
    else
        normalized_metrics[k] ← clamp(raw_value, 0, 100)
    end if
end for

Step 2: Compute Category Scores
Initialize category_scores ← ∅
for all category_name ∈ schema.categories do
    valid_scores ← ∅
    for all metric ∈ schema.categories[category_name].metrics do
        v ← normalized_metrics[metric.key]
        if v ≠ None then
            Append v to valid_scores
        end if
    end for
    if valid_scores ≠ ∅ then
        category_scores[category_name] ← mean(valid_scores)
    else
        category_scores[category_name] ← 0.0
    end if
end for

Step 3: Compute Overall Score
total_score ← 0.0
total_weight ← 0.0
for all (category_name, S_c) ∈ category_scores do
    w_c ← schema.categories[category_name].weight
    if S_c > 0 then
        total_score ← total_score + S_c × w_c
        total_weight ← total_weight + w_c
    end if
end for
if total_weight > 0 then
    overall_score ← total_score / total_weight
else
    overall_score ← 0.0
end if
Return overall_score, category_scores, normalized_metrics
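
Algorithm A1 can be rendered in Python roughly as follows. This is a sketch under stated assumptions, not the authors' released code: the dictionary layout of `model_data` and `schema`, the representation of each category's metrics as a list of keys, the example values, and in particular the body of `normalize_codeforces` (which the algorithm names but does not define) are hypothetical.

```python
from statistics import mean

def clamp(value, low, high):
    """Restrict `value` to the closed interval [low, high]."""
    return max(low, min(high, value))

def normalize_codeforces(raw_value, score_type):
    """Hypothetical placeholder: Algorithm A1 does not define this normalization."""
    if score_type == "percentile":
        return clamp(raw_value, 0, 100)
    raise NotImplementedError(f"no normalization assumed for score type {score_type!r}")

def score_model(model_data, schema):
    # Step 1: normalize every metric to [0, 100]; missing metrics stay None.
    normalized = {}
    for key, raw in model_data["metrics"].items():
        if raw is None:
            normalized[key] = None
        elif key == "codeforces":
            score_type = model_data.get("metadata", {}).get("codeforces_score_type", "percentile")
            normalized[key] = normalize_codeforces(raw, score_type)
        else:
            normalized[key] = clamp(raw, 0, 100)

    # Step 2: a category score is the mean of its available metrics, else 0.0.
    # Metrics absent from the model data are treated like None (an assumption).
    category_scores = {}
    for name, category in schema["categories"].items():
        valid = [normalized[k] for k in category["metrics"] if normalized.get(k) is not None]
        category_scores[name] = mean(valid) if valid else 0.0

    # Step 3: weighted mean over categories with a positive score only.
    total_score = total_weight = 0.0
    for name, score in category_scores.items():
        weight = schema["categories"][name]["weight"]
        if score > 0:
            total_score += score * weight
            total_weight += weight
    overall = total_score / total_weight if total_weight > 0 else 0.0
    return overall, category_scores, normalized

# Hypothetical example inputs.
model_data = {
    "metrics": {"dl_core": 90.0, "dl_query": 70.0, "math": 120.0, "codeforces": None},
    "metadata": {},
}
schema = {"categories": {
    "dl": {"metrics": ["dl_core", "dl_query"], "weight": 0.5},
    "math": {"metrics": ["math"], "weight": 0.25},
    "coding": {"metrics": ["codeforces"], "weight": 0.25},
}}

overall, category_scores, normalized = score_model(model_data, schema)
print(round(overall, 2), category_scores)
```

In this example the `coding` category has no available metric, so it scores 0.0 and, because Step 3 only accumulates categories with a positive score, its weight is left out of the denominator. The overall score is therefore a weighted mean over the remaining categories only, which is worth keeping in mind when comparing models with different metric coverage.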

Appendix B

Table A1. Empirical difficulty tiers (observed hardness) with subtrack breakdown: task counts and DL-Core.

Tier	#Tasks	Core	Query	Bridge-NL	Bridge-Entail	Kimi	Llama-N	DeepSeek	Phi+	Phi
Easy	1075	905	20	60	90	100.0 [100.0, 100.0]	100.0 [100.0, 100.0]	100.0 [100.0, 100.0]	100.0 [100.0, 100.0]	100.0 [100.0, 100.0]
Medium	1075	905	20	60	90	83.5 [81.2, 86.0]	90.4 [88.4, 92.2]	87.2 [84.9, 89.3]	80.3 [77.8, 82.9]	74.0 [71.0, 76.6]
Hard	1075	905	20	60	90	78.3 [75.6, 81.0]	83.5 [81.0, 85.9]	78.7 [75.9, 81.3]	72.2 [69.1, 75.0]	66.1 [63.0, 69.2]
Very challenging	1075	905	20	60	90	56.4 [52.9, 59.4]	64.1 [60.9, 67.4]	56.5 [53.4, 59.8]	48.3 [45.1, 51.6]	42.3 [39.1, 45.6]

Appendix C

Table A2. Empirical difficulty tiers (observed hardness): DL-Query.

Tier	#Tasks	Kimi	Llama-N	DeepSeek	Phi+	Phi
Easy	1075	94.2 [90.8, 96.6]	96.2 [95.3, 97.3]	94.6 [91.1, 96.9]	89.9 [84.6, 94.6]	92.1 [87.8, 95.5]
Medium	1075	81.2 [67.1, 91.9]	91.1 [86.5, 95.3]	80.7 [66.5, 91.9]	89.0 [83.1, 94.1]	66.2 [50.2, 81.3]
Hard	1075	82.6 [77.0, 88.3]	82.8 [77.0, 88.8]	75.8 [63.0, 86.6]	55.6 [38.8, 71.2]	54.1 [37.6, 69.7]
Very challenging	1075	43.2 [28.2, 59.2]	59.4 [45.4, 73.3]	64.8 [49.0, 79.2]	35.4 [19.9, 51.2]	58.1 [43.2, 70.5]

Appendix D

Table A3. Empirical difficulty tiers (observed hardness): DL-Bridge-NL.

Tier	#Tasks	Kimi	Llama-N	DeepSeek	Phi+	Phi
Easy	1075	100.0 [100.0, 100.0]	100.0 [100.0, 100.0]	100.0 [100.0, 100.0]	100.0 [100.0, 100.0]	100.0 [100.0, 100.0]
Medium	1075	94.7 [88.3, 99.7]	94.9 [88.3, 99.9]	96.5 [91.4, 99.9]	91.5 [84.7, 98.2]	86.5 [77.9, 94.7]
Hard	1075	81.7 [71.7, 91.7]	83.3 [73.3, 91.7]	70.0 [58.3, 81.7]	80.0 [70.0, 90.0]	85.0 [75.0, 93.3]
Very challenging	1075	56.7 [43.3, 68.3]	68.3 [56.6, 79.8]	56.7 [43.3, 70.0]	61.6 [48.3, 73.3]	41.6 [29.9, 54.9]

References

Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21), Virtual Event, 3–10 March 2021; ACM: New York, NY, USA, 2021; pp. 610–623. [Google Scholar]
Gendron, G.; Bao, Q.; Witbrock, M.; Dobbie, G. Large language models are not robust reasoners. arXiv 2023, arXiv:2305.19555. [Google Scholar]
McCoy, R.T.; Pavlick, E.; Linzen, T. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Vienna, Austria, 2019; pp. 3428–3448. [Google Scholar]
Baader, F.; Calvanese, D.; McGuinness, D.L.; Nardi, D.; Patel-Schneider, P.F. (Eds.) The Description Logic Handbook: Theory, Implementation and Applications; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
Cuenca Grau, B.; Horrocks, I.; Motik, B.; Parsia, B.; Patel-Schneider, P.F.; Sattler, U. OWL 2: The next step for OWL. J. Web Semant. 2008, 6, 309–322. [Google Scholar] [CrossRef]
Horrocks, I.; Kutz, O.; Sattler, U. The even more irresistible SROIQ. In Proceedings of the 10th International Conference on Principles of Knowledge Representation and Reasoning (KR 2006), Lake, UK, 2–5 June 2006. [Google Scholar]
Sirin, E.; Parsia, B.; Cuenca Grau, B.; Kalyanpur, A.; Katz, Y. Pellet: A practical OWL-DL reasoner. J. Web Semant. 2007, 5, 51–53. [Google Scholar] [CrossRef]
Glimm, B.; Horrocks, I.; Motik, B.; Stoilos, G.; Wang, Z. HermiT: An OWL 2 reasoner. J. Autom. Reason. 2014, 53, 245–269. [Google Scholar] [CrossRef]
Parsia, B.; Matentzoglu, N.; Gonçalves, R.S.; Glimm, B.; Steigmiller, A. The OWL Reasoner Evaluation (ORE) 2015 Competition Report. In Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems, Bethlehem, PA, USA, 11 October 2015; pp. 2–15. [Google Scholar]
Guo, Y.; Pan, Z.; Heflin, J. LUBM: A benchmark for OWL knowledge base systems. J. Web Semant. 2005, 3, 158–182. [Google Scholar] [CrossRef]
Ma, L.; Yang, Y.; Qiu, Z.; Xie, G.; Pan, Y.; Liu, S. Towards a complete OWL ontology benchmark. In Proceedings of the 3rd European Semantic Web Conference (ESWC 2006), Budva, Montenegro, 11–14 June 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 125–139. [Google Scholar]
Singh, G.; Kumar, A.; Bhagat, K.; Bhatia, S.; Mutharaju, R. OWL2Bench: Towards a customizable benchmark for OWL 2 reasoners. In Proceedings of the 19th International Semantic Web Conference (ISWC 2020), Athens, Greece, 2–6 November 2020; pp. 344–349. [Google Scholar]
Matentzoglu, N.; Bail, S.; Parsia, B. A corpus of OWL DL ontologies. In Proceedings of the OWL: Experiences and Directions Workshop (OWLED 2013), Montpellier, France, 26–27 May 2013. [Google Scholar]
Teyou, L.M.K.; Friedrichs, L.; Kouagou, N.J.; Demir, C.; Mahmood, Y.; Heindorf, S.; Ngonga Ngomo, A.C. Neural Description Logic Reasoning over Incomplete Knowledge Bases. In Proceedings of the Thirteenth International Conference on Learning Representations ICLR 2025, Singapore, 24–28 April 2025; Available online: https://openreview.net/forum?id=4qRCiEZGKd (accessed on 17 January 2026).
Li, Z.; Huang, M.; Zhong, Y.; Qin, Y. A Description Logic Based Ontology for Knowledge Representation in Process Planning for Laser Powder Bed Fusion. Appl. Sci. 2022, 12, 4612. Available online: https://www.mdpi.com/2076-3417/12/9/4612 (accessed on 11 January 2026). [CrossRef]
Borgwardt, S.; De Bortoli, F.; Koopmann, P. The Precise Complexity of Reasoning in $ALC$ with $D$ -Admissible Concrete Domains. In Proceedings of the 37th International Workshop on Description Logics (DL 2024), Bergen, Norway, 18–21 June 2024. [Google Scholar]
Lutz, C.; Schulze, L. Description Logics with Abstraction and Refinement: From ALC to EL. In Proceedings of the 21st International Conference on Principles of Knowledge Representation and Reasoning, Hanoi, Vietnam, 2–8 November 2024; pp. 542–552. [Google Scholar]
Manière, Q.; Przybyłko, M. Spectra of Cardinality Queries over Description Logic Knowledge Bases. In Proceedings of the AAAI Conference on Artificial Intelligence 2025, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 15067–15074. [Google Scholar]
Zhang, H.; Jiang, G.; Quan, D. A Theory of Formalisms for Representing Knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence 2025, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 15257–15264. [Google Scholar]
Di Stefano, F.; Manière, Q.; Ortiz, M.; Šimkus, M. Minimal model reasoning in description logics: Don’t try this at home! arXiv 2025, arXiv:2508.05350. [Google Scholar]
Morawska, B.; Marzec, D. Solving Unification in the Description Logic FL_⊥. In Proceedings of the 22nd International Conference on Principles of Knowledge Representation and Reasoning, Melbourne, Australia, 11–17 November 2025; pp. 467–476. [Google Scholar]
Di Stefano, F.; Šimkus, M. Stable Model Semantics for Description Logic Terminologies. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volmue 38, pp. 10484–10492. [Google Scholar]
Casini, G.; Haldimann, J.; Meyer, T. Reasoning in defeasible description logics with system W and lexicographic inference. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning, Melbourne, Australia, 11–17 November 2025; Volume 22, pp. 218–228. [Google Scholar]
Koopmann, P. Explaining Reasoning Results for Description Logic Ontologies. In Joint Proceedings of the 20th and 21st Reasoning Web Summer Schools (RW 2024 & RW 2025); Schloss Dagstuhl–Leibniz-Zentrum für Informatik: Wadern, Germany, 2025; pp. 6:1–6:29. [Google Scholar]
Craig, W. Linear reasoning: A new form of the Herbrand–Gentzen theorem. J. Symb. Log. 1957, 22, 250–268. [Google Scholar] [CrossRef]
Koopmann, P.; Mahmood, Y.; Ngomo, A.C.N.; Tiwari, B. Can You Tell the Difference? Contrastive Explanations for ABox Entailments. arXiv 2025, arXiv:2511.11281. [Google Scholar] [CrossRef]
Luo, X.; Li, H.; Lee, S. Bridging the gap: Neuro-Symbolic Computing for advanced AI applications in construction. Front. Eng. Manag. 2023, 10, 727–735. [Google Scholar] [CrossRef]
Li, H.; Cohan, A.; Witbrock, M.J.; Yang, M.; Liu, F.; Lin, Z.; van Benthem, J.; Clark, P. Workshop on Logical Reasoning of Large Language Models. In Proceedings of the ICLR 2026 Workshop Proposals, The Fourteenth International Conference on Learning Representations, ICLR 2026, Rio de Janeiro, Brazil, 23–27 April 2026. [Google Scholar]
Link, V.; Lohmann, S.; Haag, F. OntoBench: Generating custom OWL 2 benchmark ontologies. In Proceedings of the International Semantic Web Conference (ISWC), Kobe, Japan, 17–21 October 2016; Springer: Cham, Switzerland, 2016; pp. 122–130. [Google Scholar]
Fang, M.; Deng, S.; Zhang, Y.; Shi, Z.; Chen, L.; Pechenizkiy, M.; Wang, J. Large Language Models Are Neurosymbolic Reasoners. In Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI 2024), Vancouver, BC, Canada, 20–27 February 2024; pp. 17985–17993. [Google Scholar]
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903. [Google Scholar]
Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
Pan, L.; Albalak, A.; Wang, X.; Wang, W.Y. Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning. In Findings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023); Association for Computational Linguistics: Singapore, 2023; pp. 3806–3824. [Google Scholar]
Olausson, T.; Gu, A.; Lipkin, B.; Zhang, C.; Solar-Lezama, A.; Tenenbaum, J.B.; Levy, R. 1LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 6–10 December 2023; pp. 5153–5176. [Google Scholar]
Liu, J.; Cui, L.; Liu, H.; Huang, D.; Wang, Y.; Zhang, Y. LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI 2020), Yokohama, Japan, 11–17 July 2020; pp. 3622–3628.
Tafjord, O.; Dalvi, B.; Clark, P. ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), Virtual Event, 7–11 November 2021; pp. 3621–3634.
Yu, W.; Jiang, Z.; Dong, Y.; Feng, J. ReClor: A reading comprehension dataset requiring logical reasoning. arXiv 2020, arXiv:2002.04326.
Parmar, M.; Patel, N.; Varshney, N.; Nakamura, M.; Luo, M.; Mashetty, S.; Mitra, A.; Baral, C. LogicBench: Towards systematic evaluation of logical reasoning ability of large language models. arXiv 2024, arXiv:2404.15522.
Chung, T.T.; Liu, L.; Yu, M.; Yeung, D.Y. DivLogicEval: A framework for benchmarking logical reasoning evaluation in large language models. In Findings of the Association for Computational Linguistics: EMNLP; Association for Computational Linguistics: Miami, FL, USA, 2025; pp. 901–915.
Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.

Figure 1. Weight distribution of reasoning categories used to compute the overall evaluation score.

Figure 2. Overall weighted performance scores of evaluated LLMs across all reasoning categories.

Figure 3. Average model performance across DL-Core, DL-Query, and DL-Bridge reasoning tracks.

Figure 4. DL reasoning accuracy by task type across all evaluated models.

Figure 5. Comparison of SPARQL query performance against average performance on other DL reasoning tasks.

Figure 6. Performance drop-off with increasing observed difficulty (mean ± 95% bootstrap CI). Each panel corresponds to a reasoning subtrack (DL-Core, DL-Query, DL-Bridge-NL, DL-Bridge-Entail). Difficulty tiers are defined by empirical hardness (quartiles based on per-item mean score across models). All models exhibit a consistent decline as task difficulty increases, indicating reduced robustness under harder items.

Figure 7. Model ranking changes before and after incorporating DL reasoning metrics.

Figure 8. Relationship between DL reasoning and math reasoning performance across models.

Table 1. Performance of five reasoning LLMs on DL-ReasonSuite. Scores are the percentage of tasks answered correctly (accuracy) in each track and overall. Phi-4 Reasoning Plus achieves the highest score in every column.

Model	DL-Core	DL-Query	DL-Bridge	Overall
Phi-4 Reasoning Plus	92%	80%	86%	85%
Phi-4 Reasoning	90%	78%	61%	76%
DeepSeek-R1	85%	75%	55%	72%
Llama-Nemotron Ultra	80%	68%	53%	68%
Kimi k1.5	76%	62%	45%	61%

Table 2. Difficulty-stratified performance using observed hardness (empirical tiers). Entries report mean score with 95% bootstrap CI.

Difficulty Tier	Kimi k1.5	Llama-Nemotron Ultra	DeepSeek-R1	Phi-4 Reasoning Plus	Phi-4 Reasoning
Easy	99.7% [99.4, 99.9]	99.8% [99.6, 100.0]	99.7% [99.4, 99.9]	99.8% [99.7, 99.9]	99.7% [99.4, 99.9]
Medium	84.1% [81.9, 86.2]	90.2% [88.4, 92.0]	87.7% [85.8, 89.7]	81.2% [78.8, 83.4]	73.5% [70.9, 76.2]
Hard	77.8% [75.3, 80.2]	83.3% [81.0, 85.5]	77.0% [74.5, 79.5]	71.7% [69.0, 74.4]	66.7% [63.8, 69.4]
Very challenging	56.0% [53.0, 59.0]	63.9% [60.9, 66.8]	56.1% [53.3, 59.0]	48.0% [45.1, 50.9]	42.6% [39.6, 45.6]
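
The tiering and interval estimates behind Table 2 can be illustrated with a short sketch. This is not the authors' evaluation code. The function names `assign_tiers` and `bootstrap_ci`, the percentile bootstrap, the number of resamples, the random seed, and the handling of ties at quartile boundaries are all assumptions made for illustration; the source states only that tiers are quartiles of the per-item mean score across models and that entries are means with 95% bootstrap CIs.

```python
import random
from statistics import mean, quantiles

# Tier names as they appear in Table 2, from easiest to hardest.
TIERS = ("Easy", "Medium", "Hard", "Very challenging")


def assign_tiers(item_scores):
    """Map each item to a difficulty tier by quartile of its mean score.

    item_scores: dict of item id -> list of per-model scores in [0, 1].
    A higher mean score across models means an empirically easier item.
    Items that fall exactly on a cut point go to the harder tier (assumption).
    """
    hardness = {item: mean(scores) for item, scores in item_scores.items()}
    q1, q2, q3 = quantiles(list(hardness.values()), n=4)
    tiers = {}
    for item, score in hardness.items():
        if score > q3:
            tiers[item] = TIERS[0]
        elif score > q2:
            tiers[item] = TIERS[1]
        elif score > q1:
            tiers[item] = TIERS[2]
        else:
            tiers[item] = TIERS[3]
    return tiers


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean.

    n_resamples and seed are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample items with replacement and record the mean of each resample.
    resampled = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return mean(scores), resampled[lo_idx], resampled[hi_idx]
```

To fill one cell of such a table under these assumptions, one would collect a single model's per-item scores within one tier and pass them to `bootstrap_ci`. Resampling items rather than models matches a reading in which the interval reflects item-level variability; the source does not say which unit was resampled.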
