Article

Geometry Diagram Parsing and Reasoning Based on Deep Semantic Fusion

1 Information Engineering Institute, North China University of Water Resources and Electric Power, Zhengzhou 450046, China
2 Zhongshui Culture Technology (Zhengzhou) Co., Ltd., Zhengzhou 450046, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Symmetry 2026, 18(1), 92; https://doi.org/10.3390/sym18010092
Submission received: 16 September 2025 / Revised: 2 October 2025 / Accepted: 8 October 2025 / Published: 4 January 2026
(This article belongs to the Special Issue Symmetry and Asymmetry in Human-Computer Interaction)

Abstract

Effective Automated Geometric Problem Solving (AGP) requires a deep integration of visual perception and textual comprehension. To address this, we propose a dual-stream fusion model that injects deep semantic understanding from a Pre-trained Language Model (PLM) into the geometric diagram parsing pipeline. Our core innovation is a Semantic-Guided Cross-Attention (SGCA) mechanism, which uses the global semantic intent of the problem text to direct attention toward key visual primitives. This yields context-enriched visual representations that serve as inputs to a Graph Neural Network (GNN), enabling relational reasoning that is not only perception-driven but also context-aware. By explicitly bridging the semantic gap between text and diagrams, our approach delivers more robust and accurate predictions. To the best of our knowledge, this is the first study to introduce a semantic-guided cross-attention mechanism into geometric diagram parsing, establishing a new paradigm that effectively addresses the cross-modal semantic gap and achieves state-of-the-art performance. This is particularly effective for parsing problems involving geometric symmetries, where textual cues often clarify or define symmetrical relationships not obvious from the diagram alone.

1. Introduction

Automated Geometric Problem Solving (AGP) is a long-standing challenge in artificial intelligence. It holds immense potential for applications such as intelligent education platforms, adaptive learning systems, and computer-aided instruction tools [1]. A typical geometry problem consists of a diagram and accompanying text, which together define the complete problem information. The primary and critical step towards solving such problems is Geometric Diagram Parsing, a task aimed at producing a structured interpretation of the diagram by identifying its geometric primitives (e.g., points, lines, circles) and determining their interrelations. Crucially, many geometric problems hinge on recognizing explicit or implicit symmetrical properties, such as congruent triangles, equal line segments, or angle bisectors. The ability to accurately parse these symmetrical relationships is therefore fundamental to successful automated reasoning.
Consider a simple geometry problem: a diagram shows two lines that appear to be parallel, but the accompanying text explicitly states, “line AB is perpendicular to line CD.” A model relying solely on visual cues would almost certainly infer a “parallel” relationship, leading to a cascade of incorrect deductions. Conversely, a system that cannot ground the textual phrase “line AB” to the corresponding visual primitives is equally ineffective. This simple scenario highlights the critical challenge: true geometric understanding emerges only from the seamless synthesis of visual perception and linguistic comprehension. The inability to bridge this “semantic gap” is the primary bottleneck hindering the performance and reliability of current AGP systems, a challenge our work directly confronts.
Early explorations in AGP centered on symbolic reasoning and expert systems [2]. These methods relied on hand-crafted rules and logical axioms to represent geometric knowledge. However, their fundamental limitation was a lack of flexibility. While proficient in formal logical deduction, they struggled to handle the inherent ambiguity and noise present in real-world diagrams and text. This inherent fragility hindered their scalability and constrained their practical utility.
The advent of computer vision and machine learning prompted a paradigm shift. Subsequent approaches began to incorporate traditional image processing techniques. Early methods like the Hough transform were used for primitive detection, but this was later refined by more robust techniques such as the Probabilistic Hough Transform [3] and robust primitive fitting with RANSAC [4]. More advanced detectors like the Line Segment Detector (LSD) [5] offered further improvements. Systems like GeoS [2] combined text parsing with diagram interpretation, but their reliance on rule-based or template-driven parsing limited their ability to handle the diverse expressions found in natural language. Such methods generally failed to bridge the “semantic gap” between raw visual features and high-level concepts described in the text.
Today, deep learning models, particularly Graph Neural Networks (GNNs), have become the dominant technical approach. Models like PGDPNet [6] have demonstrated exceptional performance in parsing well-structured diagrams. Their core weakness, however, lies in an over-reliance on purely visual information, which prevents them from effectively integrating the full context provided by natural language descriptions. This limitation becomes especially pronounced when the diagram itself is ambiguous. For example, a critical piece of information, such as a line being an angle bisector, might only be stated in the text. A model that ignores this textual cue is prone to making incorrect assumptions, thereby invalidating all subsequent reasoning. This disconnect between visual and textual understanding—known as the semantic gap—remains the central obstacle to advancing AGP systems [7].
To formalize this challenge, the task of multimodal geometric diagram parsing is defined as follows: given a geometry problem instance as a pair (I, T), where I is the diagram image and T is the accompanying text, the objective is to train a model f that outputs a structured semantic graph G = (P, R). Here, P = {p_1, ..., p_n} is the set of n geometric primitives detected in image I, where each primitive p_i carries attributes such as its type and coordinates. R is the set of relations among these primitives, where each relation r ∈ R is a tuple (p_i, p_j, l_k), signifying that a relation of type l_k (e.g., l_k ∈ {Parallel, Perpendicular}) holds between primitives p_i and p_j. The principal challenge resides in the fact that accurate inference of P and R necessitates a joint, holistic understanding of I and T, as either modality in isolation may present information that is insufficient or ambiguous.
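For concreteness, the target output G = (P, R) can be represented as a small graph data structure. The sketch below (hypothetical class and field names, purely illustrative and not the paper's implementation) shows one possible encoding in Python.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical encoding of the parsing target G = (P, R); names are illustrative only.

@dataclass
class Primitive:
    pid: int                     # index of the primitive p_i
    ptype: str                   # e.g., "point", "line", "circle"
    coords: Tuple[float, ...]    # coordinates / parameters of the primitive

@dataclass
class Relation:
    src: int                     # index of p_i
    dst: int                     # index of p_j
    label: str                   # relation type l_k, e.g., "Parallel", "Perpendicular"

@dataclass
class SemanticGraph:
    primitives: List[Primitive]  # P = {p_1, ..., p_n}
    relations: List[Relation]    # R = {(p_i, p_j, l_k)}

# Example instance for "line AB is perpendicular to line CD".
graph = SemanticGraph(
    primitives=[Primitive(0, "line", (0.0, 0.0, 1.0, 0.0)),
                Primitive(1, "line", (0.5, -1.0, 0.5, 1.0))],
    relations=[Relation(0, 1, "Perpendicular")],
)
```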
In response, the central thesis of this paper is that a truly effective AGP system must integrate powerful visual perception with deep textual comprehension. Based on this principle, we propose a novel dual-stream fusion model that deeply integrates a Pre-trained Language Model (PLM) into the geometric diagram parsing pipeline. The main contributions of this work are threefold:
  • Innovative Dual-Stream Fusion Architecture: We design an end-to-end framework that effectively injects deep textual semantics into a state-of-the-art visual parsing pipeline, specifically engineered to address the semantic gap in geometric reasoning.
  • Semantic-Guided Cross-Attention Mechanism: We introduce a novel fusion mechanism that translates the global semantic “intent” of the problem text into dynamic attention weights in the visual space, facilitating efficient and interpretable diagram-text alignment.
  • Semantically-Enriched Graph Reasoning: We establish a new paradigm where the GNN reasons over a semantically-enriched graph. By feeding the GNN context-aware node features from our SGCA mechanism, we transform the reasoning process from being purely perception-based to context-aware, enabling more robust and accurate relational inference.

2. Related Work

This research lies at the intersection of three domains: geometric diagram parsing, natural language understanding, and multimodal reasoning. The following sections review relevant literature from these perspectives [8].

2.1. Geometric Diagram Parsing

The interpretation of geometric diagrams is a foundational task in AGP. Early techniques often relied on traditional computer vision methods, such as using the Hough transform to detect simple geometric primitives. However, these methods are sensitive to image quality and perform poorly on complex or cluttered diagrams. The emergence of deep learning transformed the research landscape. Modern approaches commonly reframe primitive detection as a standard object detection task, employing powerful convolutional neural networks like Feature Pyramid Networks (FPN) [9] as backbones.
A state-of-the-art example in this area is PGDPNet, which also serves as the visual foundation for our work. PGDPNet pioneered a two-stage framework: in the first stage, an object detection module (based on FCOS [10]) identifies geometric primitives; in the second, it constructs a graph from these primitives and leverages a GNN [11] to infer their interrelations. From a purely visual standpoint, PGDPNet exhibits outstanding performance in diagram parsing. Its primary limitation, however, is its failure to incorporate the rich semantic information embedded in the accompanying problem text. This deficiency makes it prone to errors in scenarios where the diagram is ambiguous or information is incomplete—a gap our work aims to fill. Recent generative approaches have also explored unified frameworks for this task, further highlighting the need for robust multimodal integration [12,13].

2.2. Text Understanding in Geometry Problems

Parallel to diagram parsing, understanding the problem text is equally crucial. Early systems like GeoS [2] and Inter-GPS [14] primarily used rule-based parsers or formal language grammars to extract information from text. While effective for structured or simplified language, this approach lacks the flexibility to handle the full diversity of natural language.
The development of Pre-trained Language Models (PLMs), particularly BERT [15,16] and its variants like DistilBERT [17], has revolutionized the field of natural language understanding. These models, pre-trained on massive text corpora, can capture deep contextual dependencies far beyond surface-level pattern matching [18,19]. Their powerful semantic representation capabilities have opened a promising path to overcoming the limitations of older text parsing techniques in the geometry domain. Our research is among the first to deeply integrate a powerful PLM into a GNN-based parsing framework for this task [20,21].

2.3. Multimodal Fusion

Effectively solving geometry problems requires the intelligent fusion of information from both diagrams (vision) and text (language). This need aligns with the broader trend of multimodal learning in AI research. In domains such as Visual Question Answering (VQA) [22] and image-text retrieval, sophisticated fusion models like LXMERT [23], UNITER [24], and ViLBERT [25] have achieved tremendous success [26]. More recent works have further demonstrated the power of learning joint representations from vast amounts of image-text data through contrastive learning paradigms [27]. Other studies have also explored diagram-text fusion from various perspectives [28,29,30,31], and the broader field continues to advance rapidly. Recent large vision-language models such as BLIP and Flamingo have set new benchmarks in joint multimodal understanding, highlighting a clear trend towards more powerful and unified architectures for cross-modal reasoning.
A key mechanism in these models is cross-modal attention [32]. This mechanism allows elements from one modality (e.g., words in a question) to attend to and highlight relevant regions in another modality (e.g., objects in an image), enabling fine-grained, dynamic alignment between text and vision. Inspired by these advancements, our work introduces a novel semantic-guided cross-attention mechanism tailored for the geometry domain. Unlike generic fusion strategies, our approach leverages the high-level semantic “intent” extracted from the entire problem text to guide the focus of the visual parser, aiming for a more targeted and efficient integration of diagram and text.

3. Methodology

This section details the architectural design and core mechanisms of our proposed model. We first present the overall framework, followed by detailed descriptions of the modality-specific encoders, our novel Semantic-Guided Cross-Attention (SGCA) fusion module, and the final graph-based reasoning network.
To bridge the semantic gap between visual diagrams and natural language, our model employs a deep integration strategy that combines a text encoding module with a visual parsing framework. As shown in Figure 1, the architecture takes a diagram-text pair as input, processes them through parallel visual and textual streams, and merges these streams in a novel fusion stage where semantic information from the text guides the interpretation of the diagram. This process yields context-aware visual features, which are then utilized by a graph-based reasoning network to determine the final geometric relations.

3.1. Overall Architecture

Our model processes diagram-text pairs as input. Its architecture consists of two primary data streams: a visual pathway and a textual pathway. The visual pathway is responsible for identifying geometric primitives in the diagram and extracting their low-level visual features, leveraging the powerful front-end of the PGDPNet model [6]. Concurrently, the textual pathway uses a pre-trained Transformer model, DistilBERT [17], to process the natural language description of the problem. Its goal is to generate a single, high-level semantic vector that captures the global context and intent.
Our architectural design deliberately adopts a unidirectional, text-to-vision guidance paradigm rather than a more complex bidirectional co-attention mechanism. This choice is rooted in the fundamental nature of geometry problems. Unlike general VQA tasks, where text and image may contribute equally and interactively, in geometry parsing the text often serves a specific, hierarchical role: it provides definitive constraints, clarifications, or information entirely absent from the diagram. Therefore, our model treats the textual representation as a high-level "instruction set" that modulates and refines the interpretation of the visual data. A bidirectional approach, in which visual information could also disambiguate the text, is theoretically possible but less common in this domain. This focused approach not only proves highly effective for resolving ambiguities but also offers greater computational efficiency and interpretability compared to more entangled fusion strategies.
The innovation of our model lies in how these two streams are fused. We introduce a Semantic-Guided Cross-Attention (SGCA) mechanism as the core fusion module. This module uses the textual semantic vector as a query to dynamically re-weight the visual features, effectively allowing the text to guide the model’s attention to the most relevant primitives in the diagram. The resulting semantically-enriched visual features are then organized into a graph structure. Finally, a Graph Neural Network (GNN) performs reasoning on this graph to model complex interactions between primitives and predict their relationships, generating the final structured interpretation of the problem.

3.2. Modality-Specific Feature Encoding

To effectively fuse information from the visual and textual modalities, we first need to generate powerful and structured representations for each. This section details the distinct encoding pathways for processing the geometric diagram and its accompanying text.

3.2.1. Visual Pathway: Primitive Feature Extraction

The foundation of our visual understanding is built upon the robust parsing capabilities of a state-of-the-art geometric diagram parser, specifically the front-end of PGDPNet. This pathway is responsible for identifying geometric primitives and extracting their initial, context-agnostic visual features.
Given a diagram image I, the visual encoder Φ_vis identifies a set of N geometric primitives P = {p_1, p_2, ..., p_N}. For each detected primitive p_i, the model extracts a corresponding D_v-dimensional feature vector v_i from the convolutional feature maps, encapsulating its localized visual appearance.
The process can be formalized as:
{(p_1, v_1), (p_2, v_2), ..., (p_N, v_N)} = Φ_vis(I)
where v_i ∈ R^(D_v); each feature vector v_i has dimension D_v and represents the visual characteristics of the i-th primitive.
To create a unified representation for subsequent processing, we aggregate these individual primitive features into a single visual feature matrix, denoted as V . This is achieved by stacking the N feature vectors:
V = [v_1^T; v_2^T; ...; v_N^T] ∈ R^(N×D_v)
It is crucial to note that at this stage, the matrix V contains purely visual information. It describes the appearance of each primitive in isolation, without incorporating the critical relationships and constraints specified in the accompanying text.
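As a minimal illustration of this stacking step (assuming the detector already returns per-primitive feature vectors; all names and dimensions are hypothetical), the aggregation into V amounts to:

```python
import torch

# Hypothetical per-primitive features from the visual front-end: N = 4 primitives,
# D_v = 256-dimensional appearance features (in practice taken from the detector's feature maps).
primitive_features = [torch.randn(256) for _ in range(4)]

# Stack the N feature vectors row-wise into the visual feature matrix V of shape (N, D_v).
V = torch.stack(primitive_features, dim=0)
print(V.shape)  # torch.Size([4, 256])
```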

3.2.2. Textual Pathway: Global Semantic Encoding

The textual pathway is designed to distill the problem description into a single, comprehensive vector representing its global semantic context. We employ DistilBERT [17], a lightweight and efficient variant of the BERT model, for this task.
Let the input text be a sequence of tokens T = (t_1, t_2, ..., t_L). After pre-processing with special tokens, the sequence is fed into the DistilBERT model Φ_lang. The model's multi-layer Transformer architecture outputs a sequence of final hidden states H = (h_[CLS], h_1, ..., h_L, h_[SEP]), where each h_i ∈ R^(D_h) and D_h is the hidden dimension of the language model.
To distill this rich, token-level understanding into a single, actionable vector, we utilize the final hidden state corresponding to the special [CLS] token, as it is pre-trained to aggregate sentence-level semantics. We define this global semantic context vector, C t e x t , as:
C_text = h_[CLS] ∈ R^(D_h)
This vector serves as a holistic summary of the text's meaning.
It effectively captures the core “intent” or “constraints” of the problem statement (e.g., the presence of perpendicularity, bisection, or parallelism), making it an ideal semantic query for guiding the subsequent fusion process. Our choice of DistilBERT is motivated by the need to balance performance with computational efficiency. While specialized models like MathBERT or RoBERTa-math are pre-trained on domain-specific corpora and may offer deeper understanding of mathematical terminology, they also come with a significantly larger parameter count and higher computational overhead. Our experiments (see Section 4.3) demonstrate that DistilBERT, despite being a general-purpose model, provides a powerful enough semantic representation to achieve state-of-the-art results when integrated into our fusion architecture. This makes our approach more accessible and scalable, striking an optimal trade-off for the geometric parsing task.
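A minimal sketch of this encoding step with the Hugging Face Transformers library (the checkpoint name matches the one stated in Section 4.1; the example text is illustrative) might look like:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the general-purpose DistilBERT encoder used as the textual pathway.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

text = "Line AB is perpendicular to line CD, and D is the midpoint of BC."
inputs = tokenizer(text, return_tensors="pt")  # adds [CLS] and [SEP] automatically

with torch.no_grad():
    outputs = model(**inputs)

# The first position of the last hidden state corresponds to the [CLS] token;
# it is taken as the global semantic context vector C_text (shape: 1 x D_h = 1 x 768).
c_text = outputs.last_hidden_state[:, 0, :]
print(c_text.shape)  # torch.Size([1, 768])
```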

3.3. Semantic-Guided Multimodal Fusion

The core innovation of our model is the Semantic-Guided Cross-Attention (SGCA) mechanism, designed to effectively bridge the semantic gap between textual descriptions and visual elements. This mechanism situates the fusion task within a standard attention framework [32]. We use the global text vector C_text generated in Section 3.2 as the sole Query, representing the high-level constraints imposed on the diagram. The visual feature matrix V ∈ R^(N×D_v), composed of the N stacked primitive features, serves as the basis from which the Key and Value sets are derived via distinct learnable linear projections.
Specifically, we define three learnable projection matrices W_Q, W_K, and W_V to generate the Query Q, the Key matrix K, and the Value matrix V_val:
Q = C_text W_Q
K = V W_K
V_val = V W_V
The attention output is then computed using the scaled dot-product attention formula, where the query Q scores the keys K and the resulting scores are used as weights over the values V_val:
Attention(Q, K, V_val) = softmax(Q K^T / √d_k) V_val
where d_k denotes the dimensionality of the key vectors.
This formulation operationalizes the concept of ‘semantic guidance’: the global intent of the text ( Q ) dynamically interrogates all visual primitives ( K ) to determine the precise allocation of attention for each primitive ( V v a l ) during the construction of the fused representation. Finally, to preserve the original visual information while incorporating the new context, we employ a residual connection that adds the attention output to the initial visual features V , followed by Layer Normalization to stabilize training. This complete fusion process, detailed in Algorithm 1, yields the context-enriched feature matrix V f u s e d , providing a high-quality input for the subsequent GNN-based reasoning.
Algorithm 1: Semantic-Guided Cross-Attention (SGCA)
Input: Visual feature matrix V ∈ R^(N×D_v);
       Global text context vector C_text ∈ R^(1×D_h);
       Learnable matrices W_Q ∈ R^(D_h×d_k), W_K ∈ R^(D_v×d_k), W_V ∈ R^(D_v×D_v).
Output: Context-enriched visual feature matrix V_fused ∈ R^(N×D_v).
  • // Project the global text vector to the Query space.
  • Q ← C_text W_Q
  • // Project the visual feature matrix to the Key space.
  • K ← V W_K
  • // Project the visual feature matrix to the Value space.
  • V_val ← V W_V
  • // Compute attention scores based on text-vision similarity.
  • scores ← softmax(Q K^T / √d_k)
  • // Create a context summary by weighting the values with the attention scores.
  • V_attn ← scores · V_val
  • // Additively fuse the context summary with the original visual features.
  • V_fused ← LayerNorm(V + V_attn)
  • // Return the final semantically-enriched features.
  • return V_fused
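A compact PyTorch sketch of this procedure is given below. It is a minimal re-implementation for illustration only (assuming D_v = 256, D_h = 768, and d_k = 64), not the authors' released code.

```python
import math
import torch
import torch.nn as nn

class SGCA(nn.Module):
    """Semantic-Guided Cross-Attention: one global text query re-weights all visual primitives."""

    def __init__(self, d_v: int = 256, d_h: int = 768, d_k: int = 64):
        super().__init__()
        self.w_q = nn.Linear(d_h, d_k, bias=False)   # W_Q: text -> query space
        self.w_k = nn.Linear(d_v, d_k, bias=False)   # W_K: visual -> key space
        self.w_v = nn.Linear(d_v, d_v, bias=False)   # W_V: visual -> value space
        self.norm = nn.LayerNorm(d_v)
        self.d_k = d_k

    def forward(self, v: torch.Tensor, c_text: torch.Tensor) -> torch.Tensor:
        # v: (N, d_v) stacked primitive features; c_text: (1, d_h) global [CLS] vector.
        q = self.w_q(c_text)                                               # (1, d_k)
        k = self.w_k(v)                                                    # (N, d_k)
        v_val = self.w_v(v)                                                # (N, d_v)
        scores = torch.softmax(q @ k.t() / math.sqrt(self.d_k), dim=-1)   # (1, N) attention weights
        v_attn = scores @ v_val                                            # (1, d_v) context summary
        return self.norm(v + v_attn)                                       # broadcast residual + LayerNorm -> (N, d_v)

# Usage: fuse 4 primitive features with one DistilBERT [CLS] vector.
fused = SGCA()(torch.randn(4, 256), torch.randn(1, 768))
print(fused.shape)  # torch.Size([4, 256])
```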
To provide a more intuitive understanding of the SGCA mechanism, it can be conceptualized as a “semantic spotlight.” The global text vector acts as a high-level instruction or query, akin to asking, “Given the textual context of perpendicularity, which visual elements in the diagram are most relevant?” This single query is then compared against all visual primitive features (the keys). The resulting attention scores act as the brightness control for a spotlight, shining most intensely on the visual primitives (the values) that are most pertinent to the textual command. This process effectively filters and re-weights the visual information, ensuring that the features passed to the GNN are not just seen, but are understood in the context of the problem’s textual description.
Comparative Analysis with Existing Methods: The proposed SGCA mechanism differs fundamentally from the cross-attention modules in general-purpose vision-language models like LXMERT [23] and UNITER [24]. First, in terms of guidance strategy, general models often use word tokens from the text as queries to perform fine-grained, many-to-many diagram-text alignment. In contrast, our SGCA uses a single, global sentence vector as the query. This design is tailored to the nature of geometry tasks, where constraints (e.g., “perpendicular,” “angle bisector”) typically reflect the holistic intent of an entire sentence rather than the property of a single word. A global vector more effectively captures this high-level semantic “instruction.” Second, regarding the computational path, SGCA implements a unidirectional, text-to-vision guidance flow with a clear objective: to use text to resolve visual ambiguity. This differs from the complex bidirectional or multi-layer co-attention structures in general models, which aim to learn generic joint multimodal representations. Consequently, our method achieves precise guidance while maintaining higher computational efficiency and stronger task specificity. This global approach is intentionally chosen over token-level attention because geometric constraints are typically holistic; for instance, the concept of “perpendicularity” is conveyed by the entire sentence, not by any single word. Using a global vector prevents the model from being distracted by non-critical words and provides a more stable and focused semantic signal for guiding the visual analysis.

3.4. Graph-Based Relational Reasoning

Once the primitive features have been enriched into V_fused, the final task is to infer the complex network of relationships among them. We model this problem using a graph-based approach. A fully connected graph is constructed in which each of the N primitives corresponds to a node, initialized with its respective context-aware feature vector from V_fused. The objective is to predict the class of each edge in this graph, where the class represents the relationship between the two connected nodes.
To perform this complex relational reasoning, we employ a Graph Neural Network (GNN). Analysis of GNN Variant Selection: We adopt the Edge-gated Graph Attention Network (EGAT) from the PGDPNet baseline. This choice is deliberate. Unlike node-centric attention models like GAT [33] or GATv2, which focus on computing the importance of neighboring nodes to a central node, EGAT introduces an edge-gating mechanism. This mechanism allows the model to dynamically control the strength of information flow between nodes based on edge features. In the geometric relation parsing task, the relationship itself is an edge property (e.g., the “parallel” relation between line A and line B). Consequently, EGAT’s edge-centric modeling paradigm is intrinsically aligned with the fundamental nature of the task, enabling it to capture the unique interaction patterns of different relationship types more directly and flexibly. In contrast, models like GraphSAGE, which emphasize neighborhood aggregation, are less advantageous for our small, dense, fully connected graphs. EGAT’s structure gives it a natural advantage in modeling pairwise relations. This aligns with a broader trend of designing geometry-aware GNNs that explicitly model the geometric properties of the underlying data [34,35].
The GNN operates through an iterative message-passing process. In each iteration, every node aggregates feature information from its neighbors, while the EGAT mechanism uses its gates to control the flow of information along each edge. This iterative refinement process enables the model to reason about higher-order dependencies. This entire reasoning pipeline is formalized in Algorithm 2. After a fixed number of iterations, the final node embeddings are used for the final classification of each pairwise relationship. By providing the GNN with semantically rich features from the outset, our model demonstrates a far superior ability to correctly identify relationships that are heavily or entirely dependent on textual descriptions, leading to a more robust and accurate final parsing result.
Algorithm 2: GNN-based Relational Reasoning
Input: Context-enriched feature matrix V_fused ∈ R^(N×D_v);
       EGAT model with L layers;
       Edge classifier MLP_edge.
Output: Edge relationship predictions Y_edge.
  • // Initialize graph node features with the fused matrix for layer 0.
  • H^(0) ← V_fused
  • // Perform iterative message passing for L layers.
  • for l = 0 to L − 1 do
  •     // Refine node features by applying the l-th EGAT layer.
  •     H^(l+1) ← EGAT-Layer_l(H^(l))
  • end for
  • // Predict the relationship for each pair of nodes using the final embeddings H^(L).
  • for each pair of nodes (i, j) do
  •     // Create an edge representation by concatenating the final node embeddings.
  •     e_ij ← concat(H^(L)[i], H^(L)[j])
  •     // Classify the edge representation to obtain the relationship prediction.
  •     Y_edge[i, j] ← MLP_edge(e_ij)
  • end for
  • // Return the matrix of all predicted relationships.
  • return Y_edge
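The sketch below illustrates this reasoning loop in PyTorch. The per-layer update is a deliberately simplified mean-aggregation stand-in for the actual EGAT layers, and all dimensions and relation-class counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """Simplified message-passing layer (mean aggregation over a fully connected graph);
    a stand-in for an edge-gated EGAT layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        msg = h.mean(dim=0, keepdim=True).expand_as(h)        # every node receives the mean of all nodes
        return torch.relu(self.update(torch.cat([h, msg], dim=-1)))

class RelationalReasoner(nn.Module):
    def __init__(self, dim: int = 256, num_layers: int = 3, num_relations: int = 5):
        super().__init__()
        self.layers = nn.ModuleList(SimpleGNNLayer(dim) for _ in range(num_layers))
        self.mlp_edge = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_relations))

    def forward(self, v_fused: torch.Tensor) -> torch.Tensor:
        h = v_fused                                            # H^(0) <- V_fused, shape (N, dim)
        for layer in self.layers:                              # L rounds of message passing
            h = layer(h)
        n = h.size(0)
        # Build all pairwise edge representations by concatenating the final node embeddings.
        e = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                       h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        return self.mlp_edge(e)                                # Y_edge: (N, N, num_relations) logits

logits = RelationalReasoner()(torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 4, 5])
```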

4. Experiments and Analysis

This section presents the empirical evaluation of our model. We begin by describing the experimental setup, including the datasets, evaluation metrics, and implementation details. We then report the main quantitative results, comparing our model against key baselines on both the primary parsing task and downstream applications. Finally, we provide in-depth efficiency, ablation, qualitative, and error analyses to offer a comprehensive understanding of the model’s performance and limitations.

4.1. Datasets, Metrics, and Experimental Setup

Dataset Details: To comprehensively evaluate model performance, we utilized two widely adopted benchmark datasets: PGDP5K [6] and a re-annotated version of IMP-Geometry3K [1]. We acknowledge that focusing on these two datasets is a limitation and that evaluation on a broader range of benchmarks would further strengthen our conclusions. PGDP5K contains approximately 5000 geometry problems with the following characteristics: (1) Primitive and Relation Distribution: on average, each problem contains 15 geometric primitives and 25 relations. Purely visual relations (Geo2Geo) account for about 60%, while text-dependent relations (Text2Geo, Sym2Geo) constitute a combined 35%, highlighting the necessity of multimodal understanding. (2) Sample Difficulty Stratification: we divided the test set into "easy," "medium," and "hard" tiers based on the number of primitives and the complexity of textual constraints. Approximately 20% of the "hard" samples feature high diagram-text ambiguity, making them critical for testing model robustness. (3) Label Quality: although the dataset has undergone multiple rounds of annotation, we estimate that about 2–3% of labels contain noise or represent boundary cases; this inherent noise challenges the model's generalization capabilities and may affect the reliability of the results, a factor we consider in our analysis.
Evaluation Metrics: We strictly adhere to the evaluation metrics defined in the PGDPNet paper [6], primarily using the F1-score (broken down by relation type) for relation parsing and the Full Relation Accuracy (FRA). FRA measures the model’s ability to correctly parse all relations in a problem, serving as the gold standard for assessing overall performance.
Experimental Setup: All models were implemented in PyTorch 1.12.1. The text encoding module used the distilbert-base-uncased model from the Hugging Face Transformers library. We compared our model against two strong baselines: the vision-only PGDPNet [6] and Inter-GPS [14], which relies on rule-based text parsing combined with symbolic reasoning. For training, we used the AdamW optimizer with an initial learning rate of 1 × 10^-4, decayed with a cosine annealing schedule. Models were trained for a total of 100 epochs with a batch size of 16. All experiments were conducted on a single NVIDIA A6000 GPU.
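The optimizer and schedule described above can be configured as follows. This is a minimal sketch mirroring the stated hyperparameters; the placeholder network stands in for the full dual-stream model, which is not defined here.

```python
import torch

# Hypothetical setup mirroring the stated hyperparameters: AdamW, lr = 1e-4,
# cosine annealing over 100 epochs, batch size 16.
model = torch.nn.Linear(256, 5)  # placeholder for the full dual-stream network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... iterate over mini-batches of size 16, compute the loss, call loss.backward() and optimizer.step() ...
    scheduler.step()  # decay the learning rate once per epoch
```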

4.2. Main Results and Comparison

As presented in Table 1, our method establishes a new state of the art, achieving a Full Relation Accuracy (FRA) of 84.8%, a significant 1.6 percentage point improvement over the strong PGDPNet baseline. A detailed breakdown reveals that the most substantial gains come from the text-dependent categories: Text2Geo F1 (+0.5%), Sym2Geo F1 (+0.6%), and Text2Head F1 (+0.7%). This directly validates our core hypothesis that deep semantic fusion is critical for resolving textual constraints. We also considered comparing against general-purpose vision-language models such as UNITER, BLIP, or Flamingo. However, directly adapting these large-scale models, which are pre-trained for tasks like VQA or image captioning on natural images, to the highly structured and symbolic domain of geometric diagrams presents significant challenges: their object detectors and attention mechanisms are not optimized for identifying fine-grained geometric primitives such as points and lines. We therefore focused our quantitative comparison on domain-specific models like PGDPNet and Inter-GPS, which represent the true state of the art for this specific task. Our method's superiority over these specialized baselines more convincingly demonstrates its contribution to the field of geometric problem solving.
Crucially, this enhanced parsing capability translates into tangible benefits for downstream applications, as shown in Table 2. The 1.6% FRA improvement leads to a remarkable 4.1 percentage point increase in Proposition Generation accuracy and a 2.6 percentage point gain in final Problem Solving accuracy (from 72.5% to 75.1%). This demonstrates a clear causal link: superior upstream parsing directly enables more effective downstream reasoning.

4.3. Efficiency and Ablation Analysis

We first analyze the performance-cost trade-off of our model, with results detailed in Table 3. Our primary configuration, which uses DistilBERT, achieves a +1.6% accuracy gain (84.8% FRA) over the baseline while maintaining a reasonable parameter count of 78.6 M. To assess the performance ceiling, we also tested a larger BERT-base encoder. While this configuration yielded a marginal further improvement of +0.2%, it came at the cost of a substantial increase in parameters to 122.1 M. This comparison validates our choice of DistilBERT, confirming that our proposed model strikes an excellent balance between state-of-the-art performance and computational efficiency. The +534% parameter increase of the DistilBERT configuration over the baseline is primarily due to the integration of the DistilBERT encoder (66 M parameters). In practical terms, this translates into increased resource requirements: GPU memory consumption during training rose from approximately 4.5 GB for the baseline to 9.8 GB for our model, and training time per epoch increased from 15 min to 35 min on our hardware. While this represents a significant overhead, we argue it is a justified trade-off for the notable 1.6% gain in FRA and, more importantly, the 4.1% improvement in the downstream proposition generation task, which highlights the practical value of the enhanced parsing accuracy.
Having established the overall effectiveness and efficiency of our model, we then conducted a series of ablation studies to dissect the architectural sources of this performance gain, as shown in Table 4. The study confirms the value of our core contributions. Starting from the PGDPNet baseline (83.2% FRA), we found that individually integrating our Semantic-Guided Cross-Attention (SGCA) or our Edge-enhanced GNN (EGAT) yielded significant improvements of +1.1% and +1.0%, respectively, both substantially outperforming the simple text fusion strategy (+0.6%). Crucially, when these two components are combined in our full model, they demonstrate a clear synergistic effect, culminating in the final +1.6% performance improvement and validating our integrated architectural design.
Furthermore, Figure 2 displays the loss function convergence curves for our model and the PGDPNet baseline during training. It is evident that our model not only converges faster but also exhibits a more stable convergence state in the later stages of training, indicating superior learning capacity and generalization performance.

4.4. Qualitative and Robustness Analysis

To gain deeper insight into our model’s performance on key challenges, we conducted a series of qualitative and robustness analyses. A typical case of diagram-text ambiguity is shown in Figure 3, where the problem requires reasoning based on a textual description that contradicts the visual representation. The vision-only baseline, unable to comprehend the textual constraint, incorrectly identifies visually similar angles as the solution. In contrast, our model accurately fuses the textual semantics, performs correct logical reasoning, and arrives at the right answer. This case provides a vivid demonstration of our method’s efficacy in resolving ambiguities that are contingent upon textual information.
To systematically evaluate this capability, we categorized such challenges into two main types: implicit relation ambiguity (text states a relation not visually obvious) and attribute specification ambiguity (text assigns precise attributes to visually vague primitives). In targeted tests on these problem types, our model showed dramatic improvement, correctly identifying over 90% of these text-dependent relations, which were consistently missed by the baseline.
Beyond resolving inherent ambiguities, we also assessed the model’s adaptability to real-world imperfect inputs through two controlled experiments. First, in a textual noise injection test (including typos, synonym substitutions, etc.), our model’s FRA dropped by only ~4%, whereas the rule-based Inter-GPS system’s performance degraded by over 15%, showcasing strong robustness to linguistic variations. Second, in a diagram information omission test (randomly erasing one primitive), the vision-only PGDPNet failed completely due to its inability to find the corresponding entity. Our model, guided by the global text, was only slightly affected (FRA drop of ~2%). Taken together, these analyses provide compelling evidence that our semantic-guided fusion strategy not only effectively resolves complex diagram-text ambiguities but also remains robust in the face of imperfect inputs.
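As an illustration of the kind of perturbation used in the textual noise test, the sketch below injects character-swap typos at a given rate. It is a hypothetical example, not the exact protocol used in our experiments.

```python
import random

def inject_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly swap adjacent letters to simulate typos at the given rate (illustrative only)."""
    random.seed(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(inject_typos("Line AB is perpendicular to line CD at point E."))
```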

4.5. Error Analysis

Despite the significant performance gains, a detailed analysis of failure cases reveals several remaining challenges and provides valuable insights for future improvements. We identified three primary categories of errors where our model still struggles:
  • Complex Logical Chains in Text: The model is highly effective when textual constraints are stated directly (e.g., "AB is perpendicular to CD"). However, it often falters when a key relationship is implied through a multi-step logical chain within the text. For example, if the text states, "Let D be the midpoint of BC. The line perpendicular to BC at D intersects AB at E," our model may struggle to synthesize these two sentences to infer that line segment ED is perpendicular to BC. The global sentence embedding, while capturing the overall context, can lose the fine-grained, sequential nature of such constructions. Future work could address this by incorporating a more structured reasoning module, such as a graph-based parser for the text itself, or by fine-tuning sequence-to-sequence models to explicitly resolve these multi-step dependencies before fusion.
  • Visual Grounding Ambiguity in Dense Diagrams: In cluttered diagrams with many similar primitives, the model can fail to link a textual description to the precise visual element, which suggests our fusion mechanism may not be fine-grained enough for highly complex scenarios. For instance, if multiple lines intersect at nearly the same point, and the text specifies a property for one of the resulting angles (e.g., "∠AEB = 90°"), our semantic guidance might correctly focus on the region around point E but still fail to disambiguate between ∠AEB and another visually similar angle, such as ∠CED. This highlights the need for even more fine-grained vision-language alignment. A potential solution is to move from a single global text vector to token-level attention, allowing specific textual phrases (e.g., "∠AEB") to attend to more localized visual regions, thereby improving grounding precision.
  • Gaps in Domain-Specific Terminology: The general-purpose pre-trained language model (DistilBERT) sometimes lacks a deep, axiomatic understanding of specialized geometric terms. For instance, while it may have learned to associate the phrase "angle bisector" with equal angles from training data, it may not inherently know that an "altitude" implies a perpendicular relationship unless the word "perpendicular" is explicitly mentioned alongside it. This limitation suggests that the model's knowledge is more correlational than causal, indicating a potential failure point on problems requiring deep, intrinsic mathematical knowledge. This could be mitigated by pre-training or fine-tuning the language model on a large corpus of mathematical and geometric texts (e.g., textbooks, research papers), enabling it to build a more robust, axiomatic understanding of domain-specific concepts.
  • Scalability on Complex Diagrams: The model’s performance was primarily evaluated on datasets with a moderate number of primitives. Its scalability and robustness on extremely complex or dense diagrams, where visual noise and primitive overlap are significantly higher, have not been fully benchmarked. In such scenarios, the global semantic vector might lack the specificity needed to resolve fine-grained local ambiguities, posing a challenge for both the SGCA mechanism and the subsequent GNN reasoning.
These failure cases underscore that while our semantic-guided approach is a major step forward, the path to a fully robust geometric solver requires advancements in multi-step textual reasoning, more precise grounding mechanisms, and deeper integration of domain-specific knowledge.

5. Conclusions and Future Work

This research confronts the critical challenge of the semantic gap in Automated Geometric Problem Solving. By designing a novel dual-stream architecture centered on a Semantic-Guided Cross-Attention (SGCA) mechanism, we have demonstrated that leveraging global textual semantics to guide local visual parsing is a highly effective strategy. Our model establishes a new paradigm where reasoning occurs over a semantically-enriched graph, leading to state-of-the-art performance in relation parsing and downstream problem-solving. The core contribution is the validation that future intelligent geometry solvers must evolve into ‘dual experts,’ proficient in both visual and linguistic reasoning, moving beyond the performance ceiling of single-modality systems.
The principle we propose, leveraging global textual semantics to guide local visual parsing, not only provides an effective framework for building such collaborative multimodal reasoning systems but also holds potential for transfer to other domains. Our approach is particularly adept at handling various forms of geometric symmetry, such as equality constraints stated in text or implied by symbols, as a key aspect of this collaborative reasoning. The same principle is transferable to other STEM fields. In physics, for example, textual context can clarify forces acting in a diagram. In engineering, technical documentation can help assign correct states to symbols in complex drawings (e.g., P&IDs), which is critical for automated verification. Similarly, in flowchart parsing, textual descriptions could resolve structurally ambiguous conditional branches.
Despite its notable success, this study highlights several promising directions for future research. At the methodological level, the current general-purpose language model (DistilBERT) has limitations in understanding specialized terms like “foot of a perpendicular,” and the diagram-text fusion mechanism has room for further optimization. Future work could focus on developing or fine-tuning domain-specific language models, such as those pre-trained on mathematical or geometric corpora [36], to better comprehend specialized terminology (e.g., ‘altitude,’ ‘bisector’) axiomatically rather than correlationally. Additionally, exploring more sophisticated bidirectional or iterative fusion strategies could achieve deeper cross-modal interaction. At the application and ecosystem level, we advocate for the creation of next-generation benchmark datasets to drive continued progress. These datasets should move beyond existing scopes to include problems requiring multi-step textual logical reasoning, more diverse visual–textual ambiguities, and even external knowledge [37,38]. Furthermore, testing on larger and noisier datasets will be crucial for evaluating model scalability and robustness. Such efforts will propel models from the current stage of ‘diagram-text parsing’ toward a higher level of ‘multimodal knowledge reasoning,’ laying the foundation for a truly universal automated geometric problem solver.

Author Contributions

Conceptualization, P.J. and B.M.; methodology, P.J., X.Z. and B.M.; software, X.Z.; validation, X.Z. and P.J.; formal analysis, X.Z.; investigation, X.Z. and L.W.; resources, P.J. and W.H.; data curation, X.Z., L.W. and W.H.; writing—original draft preparation, X.Z.; writing—review and editing, P.J.; visualization, X.Z.; supervision, P.J.; project administration, P.J. and W.H.; funding acquisition, P.J. and B.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (No. 62107014), the Youth Talent Support Project of Henan Province (2023HYTP046).

Data Availability Statement

The data presented in this study are available within the article. The datasets analyzed during the current study (PGDP5K and IMP-Geometry3K) are publicly available resources and are cited accordingly in the manuscript.

Conflicts of Interest

Author Wangyang Hong was employed by the company Zhongshui Culture Technology (Zhengzhou) Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Chen, J.; Tang, J.; Qin, J.; Liang, X.; Liu, L.; Xing, E.; Lin, L. GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 4887–4898. [Google Scholar]
  2. Seo, M.J.; Hajishirzi, H.; Farhadi, A.; Etzioni, O.; Malcolm, C. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1456–1466. [Google Scholar]
  3. Kiryati, N.; Eldar, Y.; Bruckstein, A.M. A probabilistic Hough transform. Pattern Recognit. 1991, 24, 303–316. [Google Scholar] [CrossRef]
  4. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  5. von Gioi, R.G.; Jakubowicz, J.; Morel, J.-M.; Randall, G. LSD: A fast line segment detector with a false detection control. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 722–732. [Google Scholar] [CrossRef] [PubMed]
  6. Zhang, M.L.; Yin, F.; Hao, Y.H.; Liu, C.L. Plane geometry diagram parsing. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, Vienna, Austria, 23–29 July 2022; pp. 1636–1643. [Google Scholar]
  7. Jian, P.; Guo, F.; Pan, C.; Wang, Y.; Yang, Y.; Li, Y. Interpretable Geometry Problem Solving Using Improved RetinaNet and Graph Convolutional Network. Electronics 2023, 12, 4578. [Google Scholar] [CrossRef]
  8. Yu, X.; Cheng, W.; Yang, C.; Zhang, T. A theoretical review on solving algebra problems. Expert Syst. Appl. 2026, 296, 128789. [Google Scholar] [CrossRef]
  9. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  10. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
  11. Xu, D.; Zhu, Y.; Choy, C.B.; Fei-Fei, L. Scene graph generation by iterative message passing. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5410–5419. [Google Scholar]
  12. Wu, J.; Zhao, Z.; Sun, Y. Uni-Geo: A unified generative framework for geometric problem solving. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; pp. 8468–8482. [Google Scholar]
  13. Huang, L.; Yu, X.; Niu, L.; Feng, Z. Solving Algebraic Problems with Geometry Diagrams Using Syntax-Semantics Diagram Understanding. Comput. Mater. Contin. 2023, 77, 343–360. [Google Scholar] [CrossRef]
  14. Lu, P.; Gong, R.; Jiang, S.; Qiu, L.; Huang, S.; Liang, X.; Zhu, S.-C. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; pp. 5946–5958. [Google Scholar]
  15. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  16. Peng, R.; Lyu, X.; Yu, X. Arithmetic Problem Solver Based on BERT Model and Mathematical Cognitive Pattern. In Proceedings of the 2021 IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE), Wuhan, China, 8–11 December 2021; pp. 834–837. [Google Scholar] [CrossRef]
  17. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
  18. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  19. He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv 2020, arXiv:2006.03654. [Google Scholar]
  20. Zhang, Y.; Zhou, G.; Xie, Z.; Ma, J.; Huang, J.X. A diversity-enhanced knowledge distillation model for practical math word problem solving. Inf. Process. Manag. 2025, 62, 104059. [Google Scholar] [CrossRef]
  21. He, B.; Yu, X.; Huang, L.; Meng, H.; Liang, G.; Chen, S. Comparative study of typical neural solvers in solving math word problems. Complex Intell. Syst. 2024, 10, 1454. [Google Scholar] [CrossRef]
  22. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. VQA: Visual question answering. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2425–2433. [Google Scholar]
  23. Tan, H.; Bansal, M. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 5100–5111. [Google Scholar]
  24. Chen, Y.C.; Li, L.; Yu, L.; Kholy, A.E.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: Universal image-text representation learning. In Proceedings of the Computer Vision—ECCV 2020: Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 104–120. [Google Scholar]
  25. Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining for vision-and-language tasks. In Proceedings of the Advances in Neural Information Processing Systems 32, Vancouver, BC, Canada, 8–14 December 2019; pp. 1–12. [Google Scholar]
  26. Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv 2019, arXiv:1908.08530. [Google Scholar]
  27. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  28. Jian, P.; Guo, F.; Wang, Y.; Li, Y. Solving Geometry Problems via Feature Learning and Contrastive Learning of Multi-Modal Data. Comput. Model. Eng. Sci. 2023, 136, 1707–1728. [Google Scholar]
  29. Ma, B.; Jian, P.; Pan, C.; Wang, Y.; Ma, W. A geometric neural solving method based on a diagram text information fusion analysis. Sci. Rep. 2024, 14, 3906. [Google Scholar] [CrossRef] [PubMed]
  30. Huang, L.; Yu, X.; Xiong, F.; He, B.; Tang, S.; Fu, J. Hologram reasoning for solving algebra problems with geometry diagrams. arXiv 2024, arXiv:2408.10592. [Google Scholar] [CrossRef]
  31. Ming, H.; Yu, X.; Cheng, X.; Shen, Z.; Lyu, X. A Compact Model for Mathematics Problem Representations Distilled from BERT. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 24867–24875. [Google Scholar] [CrossRef]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  33. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  34. Yu, F.; Tang, L.; Zhang, Q. Geometry-aware graph neural networks for molecular property prediction. Nat. Commun. 2022, 13, 838. [Google Scholar]
  35. Zhang, Q.; Weng, X.; Zhou, G.; Zhang, Y.; Huang, J.X. ARL: An adaptive reinforcement learning framework for complex question answering over knowledge base. Inf. Process. Manag. 2022, 59, 102933. [Google Scholar] [CrossRef]
  36. Peng, S.; Yuan, K.; Gao, L.; Tang, Z. Math-BERT: A pre-trained model for mathematical formula understanding. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 175–185. [Google Scholar]
  37. Amini, A.; Gabriel, S.; Lin, S.; Koncel-Kedziorski, R.; Choi, Y.; Hajishirzi, H. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 2957–2967. [Google Scholar]
  38. Tang, K.; Jia, K.; He, J. Look, read and reason: A visual-textual fusion framework for document-level relation extraction. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 1776–1784. [Google Scholar]
Figure 1. The overall architecture of our proposed dual-stream fusion model. The model takes a diagram-text pair as input. The visual pathway (left) uses a CNN-based parser (PGDPNet front-end) to detect geometric primitives (e.g., points, lines) and extract their feature vectors. The textual pathway (right) employs DistilBERT to encode the problem text into a single global semantic vector. Our core Semantic-Guided Cross-Attention (SGCA) mechanism uses this text vector as a query to re-weight the visual features, producing context-enriched representations. These are then fed into a Graph Neural Network (GNN) for relational reasoning, yielding the final structured output.
Figure 2. Training loss convergence curves for our model versus the PGDPNet baseline. The x-axis represents training epochs, and the y-axis represents the total loss. Our model (red line) demonstrates a faster convergence rate and achieves a lower, more stable loss in the later stages of training compared to the baseline (gray line), indicating superior learning efficiency and generalization.
Figure 3. Qualitative analysis of our model’s ability to resolve visual-textual ambiguity. The figure presents a challenging case where two angles, ∠PQR and ∠PQS, are visually similar. The model must use the textual constraint to make the correct identification. (a) The problem setup, which includes the geometric diagram and the textual statement “The measure of ∠PQR is 124°”. (b) The baseline model, relying primarily on visual cues, fails by incorrectly identifying the visually plausible but incorrect angle ∠PQS. (c) Our proposed model successfully grounds the textual constraint, overriding the ambiguous visual information to correctly identify ∠PQR as 124°.
Table 1. Comparison of relationship parsing results on PGDP5K.

Metric               Inter-GPS   LXMERT (Adapted)   PGDPNet (Baseline)   Our Method
Geo2Geo F1 (%)       98.6        98.5               98.8                 98.9
Text2Geo F1 (%)      97.6        97.8               98.0                 98.5
Sym2Geo F1 (%)       96.5        96.8               97.1                 97.7
Text2Head F1 (%)     94.7        95.8               96.5                 97.2
Full Rel. Acc. (%)   81.5        81.8               83.2                 84.8
Table 2. Downstream task performance comparison (PGDP5K).

Metric                 Inter-GPS   LXMERT (Adapted)   PGDPNet (Baseline)   Our Method
Full Rel. Acc. (%)     81.5        81.8               83.2                 84.8
Proposition Gen. (%)   78.5        79.1               80.6                 84.7
Problem Solving (%)    70.3        71.0               72.5                 75.1
Table 3. Model efficiency comparison. The ▲ symbol indicates the percentage increase relative to PGDPNet (Baseline).

Model                       Full Rel. Acc. (%)   ▲ Accuracy (%)   Parameters (M)   ▲ Parameters (%)
PGDPNet (Baseline)          83.2                 (Baseline)       12.4             (Baseline)
Our Method (w/DistilBERT)   84.8                 +1.6%            78.6             +534%
Our Method (w/BERT-base)    85.0                 +1.8%            122.1            +885%
Table 4. Ablation study results (PGDP5K).

Ablation Configuration          Full Rel. Acc. (%)   Improvement
PGDPNet (Baseline)              83.2                 (Baseline)
Baseline + Simple Text Fusion   83.8                 +0.6
Baseline + SGCA                 84.3                 +1.1
Baseline + EGAT                 84.2                 +1.0
Our Full Model                  84.8                 +1.6