Article

Text-Guided Geometric Relation Parsing with Logic Regularization

Information Engineering Institute, North China University of Water Resources and Electric Power, Zhengzhou 450046, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(11), 2460; https://doi.org/10.3390/electronics15112460
Submission received: 14 May 2026 / Revised: 2 June 2026 / Accepted: 2 June 2026 / Published: 4 June 2026

Abstract

Geometric relation parsing is a prerequisite for automated geometry problem solving, especially when diagram interpretation depends jointly on visual appearance and textual conditions. In this study, we examine a text-conditioned parsing setting derived from PGDP5K and propose a lightweight parser with atomic cue extraction, iterative visual–semantic feedback, and differentiable logic regularization. Because the active high-level labels are derived through a rule-based weak-supervision protocol, the results should be interpreted as parser-level evidence under Ext-PGDP5K rather than proof of general geometric semantic understanding. The nominal label space contains five candidate relations, while the current evaluation focuses on four active relations with positive instances: Intersect, Parallel, Perpendicular, and Bisect. Compared with text-only, image-only, global-fusion, and shuffled-text controls, the proposed parser improves Edge-F1 and Macro-F1, with the clearest gains for Parallel and Perpendicular. Ablations show that the atomic probe is the main source of improvement, while logic regularization and feedback exhibit non-monotonic interactions. Although limited by weak labels, lexical cues, and the absence of downstream solver validation, this study provides a reproducible protocol-aligned testbed for analyzing text-conditioned relation prediction and low-order logic regularization in geometric diagram parsing.

1. Introduction

1.1. Background and Challenges

Automated Geometric Problem Solving (AGP) is a representative benchmark for multimodal mathematical intelligence because it requires the joint use of visual perception, language processing, structural abstraction, and symbolic reasoning [1]. Compared with ordinary visual question answering, geometry problem solving is unusually sensitive to small perceptual mistakes: a minor error in detecting whether two lines are parallel or whether a line bisects an angle can invalidate an entire reasoning chain. For this reason, the front-end parsing stage is not a minor preprocessing step but a bottleneck that often determines whether downstream reasoning can succeed at all. Within the AGP pipeline, geometric diagram parsing converts an image–text problem instance into a structured representation that downstream models can operate on [2]. This representation must capture both visual primitives and their semantic relations. In practice, however, visual appearance alone is often insufficient. Lines that appear nearly parallel may in fact intersect; an otherwise ordinary segment may be explicitly defined in the text as an angle bisector; and auxiliary constructions may exist only implicitly in the written description. These characteristics make geometry a particularly challenging multimodal domain.

1.2. Current Research Status and Limitations

Existing work can be divided into parser-oriented studies and solver-oriented studies. Parser-oriented models, exemplified by PGDPNet, focus on detecting primitives and predicting structured diagram relations [3]. Although modern parser-oriented models have made clear progress beyond hand-crafted rule systems, three limitations remain. First, most methods are still dominated by local visual evidence and therefore make limited use of explicit textual constraints. Second, text is often injected only as a global context vector rather than grounded in relation-specific decisions. Third, predicted graphs are rarely regularized by geometric axioms during training, which leaves the model free to produce structurally inconsistent outputs. These limitations motivate us to reinterpret geometric relation parsing as a text-conditioned, logic-aware structured prediction problem.

1.3. Motivation and Approach

Textual conditions are useful because they often specify relations that are visually ambiguous or only weakly indicated in the diagram. Against this background, our design separates three roles: textual atoms provide explicit but potentially biased semantic cues, visual relation reasoning grounds these cues to diagram primitives, and logic regularization encourages low-order structural consistency in the predicted graph. This intuition is illustrated in Figure 1.

1.4. Proposed Method and Contributions

Building on this motivation, we study a text-conditioned parser-level relation prediction setting that combines weak atomic cue extraction, iterative visual–semantic feedback, and differentiable logic consistency regularization. The main contributions of this study are threefold: first, we formulate a derived Ext-PGDP5K parser-level protocol for text-conditioned geometric relation prediction; second, we design a lightweight parser that combines atomic lexical cue extraction with visual–semantic feedback; third, we incorporate low-order differentiable logic regularizers to analyze local structural consistency. Although this setting is deliberately narrower than end-to-end geometry problem solving, it is useful because it isolates relation-level errors that may be hidden in solver-level evaluation and provides a controlled protocol for studying how paired text and low-order logic constraints affect geometric relation prediction. The task/protocol formulation is new to the present study, whereas the multimodal fusion and logic-regularization components should be understood as adaptations of existing multimodal and neuro-symbolic learning ideas to this parser-level relation-prediction setting. It is important to note that the present study is not intended to replace solver-level geometry problem solving benchmarks. Instead, it focuses on a parser-level setting in which the goal is to improve the prediction of high-level geometric relations over candidate primitives. This scope allows us to isolate the effect of text-conditioned cues and low-order logic regularization before connecting the parsed relation graph to downstream theorem-guided solvers. Accordingly, the reported results should be interpreted as parser-level evidence under the derived Ext-PGDP5K protocol. We do not claim solver-level superiority over geometry problem solving systems that optimize final numerical answers, formal proofs, or complete solution programs. 
Instead, the goal is to examine whether text-conditioned cues and low-order logic regularization improve relation-level graph prediction over a shared candidate primitive setting. The strongest current evidence is concentrated in text-sensitive line–line relations, especially Parallel and Perpendicular; therefore, the results should not be interpreted as broadly solving all geometric relation categories.

2. Related Work

2.1. Geometric Diagram Parsing and Automated Problem Solving

Research on geometry understanding has progressed along two lines: diagram parsing (detection and inference of relations) and full problem solving (symbolic reasoning and proof programs) [4]. Robust parsers must resolve ambiguous cues, preserve relational structure, and expose outputs that remain compatible with formal reasoning. FGeo-Parser also emphasizes the role of autoformalization in plane geometric problem solving [5]. For solver-oriented systems, the final objective is usually the numerical answer, proof trace, or formalized solution program. Recent solver-oriented systems have demonstrated that formal reasoning and learned search can solve challenging geometry problems when reliable symbolic representations are available [6]. Inter-GPS further shows that formal language and symbolic reasoning can make geometry problem solving more interpretable [7]. LANS introduces layout-aware neural solving for plane geometry problems, highlighting the importance of diagram layout in solver-oriented pipelines [8]. In contrast, parser-oriented studies evaluate whether the diagram and text have been converted into a reliable intermediate representation. This distinction is important for the present work because relation-level errors may be hidden in an end-to-end solver if the final answer happens to remain correct. We therefore evaluate Edge-F1, Macro-F1, and logic-violation behavior directly at the relation-parsing stage. Other recent work has explored solving geometry problems with parsed clauses extracted from diagrams [9]. AutoGPS represents another solver-oriented direction that combines multimodal formalization with deductive reasoning [10]. Diagram formalization has also been used to enhance multimodal geometry problem solving systems [11]. Recent work on Euclidean geometry formalization further indicates that structured intermediate representations are important for reliable theorem-level reasoning [12]. 
This parser-level perspective also determines the choice of baselines. Existing solver-oriented systems usually optimize different outputs, such as formalized clauses, theorem-guided proof traces, or final numerical answers. Directly comparing such systems with a relation parser would confound relation prediction with downstream solving. Therefore, this study uses protocol-aligned controlled baselines, including text-only, image-only, global-fusion, shuffled-text, and component-level ablations, so that all methods are evaluated under the same candidate primitive and relation-label setting. Although stronger solver-oriented systems are discussed above, they are not used as direct main baselines because their outputs are final answers, formal clauses, or proof traces rather than relation-level edge predictions. A direct comparison would therefore mix parser quality with downstream solving ability. In this work, the main comparison is restricted to protocol-aligned parser variants under the same candidate primitive pairs and active relation labels. The controlled baselines used in this study are intended to isolate the effects of modality, paired text, atomic cue extraction, feedback, and logic regularization under a shared candidate-edge protocol. They do not establish competitiveness with stronger external parser-oriented systems. Re-implementing PGDPNet-style parser baselines or connecting the parser to solver-level systems under a shared evaluation interface remains an important direction for future comparison.

2.2. Multimodal Fusion in the Geometric Domain

Multimodal fusion is delicate in geometric diagrams because the visual domain is sparse and precision-sensitive. IconQA further shows that abstract diagrams require visual–language reasoning beyond ordinary image understanding [13]. GeoQA illustrates that geometry question answering requires coordinated reasoning over textual conditions, diagrams, and numerical structures [14]. A recurring weakness of many fusion strategies is that text is used only globally. Vision–language pretraining methods such as LXMERT demonstrate the effectiveness of cross-modal representation learning in general multimodal tasks [15]. mPLUG improves vision–language learning through cross-modal skip connections, but such general fusion designs do not directly address relation-specific geometric grounding [16]. Our atomic semantic probe is motivated by the observation that identifying relation-relevant textual atoms can guide cross-modal attention more effectively than a single global vector. However, text-conditioned modeling also introduces the risk of language-prior dependence [17]. If a relation label can be partially inferred from frequently occurring phrases or annotation artifacts, a model may appear multimodal while relying only weakly on diagram grounding. Shortcut learning can lead models to rely on superficial correlations rather than robust multimodal grounding [18]. For this reason, modality-control experiments are necessary. In addition to text-only and image-only baselines, the shuffled-text setting used in this study tests whether the parser benefits from correctly paired textual cues rather than arbitrary textual priors or dataset-level language shortcuts.

2.3. Neuro-Symbolic Learning and Logic Regularization

Neuro-symbolic learning seeks to combine neural flexibility with symbolic rigor [19]. Semantic loss further formalizes how symbolic knowledge can be encoded as differentiable learning signals [20]. Our work follows an objective-level path, encoding symmetry, transitivity, and mutual exclusivity as soft constraints so that the parser is penalized when its predicted relation graph drifts away from structurally admissible configurations. Logic rules have been used to guide neural models through regularization-style learning objectives [21]. Semantic-based regularization provides a related framework for incorporating symbolic constraints into learning objectives [22]. The logic component in our model should be understood as soft regularization rather than complete symbolic theorem proving. DL2 also explores training and querying neural networks with logical constraints [23]. Probabilistic soft logic provides another example of using continuous relaxations for rule-based reasoning [24]. Symmetry, transitivity, and mutual exclusivity provide local structural biases during training, but they do not guarantee a globally consistent geometric proof. This distinction is important because the empirical results later show a trade-off between predicting more positive relations and increasing rule-defined conflicts.
Neural Logic Machines study how relational and logical patterns can be learned with neural architectures [25]. DeepProbLog combines neural perception with probabilistic logic programming [26]. Logic Tensor Networks connect logical reasoning with tensor-based neural learning [27]. Thus, the role of logic in this work is diagnostic and regularizing rather than proof-complete: it provides measurable local conflict signals through LVR, but it does not certify global geometric validity.

3. Materials and Methods

3.1. Overall Architecture

Task definition. Given an input pair $(I, T)$, where $I$ is the diagram image and $T$ is the aligned problem text, our goal is to predict a relation graph over detected primitives. The nominal label space is $R_{\mathrm{nom}} = \{\text{Intersect}, \text{Tangent}, \text{Parallel}, \text{Perpendicular}, \text{Bisect}\}$. In the current derived split, Tangent has no positive instances, so the active evaluation set is $R_{\mathrm{act}} = \{\text{Intersect}, \text{Parallel}, \text{Perpendicular}, \text{Bisect}\}$. Let $E$ denote the candidate edge set and let $P \in [0,1]^{N \times N \times |R_{\mathrm{act}}|}$ denote the predicted relation tensor, where $P_{ij}^{r}$ is the probability that relation $r$ holds between primitives $i$ and $j$.
The Ext-PGDP5K protocol is derived from the official PGDP5K split by constructing candidate primitive pairs and assigning high-level relation labels through rule-based parsing of diagram primitives and aligned textual cues. This rule-based label construction is related to weak supervision, in which labeling functions are used to generate training signals at scale [28]. Therefore, the model is trained on weak labels generated by the derived protocol rather than independently verified human semantic annotations. The derived labels, split identifiers, and label-derivation scripts are released with the accompanying project materials. Because textual cues are used both in label derivation and as model input, potential text-prior leakage is evaluated through text-only and shuffled-text controls and further discussed as a validity limitation in Section 4.6.
The overall framework contains three tightly coupled stages: multimodal atomic perception, iterative visual–semantic feedback fusion, and logic-regularized graph reasoning. Figure 2 provides a high-level overview, and Algorithm 1 summarizes the training and inference pipeline. The central idea is to expose explicit semantic cues first, use them to guide visual relation reasoning, and then regularize the resulting graph with geometric structure.
Algorithm 1. Training and inference pipeline of the proposed parser
Input: training set $D$, active relation set $R_{\mathrm{act}}$, atomic rule set $A$, model parameters $\theta$
Output: trained parser $F_{\theta}$ and predicted relation graph $\hat{G}$
Training phase:
1. For each sample $(I, T)$ in $D$, detect primitives and extract node features $V$.
2. Construct candidate edge set $E$ and generate weak atomic labels $z$ from $T$ using $A$.
3. Encode $T$ with DistilBERT to obtain token features $H$ and atomic probabilities $a$.
4. Project $a$ to the initial semantic query $q^{(0)}$ and apply semantic-guided cross-attention over $V$.
5. Run graph reasoning to obtain relation logits $S^{(1)}$ and probabilities $P^{(1)}$.
6. Compute feedback $f^{(1)} = \mathrm{MLP}_{fb}(\mathrm{Pool}(S^{(1)}))$ and update $q^{(1)} = \mathrm{LayerNorm}(q^{(0)} + f^{(1)})$.
7. Re-apply cross-attention and graph reasoning to obtain refined logits $S^{(2)}$ and probabilities $P^{(2)}$.
8. Compute $L_{\mathrm{sup}}$, $L_{\mathrm{atom}}$, $L_{\mathrm{sym}}$, $L_{\mathrm{trans}}$, and $L_{\mathrm{mutex}}$, then update $\theta$ with AdamW.
Inference phase:
9. Given a test pair $(I, T)$, detect primitives, encode text, and obtain $q^{(0)}$.
10. Perform two rounds of semantic-guided attention and graph reasoning.
11. Threshold the final relation probabilities to produce $\hat{G}$.

3.2. Multimodal Atomic Perception

  • Visual Stream: We adopt ResNet-50 as the visual backbone [29]. An FPN detector is used to support multi-scale primitive perception [30]. In the present study, relation prediction is evaluated on the derived candidate primitives provided by the PGDP5K-based preprocessing pipeline, as recorded in the Ext-PGDP5K split files.
  • Candidate Edges: All valid candidate primitive pairs recorded in the Ext-PGDP5K split files are retained for relation prediction, and candidate edges are constructed from the preprocessed primitives described above. We do not perform additional negative sampling; candidate edge–relation entries without derived positive labels are treated as negative instances in the masked binary relation loss. This setting leads to severe class imbalance, so FRA is reported only as an auxiliary metric, while Edge-F1 and Macro-F1 are emphasized for relation-level evaluation. Because the reported metrics are computed on the resulting candidate primitive pairs, the evaluation isolates relation-level prediction errors from primitive-detection errors.
  • Text Stream (Explicit Atomic Semantic Probe): Instead of encoding the full problem statement into a single undifferentiated sentence vector, we explicitly model a compact set of geometric atoms. The seed vocabulary contains six cue families centered on parallel, perpendicular, tangent, bisector, angle-bisector, and intersection semantics. The atomic weak labels are generated by matching normalized cue expressions to these predefined cue families, a procedure closer to lexical cue extraction than to full natural-language semantic parsing. The vocabulary includes textual forms, symbolic forms, and common variants; for example, “parallel”, “//”, and “||” are all normalized to the Parallel atom. The cue vocabulary and normalization rules are available in the project repository. This design makes the text stream efficient and interpretable, but it may be brittle to paraphrases, implicit relation descriptions, and syntactic forms not covered by the seed vocabulary or normalization rules.
The DistilBERT encoder is kept frozen in all experiments [31]. Only the task-specific projection, attention, feedback, and edge-classification modules are trained. This setting keeps the trainable parameter count low and ensures that the reported parameter numbers reflect the parser components rather than the pretrained language encoder.
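The lexical cue matching described for the text stream can be illustrated with a minimal sketch. Only the Parallel variants (“parallel”, “//”, “||”) are quoted in the text; every other vocabulary entry, the family names, and the function names below are hypothetical placeholders, and the released cue vocabulary may differ.

```python
import re

# Hypothetical cue vocabulary. Only the Parallel variants come from the text;
# all other entries are illustrative assumptions.
CUE_FAMILIES = {
    "parallel": ["parallel", "//", "||"],
    "perpendicular": ["perpendicular"],
    "tangent": ["tangent"],
    "bisector": ["bisect"],
    "angle_bisector": ["angle bisector"],
    "intersection": ["intersect"],
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching is case-insensitive."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def atomic_weak_labels(text: str) -> dict[str, int]:
    """Return a 0/1 weak label per cue family using plain substring matching."""
    norm = normalize(text)
    return {
        family: int(any(cue in norm for cue in cues))
        for family, cues in CUE_FAMILIES.items()
    }


labels = atomic_weak_labels("AB // CD, and EF is Perpendicular to GH")
paraphrase = atomic_weak_labels("Lines AB and CD never meet")
```

In this sketch the first sentence activates the parallel and perpendicular families, whereas the paraphrase activates none of them, which is the brittleness to uncovered phrasings noted above.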

3.3. Iterative Visual–Semantic Feedback Fusion

A single forward pass is often insufficient for geometry parsing because the most informative visual evidence may become apparent only after the model forms an initial relational hypothesis. We therefore adopt a two-round iterative feedback mechanism. Two rounds are used as the default setting because the validation-set and test-set sensitivity analyses show that this configuration provides the best trade-off between relation-prediction performance and computational cost. Additional feedback rounds are analyzed in the sensitivity experiment rather than assumed to be monotonically beneficial.
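$$f^{(t)} = \mathrm{MLP}_{fb}\big(\mathrm{Pool}(S^{(t)})\big), \qquad q^{(t)} = \mathrm{LayerNorm}\big(q^{(t-1)} + f^{(t)}\big)$$
Here $t$ indexes the feedback round, $S^{(t)}$ denotes the relation logits produced in round $t$, and $q^{(0)}$ is the initial semantic query; setting $t = 1$ gives the update used in Algorithm 1.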
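A minimal sketch of one feedback update is given below in plain Python. It rests on several assumptions that the text does not confirm: mean pooling over candidate edges stands in for Pool, a single linear layer stands in for the feedback MLP, and the layer normalization has no learnable scale or shift. All function names are hypothetical.

```python
import math


def mean_pool(logits):
    """Average relation logits (num_edges x num_relations) over edges."""
    n = len(logits)
    return [sum(row[r] for row in logits) / n for r in range(len(logits[0]))]


def linear(x, weight, bias):
    """Single linear layer used here as a stand-in for the feedback MLP."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def layer_norm(x, eps=1e-5):
    """Layer normalization without learnable affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def feedback_update(q_prev, logits, weight, bias):
    """Compute f = MLP_fb(Pool(S)) and return LayerNorm(q_prev + f)."""
    f = linear(mean_pool(logits), weight, bias)
    return layer_norm([q + fi for q, fi in zip(q_prev, f)])


q0 = [0.5, -1.0, 2.0]                        # initial semantic query
s1 = [[1.0, -2.0], [3.0, 0.0]]               # round-1 logits for two edges
w = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.0]]    # maps 2 relations to query size 3
q1 = feedback_update(q0, s1, w, [0.0, 0.0, 0.0])
```

The refined query `q1` would then drive the second round of cross-attention and graph reasoning.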

3.4. Logic Consistency Regularization

The following logic terms are used as differentiable soft regularizers during training. They are not hard constraints and do not guarantee theorem-level consistency of the final graph. Their purpose is to bias the parser away from local rule violations and to provide a measurable conflict signal through LVR. These regularizers are used as auxiliary soft losses and should not be interpreted as a guarantee of global geometric validity.
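Let $y_{ij}^{r}$ denote the binary supervision label for relation $r$ on candidate edge $(i, j)$, and let $z_c$ denote the weak label for the $c$-th atomic cue family. With $\Omega$ denoting the set of valid edge–relation entries included in the masked loss, $a_c$ the predicted probability of the $c$-th atomic cue, and $C$ the number of cue families, we define the supervised relation loss and the atomic probe loss as follows:
$$L_{\mathrm{sup}} = -\frac{1}{|\Omega|} \sum_{(i,j,r) \in \Omega} \left[ y_{ij}^{r} \log P_{ij}^{r} + \left(1 - y_{ij}^{r}\right) \log\left(1 - P_{ij}^{r}\right) \right]$$
$$L_{\mathrm{atom}} = -\frac{1}{C} \sum_{c=1}^{C} \left[ z_c \log a_c + \left(1 - z_c\right) \log\left(1 - a_c\right) \right]$$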
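The masked binary cross-entropy above can be sketched as follows. The dictionary-based data layout, the clamping constant, and the function name are assumptions made for illustration; entries of the valid set without a positive label are treated as negatives, as stated in Section 3.2.

```python
import math


def masked_bce(probs, labels, valid, eps=1e-7):
    """Mean binary cross-entropy over the valid (i, j, relation) entries.

    probs  -- dict mapping (i, j, relation) to a predicted probability
    labels -- dict mapping (i, j, relation) to 1 for derived positives
    valid  -- iterable of keys forming the masked set
    """
    total, count = 0.0, 0
    for key in valid:
        p = min(max(probs[key], eps), 1.0 - eps)  # clamp to avoid log(0)
        y = labels.get(key, 0)                    # unlabeled entries are negatives
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0


probs = {(0, 1, "Parallel"): 0.5, (0, 1, "Perpendicular"): 0.5}
labels = {(0, 1, "Parallel"): 1}
loss = masked_bce(probs, labels, probs.keys())
```

With both probabilities at 0.5, each entry contributes log 2, so the mean loss is log 2 regardless of the labels. The atomic probe loss has the same form, with cue families in place of edge–relation entries.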
For undirected relations such as Parallel, Perpendicular, and Intersect, the symmetry loss is defined as:
$$L_{\mathrm{sym}} = \frac{1}{|E|\,|R_{\mathrm{sym}}|} \sum_{(i,j) \in E} \sum_{r \in R_{\mathrm{sym}}} \left( P_{ij}^{r} - P_{ji}^{r} \right)^{2}$$
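As a small illustration, the symmetry penalty can be computed directly from its definition. The nested-dictionary layout and the function name are assumptions; the sketch also assumes that both orientations of every candidate edge are present.

```python
def symmetry_loss(P, edges, sym_relations):
    """Mean squared gap between P[(i, j)][r] and P[(j, i)][r]."""
    total = 0.0
    for i, j in edges:
        for r in sym_relations:
            total += (P[(i, j)][r] - P[(j, i)][r]) ** 2
    return total / (len(edges) * len(sym_relations))


# One asymmetric prediction: Parallel scored 0.9 one way and 0.5 the other.
P = {(0, 1): {"Parallel": 0.9}, (1, 0): {"Parallel": 0.5}}
loss = symmetry_loss(P, [(0, 1), (1, 0)], ["Parallel"])
```

Here each orientation contributes a squared gap of about 0.16, so the mean penalty is about 0.16; it falls to zero once the two directions agree.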
The transitivity term is computed over sampled local triplets from the candidate edge graph rather than over all possible primitive triples. This sampling strategy reduces computational cost and keeps the regularizer focused on local graph consistency.
Over sampled triplets $T_{\mathrm{tri}}$, we use the Łukasiewicz-style relaxation for transitivity:
$$L_{\mathrm{trans}} = \frac{1}{|T_{\mathrm{tri}}|} \sum_{(i,j,k) \in T_{\mathrm{tri}}} \max\left(0,\; P_{ij}^{\mathrm{para}} + P_{jk}^{\mathrm{para}} - P_{ik}^{\mathrm{para}} - 1\right)$$
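The hinge form of this relaxation is easiest to see on two triplets. The sketch below assumes Parallel probabilities stored in a dictionary keyed by ordered primitive pairs; the function name is hypothetical.

```python
def transitivity_loss(p_para, triplets):
    """Mean hinge penalty max(0, P_ij + P_jk - P_ik - 1) over sampled triplets."""
    total = 0.0
    for i, j, k in triplets:
        total += max(0.0, p_para[(i, j)] + p_para[(j, k)] - p_para[(i, k)] - 1.0)
    return total / len(triplets)


p_para = {
    (0, 1): 0.9, (1, 2): 0.9, (0, 2): 0.2,   # confident premises, weak conclusion
    (3, 4): 0.9, (4, 5): 0.9, (3, 5): 0.9,   # consistent triplet
}
loss = transitivity_loss(p_para, [(0, 1, 2), (3, 4, 5)])
```

The first triplet is penalized by roughly 0.6 because both premises are confident while the implied relation is not. The second contributes nothing, so the mean is roughly 0.3. The penalty is also zero whenever the two premise probabilities sum to at most one, which is why the term only acts on confidently predicted chains.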
The mutual-exclusivity loss is defined as:
$$L_{\mathrm{mutex}} = \frac{1}{|E|\,|M_{\mathrm{excl}}|} \sum_{(i,j) \in E} \sum_{(r_a, r_b) \in M_{\mathrm{excl}}} P_{ij}^{r_a} P_{ij}^{r_b}$$
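A short sketch of the mutual-exclusivity penalty follows. The text does not list the members of the exclusive-pair set, so the two pairs used here (Parallel with Perpendicular, Parallel with Intersect) are illustrative assumptions, as are the data layout and function name.

```python
def mutex_loss(P, edges, exclusive_pairs):
    """Mean product of probabilities of relations assumed mutually exclusive."""
    total = 0.0
    for edge in edges:
        for r_a, r_b in exclusive_pairs:
            total += P[edge][r_a] * P[edge][r_b]
    return total / (len(edges) * len(exclusive_pairs))


# Assumed exclusive pairs; the actual set is not specified in the text.
EXCLUSIVE = [("Parallel", "Perpendicular"), ("Parallel", "Intersect")]
P = {(0, 1): {"Parallel": 0.8, "Perpendicular": 0.5, "Intersect": 0.1}}
loss = mutex_loss(P, [(0, 1)], EXCLUSIVE)
```

The product form is largest when both relations of a pair are predicted confidently on the same edge and vanishes when either probability is zero.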
The final training objective is:
$$L = L_{\mathrm{sup}} + \lambda_{\mathrm{atom}} L_{\mathrm{atom}} + \lambda_{\mathrm{logic}} \left( L_{\mathrm{sym}} + L_{\mathrm{trans}} + L_{\mathrm{mutex}} \right)$$
In the reported implementation, $\lambda_{\mathrm{atom}}$ was set to 0.2 and $\lambda_{\mathrm{logic}}$ was set to 0.1. The same values were used across the controlled variants unless the corresponding component was ablated.
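To make the weighting concrete, the sketch below combines the five terms with the reported weights. Setting a weight to zero is one plausible way to ablate a component; whether the released code ablates components this way is an assumption, and the loss values are made up for illustration.

```python
def total_loss(l_sup, l_atom, l_sym, l_trans, l_mutex,
               lam_atom=0.2, lam_logic=0.1):
    """Supervised loss plus weighted atomic and logic regularizers."""
    return l_sup + lam_atom * l_atom + lam_logic * (l_sym + l_trans + l_mutex)


# Illustrative loss values, not taken from the experiments.
full = total_loss(1.0, 0.5, 0.1, 0.2, 0.3)
no_logic = total_loss(1.0, 0.5, 0.1, 0.2, 0.3, lam_logic=0.0)
```

With these values the full objective is approximately 1.16, of which the three logic terms together contribute only about 0.06, so the logic regularizers act as a mild bias relative to the supervised term.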

3.5. Design Rationale and Computational Discussion

The logic layer is intentionally low-order rather than theorem-complete: the current Ext-PGDP5K labels do not provide the typed supervision needed for stronger angle-, ratio-, or polygon-level constraints.

4. Results and Discussion

4.1. Experimental Design and Evaluation Protocol

Dataset and task setting. We use the official PGDP5K split, consisting of 3500 training samples, 500 validation samples, and 1000 test samples. The nominal label space contains five candidate relations, but the current main evaluation is defined over four active relations with positive instances: Intersect, Parallel, Perpendicular, and Bisect. Tangent is retained as an audited candidate relation but excluded from the current quantitative evaluation. We report Edge-F1, Macro-F1, Full Relation Accuracy (FRA), and Logic Violation Rate (LVR). Edge-F1 is computed as micro-F1 over valid edge–relation predictions, whereas Macro-F1 averages F1 across the four active relations. FRA measures graph-level exact matching and is therefore reported as an auxiliary metric because positive high-level relations are sparse. LVR measures the proportion of predicted edges or local triplets that violate predefined symmetry, transitivity, or mutual-exclusivity rules. Edge-F1, Macro-F1, FRA, and LVR are reported in percentage form unless otherwise specified. The prediction threshold is selected on the validation split by maximizing Macro-F1 and is then fixed for all test-set evaluations. All reported comparisons are conducted within the same derived Ext-PGDP5K candidate-edge protocol and should not be interpreted as direct comparisons with solver-oriented systems whose outputs are final answers, proof traces, or formal solution programs. This controlled setting is not intended to replace broader solver-level benchmarks; rather, it is designed to make modality effects, text pairing, threshold behavior, and local consistency conflicts directly observable at the relation-graph level. The absence of independently annotated relation labels and downstream solver-level evaluation is therefore treated as a limitation rather than resolved by the present protocol.
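The difference between Edge-F1 (micro) and Macro-F1, together with validation-based threshold selection, can be sketched as follows. The data layout, the threshold grid, and the function names are assumptions for illustration and are not taken from the released code.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def edge_and_macro_f1(scores, labels, relations, threshold):
    """Return (micro-F1 over all entries, mean per-relation F1).

    scores -- dict mapping (edge, relation) to a predicted probability
    labels -- dict mapping (edge, relation) to 1 for positive entries
    """
    counts = {r: [0, 0, 0] for r in relations}   # tp, fp, fn per relation
    for (edge, r), score in scores.items():
        pred = score >= threshold
        gold = labels.get((edge, r), 0) == 1
        if pred and gold:
            counts[r][0] += 1
        elif pred:
            counts[r][1] += 1
        elif gold:
            counts[r][2] += 1
    pooled = [sum(c[k] for c in counts.values()) for k in range(3)]
    micro = f1(*pooled)
    macro = sum(f1(*c) for c in counts.values()) / len(relations)
    return micro, macro


def select_threshold(scores, labels, relations, grid):
    """Pick the grid value that maximizes Macro-F1 (first one on ties)."""
    return max(grid, key=lambda t: edge_and_macro_f1(scores, labels, relations, t)[1])
```

In practice, `select_threshold` would be called once on validation scores and the returned value reused unchanged on the test split, which mirrors the protocol described above.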
The experiments were implemented in Python using PyTorch, with the released dependency file specifying PyTorch 2.2 or later; the project code is available at https://github.com/youger-zero/atom-main (accessed on 13 May 2026).

4.2. Protocol Statistics

Table 1 shows that the derived relation labels are highly sparse. Although the dataset contains more than half a million primitive pairs, the number of positive high-level relations is much smaller, with fewer than 0.5 active relations per sample on average. This sparsity explains why graph-level FRA should not be interpreted alone: a model may obtain a non-trivial FRA by predicting very few positive relations, while still failing to recover relation-level positives. Therefore, the following comparisons emphasize Edge-F1 and Macro-F1.
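The interaction between label sparsity and graph-level exact matching can be shown with a toy example. It assumes FRA is the fraction of samples whose predicted positive set equals the gold set exactly; the four toy samples and the function names are invented for illustration.

```python
def full_relation_accuracy(pred_graphs, gold_graphs):
    """Fraction of samples whose predicted positive set equals the gold set."""
    hits = sum(p == g for p, g in zip(pred_graphs, gold_graphs))
    return hits / len(gold_graphs)


def micro_f1(pred_graphs, gold_graphs):
    """Micro-F1 over positive relation entries pooled across samples."""
    tp = sum(len(p & g) for p, g in zip(pred_graphs, gold_graphs))
    fp = sum(len(p - g) for p, g in zip(pred_graphs, gold_graphs))
    fn = sum(len(g - p) for p, g in zip(pred_graphs, gold_graphs))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Three samples have no positive relation; one has a single Parallel edge.
gold = [set(), set(), set(), {("l1", "l2", "Parallel")}]
predict_nothing = [set(), set(), set(), set()]

fra = full_relation_accuracy(predict_nothing, gold)
edge_f1 = micro_f1(predict_nothing, gold)
```

A parser that never predicts a positive relation matches three of the four toy graphs exactly, giving an FRA of 0.75, while its micro-F1 is 0 because the single positive edge is missed.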

4.3. Main Comparison and Modality-Control Analysis

Table 2 shows that the proposed parser achieves the best Edge-F1 and Macro-F1 among the controlled settings. Compared with global fusion, Edge-F1 increases from 30.78% to 53.63%, and Macro-F1 increases from 16.16% to 42.56%. These improvements indicate that the proposed design improves positive-relation prediction rather than merely increasing graph-level matching accuracy. The shuffled-text condition performs far worse than the correctly paired text condition, which suggests that paired textual cues are useful under the current protocol. However, this control does not prove robust natural-language understanding, and residual dependence on lexical priors cannot be excluded. At the same time, the text-only baseline obtains a non-trivial FRA but zero Edge-F1 and Macro-F1, confirming that FRA alone is insufficient under sparse positive labels. Therefore, the main evidence for performance improvement is taken from Edge-F1 and Macro-F1 rather than from FRA alone.
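A shuffled-text control of this kind can be built by reassigning problem texts so that no diagram keeps its own text. The text does not state how the shuffling was implemented, so the cyclic-shift construction, the seed, and the function name below are assumptions.

```python
import random


def shuffle_text_pairing(samples, seed=0):
    """Return (image, text) pairs where each image receives another sample's text.

    A random order is drawn and each sample takes the text of the next sample
    in that order, so no index is paired with its own text when there are at
    least two samples.
    """
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)
    shuffled = [None] * len(samples)
    for pos, idx in enumerate(order):
        donor = order[(pos + 1) % len(order)]
        image, _ = samples[idx]
        _, text = samples[donor]
        shuffled[idx] = (image, text)
    return shuffled


samples = [(f"img{i}", f"text{i}") for i in range(5)]
control = shuffle_text_pairing(samples)
```

Because the marginal text distribution is unchanged, a model that relied only on dataset-level language priors would be expected to score similarly on `samples` and `control`, whereas a model that uses the pairing should degrade. Duplicate texts in a real dataset could still leave some pairs effectively unchanged.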

4.4. Relation-Wise Analysis

As shown in Table 3, the relation-wise results reveal uneven behavior across categories, which is important for interpreting the scope of the proposed parser. The proposed model mainly improves Parallel and Perpendicular, which are closely associated with explicit textual cues and line–line geometric constraints. In particular, Parallel increases from 0.00% to 55.70%, and Perpendicular increases from the best baseline value of 23.39% to 60.95%. In contrast, Intersect is still better handled by the global-fusion baseline, and Bisect remains at zero F1. This result indicates that the current model is more effective for text-sensitive line–line relations than for higher-order relations such as bisectors. The weaker Intersect result suggests that not all relation types benefit equally from explicit textual cues, and that some visually grounded relations may already be handled effectively by simpler fusion strategies. The failure on Bisect suggests that the current pairwise edge-classification formulation is insufficient for relations involving higher-order geometric constructions. Unlike Parallel and Perpendicular, which are mainly line–line relations, Bisect often depends on an angle, a segment partition, or a composed construction involving more than two primitives. Therefore, the zero F1 on Bisect should be interpreted as a limitation of the present parser design rather than as a failure of text conditioning alone. This also limits immediate deployment in downstream solvers for problems that require angle-bisector, segment-partition, or composed-construction reasoning.
Illustrative qualitative cases are shown in Figure 3. Cases A-C illustrate the type of grounding behavior expected for Parallel, Perpendicular, and Intersect, whereas Case D summarizes the unresolved higher-order ambiguity of Bisect under the current design.

4.5. Ablation, Efficiency, and Sensitivity Analysis

Table 4 further clarifies the contribution of each component, but the results should not be interpreted as monotonic gains from every component. Instead, they reveal nonlinear interactions among atomic cues, feedback refinement, and logic regularization. The atomic probe is the main source of improvement, increasing Edge-F1 from 27.05% to 46.45% and Macro-F1 from 16.43% to 36.82%. Adding Logic Loss on top of the atomic probe improves Edge-F1 from 46.45% to 52.14% and reduces LVR from 0.116% to 0.086% in the no-feedback setting, but this improvement is not uniform across metrics: Macro-F1 decreases from 36.82% to 29.85%, suggesting that the improvement is concentrated in micro-level prediction and does not translate into a uniform gain in class-balanced performance. Feedback alone reduces both Edge-F1 and Macro-F1, which suggests that preliminary feedback may introduce noisy relational evidence. Therefore, feedback should be viewed as an interacting refinement mechanism rather than as an independently stable contributor. The full model achieves the best Edge-F1, Macro-F1, and FRA, but it also predicts the largest number of positive relations and yields the highest LVR. This supports the interpretation that relation coverage and rule-defined consistency are in tension under the current design. LVR should therefore be interpreted jointly with predicted-positive coverage: a model can obtain a low LVR by predicting very few positive relations, whereas a model that recovers more relation edges may expose more opportunities for rule-defined conflicts. Future work may dynamically tune the logic weight or introduce coverage-aware consistency objectives to balance relation recall and structural validity.
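The coupling between LVR and predicted-positive coverage can be made concrete with a sketch of a violation counter over a thresholded graph. The exact normalization of LVR is not specified in the text, so the ratio of violated checks to applicable checks used here, the assumed exclusive pair, and the function name are illustrative only.

```python
def logic_violation_rate(pred, sym_relations, exclusive_pairs, triplets,
                         trans_relation="Parallel"):
    """Violated rule checks divided by applicable checks on a predicted graph.

    pred -- set of (i, j, relation) entries predicted positive
    """
    checks = violations = 0
    # Symmetry: a predicted symmetric relation should hold in both directions.
    for i, j, r in pred:
        if r in sym_relations:
            checks += 1
            violations += int((j, i, r) not in pred)
    # Mutual exclusivity: exclusive relations should not share an edge.
    for i, j in {(i, j) for i, j, _ in pred}:
        for r_a, r_b in exclusive_pairs:
            checks += 1
            violations += int((i, j, r_a) in pred and (i, j, r_b) in pred)
    # Transitivity: two predicted premises should imply the third edge.
    for i, j, k in triplets:
        if (i, j, trans_relation) in pred and (j, k, trans_relation) in pred:
            checks += 1
            violations += int((i, k, trans_relation) not in pred)
    return violations / checks if checks else 0.0


sparse = set()
dense = {(0, 1, "Parallel"), (1, 0, "Parallel"), (0, 1, "Perpendicular")}
rules = dict(sym_relations={"Parallel", "Perpendicular"},
             exclusive_pairs=[("Parallel", "Perpendicular")], triplets=[])
lvr_sparse = logic_violation_rate(sparse, **rules)
lvr_dense = logic_violation_rate(dense, **rules)
```

The empty prediction triggers no checks and therefore scores zero, while the denser prediction exposes five checks and violates two of them. This is the sense in which a low LVR can simply reflect low coverage.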
Figure 4 compares representative validation-set LVR trajectories for AP-only (atomic probe only) and AP+Logic (atomic probe with logic regularization). Model selection is performed using validation Macro-F1 rather than minimum validation LVR. Therefore, these trajectories should not be interpreted as the final test-set LVR of the full model, whose higher LVR is reported in Table 4 and is associated with increased predicted-positive coverage.
Figure 5 reports the feedback-round sensitivity analysis and additionally includes Macro-F1 to show the relation-level effect of varying the number of feedback rounds.
Table 5 shows that the proposed model remains lightweight in terms of trainable parameters. Inference time is measured on the same hardware after warm-up and averaged over the test split. Small differences below approximately 0.3 ms should be interpreted as measurement variation rather than meaningful speed differences. The reported parameter count excludes frozen pretrained text-encoder parameters and reflects the trainable parser components. Among the feedback settings, two rounds provide the best FRA–efficiency trade-off, achieving 77.8% FRA with an inference time of 8.02 ms per sample. Three rounds reduce LVR to 0.028% but also lower FRA to 74.3%, suggesting that the configuration with the lowest rule-defined conflict rate is not necessarily the configuration with the best prediction performance.

4.6. Limitations and Threats to Validity

The first threat to validity comes from the derived-label protocol itself. Because the present task extends PGDP5K into a text-conditioned parser-level setting, the results should be interpreted within this derived protocol rather than as a replacement for the original primitive-level benchmark. The derived labels make it possible to study high-level relation parsing, but they also introduce dependence on rule-based label construction: the trained parser may partly learn to approximate the label-derivation pipeline itself. The present results therefore do not prove general semantic understanding or robustness beyond the adopted protocol.
Another threat comes from text-prior effects and weak supervision in the semantic probe. Although the shuffled-text control suggests that correctly paired text is important, the atomic cues are still derived from surface-level textual patterns. Therefore, the reported improvements should be interpreted as parser-level gains under the adopted protocol, not as evidence that all text-conditioned geometric ambiguity has been solved. Because textual cues participate in both the derived-label protocol and the model input, residual text-prior leakage cannot be excluded. The text-only and shuffled-text controls reduce this concern but do not fully resolve it.
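One way to build a shuffled-text control of this kind is sketched below. The study does not describe how its shuffling was implemented, so the cyclic-shift construction and the function name `shuffled_text_pairs` are assumptions; the sketch only guarantees that no sample keeps the text at its own index.

```python
import random
from typing import List, Sequence, Tuple


def shuffled_text_pairs(images: Sequence[str],
                        texts: Sequence[str],
                        seed: int = 0) -> List[Tuple[str, str]]:
    """Pair every image with the text of a different sample."""
    if len(images) != len(texts):
        raise ValueError("images and texts must have the same length")
    n = len(images)
    if n < 2:
        raise ValueError("need at least two samples to mismatch texts")
    order = list(range(n))
    random.Random(seed).shuffle(order)
    # Sample order[k] receives the text of order[(k + 1) % n]; because the
    # entries of a permutation are distinct, no index is mapped to itself.
    donor = {order[k]: order[(k + 1) % n] for k in range(n)}
    return [(images[i], texts[donor[i]]) for i in range(n)]


if __name__ == "__main__":
    imgs = [f"img{i}" for i in range(4)]
    txts = [f"text{i}" for i in range(4)]
    for pair in shuffled_text_pairs(imgs, txts):
        print(pair)
```

Two different problems may still share identical wording, so index-level mismatching does not by itself remove every text prior, which is consistent with the residual-leakage caveat above.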
A further limitation concerns relation complexity. Parallel and Perpendicular benefit most clearly from explicit textual cues and low-order geometric constraints, whereas Bisect remains unresolved in the current experiments, with an F1 of 0.00 for every variant in Table 3. This suggests that pairwise relation classification and lexical cue extraction are insufficient for higher-order grounding involving angles, segments, or composed geometric constructions. The atomic probe relies on a seed vocabulary and normalization rules. Although this design is interpretable, it is closer to lexical cue extraction than to full semantic parsing, and may fail under paraphrases or implicit descriptions.
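The paraphrase failure mode can be illustrated with a toy cue extractor. The seed vocabulary, the normalization rule, and the function names below are hypothetical stand-ins chosen for this sketch; they are not the vocabulary files or rules released with the study.

```python
import re
from typing import Dict, List, Set

# Hypothetical seed vocabulary; the actual vocabulary files are not reproduced here.
SEED_VOCAB: Dict[str, List[str]] = {
    "Parallel": ["parallel", "∥"],
    "Perpendicular": ["perpendicular", "⊥"],
    "Bisect": ["bisect"],
}


def normalize(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def extract_cues(text: str) -> Set[str]:
    """Return every relation whose seed term occurs as a substring of the text."""
    norm = normalize(text)
    return {rel for rel, seeds in SEED_VOCAB.items()
            if any(seed in norm for seed in seeds)}


if __name__ == "__main__":
    print(extract_cues("AB is perpendicular to CD"))   # explicit cue is found
    print(extract_cues("AB meets CD at 90 degrees"))   # paraphrase yields no cue
```

The second sentence expresses the same relation as the first but contains no seed term, so a purely lexical extractor of this kind returns nothing for it.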
Another limitation is the absence of downstream solver-level validation. The predicted relation graph is intended to provide candidate symbolic constraints, such as Parallel and Perpendicular, for future theorem-guided solvers. However, the present study does not demonstrate improved final-answer accuracy, proof generation, or formal reasoning success.
In addition, the full model predicts more positive relations and obtains better Edge-F1 and Macro-F1, but it also yields a higher LVR. This indicates a trade-off between relation coverage and rule-defined consistency. Future work should investigate stronger grounding mechanisms, improved conflict resolution, and cross-dataset transfer before connecting the parsed relation graph to downstream theorem-guided solvers.
A reproducibility-related limitation is that the derived protocol depends on the correctness of the rule-based label-derivation pipeline. Although the accompanying materials provide derived labels, split identifiers, vocabulary files, and evaluation scripts, independent verification of the label construction remains important for future extensions of the protocol.
The main baselines are protocol-aligned variants designed to isolate component effects. Therefore, they do not establish competitiveness with stronger external parser-oriented baselines or solver-integrated systems. These limitations constrain the scope of the claims, but they also define the intended use of the present work: a reproducible parser-level benchmark and analysis framework for studying text-guided geometric relation prediction before downstream solver integration.

5. Conclusions

5.1. Main Findings

This study investigates geometric relation parsing as a text-conditioned, logic-aware structured prediction problem under a derived Ext-PGDP5K protocol. The proposed parser combines atomic semantic probing, iterative visual–semantic feedback, and low-order logic consistency regularization while keeping the visual and textual backbones lightweight. Under the active four-relation evaluation, the model improves Edge-F1 and Macro-F1 relative to the image-only and global-fusion baselines. The strongest gains are observed for Parallel and Perpendicular, suggesting the value of explicit textual cues for ambiguity-sensitive relation prediction under the current weakly supervised parser-level setting. These improvements should be interpreted within the derived Ext-PGDP5K protocol and should not be taken as evidence of general geometric language understanding. Within this scope, the value of the study lies in providing controlled evidence that paired textual cues and relation-specific cue extraction can improve sparse relation-graph prediction for selected geometric relations.
The revised ablation further shows that the atomic probe is the main source of improvement and that Logic Loss is useful in the no-feedback setting. At the same time, feedback alone is not a stable independent contributor, and the full model shows a higher LVR because it predicts more positive relations. These findings suggest that text-conditioned relation coverage and rule-defined consistency should be optimized jointly rather than treated as automatically aligned objectives. They also show that the current architecture is not uniformly stable across all component combinations. The unresolved Bisect category further suggests that future parsers should incorporate higher-order geometric construction modeling rather than relying only on pairwise edge classification.

5.2. Limitations and Future Work

Future work will focus on constructing reliable Tangent annotations, improving higher-order grounding for Bisect, reducing rule-defined graph conflicts at higher predicted-positive coverage, validating cross-dataset transfer, and comparing with stronger external parser baselines. It should also connect the parsed relation graphs to downstream theorem-guided solvers and evaluate whether they improve solver accuracy, final-answer prediction, proof generation, or theorem-guided reasoning success. In particular, phrase-level grounding and typed geometric construction modeling may be necessary for relations that cannot be represented reliably by pairwise edge classification alone. Future extensions should also replace or supplement seed-vocabulary cue extraction with more robust phrase-level grounding or learned semantic parsing, and should evaluate the protocol with manual label auditing or independently verified relation annotations. Thus, the present work should be viewed as a constrained but reproducible step toward more reliable geometry parsing, rather than as a complete AGP system.

Author Contributions

Conceptualization, X.Z. and P.J.; methodology, X.Z.; software, X.Z.; validation, X.Z., L.W., and P.J.; formal analysis, X.Z.; investigation, X.Z. and L.W.; resources, P.J. and Q.S.; data curation, L.W. and X.Z.; writing—original draft preparation, X.Z.; writing—review and editing, P.J., L.W. and Q.S.; visualization, X.Z. and L.W.; supervision, P.J. and Q.S.; project administration, P.J.; funding acquisition, P.J. and Q.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the General Project of Natural Science Foundation of Henan Province (No. 262300421801) and the Soft Science Project of Henan Province (No. 262400410529).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original PGDP5K dataset is publicly available from the official PGDP repository and dataset page (https://github.com/mingliangzhang2018/PGDP, accessed on 15 May 2026; http://www.nlpr.ia.ac.cn/databases/CASIA-PGDP5K/, accessed on 15 May 2026). The project code and review materials are available at https://github.com/youger-zero/atom-main (accessed on 15 May 2026). A stable archival release will be provided through Zenodo or an equivalent repository upon acceptance.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Ma, J.; Wang, W.; Jin, Q. A Survey of Deep Learning for Geometry Problem Solving. arXiv 2025, arXiv:2507.11936. [Google Scholar] [CrossRef]
  2. Seo, M.; Hajishirzi, H.; Farhadi, A.; Etzioni, O.; Malcolm, C. Solving Geometry Problems: Combining Text and Diagram Interpretation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1466–1476. [Google Scholar] [CrossRef]
  3. Zhang, M.-L.; Yin, F.; Hao, Y.-H.; Liu, C.-L. Plane Geometry Diagram Parsing. arXiv 2022, arXiv:2205.09363. [Google Scholar] [CrossRef]
  4. Lu, P.; Qiu, L.; Yu, W.; Welleck, S.; Chang, K.-W. A Survey of Deep Learning for Mathematical Reasoning. arXiv 2022, arXiv:2212.10535. [Google Scholar] [CrossRef]
  5. Zhu, N.; Zhang, X.; Huang, Q.; Zhu, F.; Zeng, Z.; Leng, T. FGeo-Parser: Autoformalization and Solution of Plane Geometric Problems. Symmetry 2025, 17, 8. [Google Scholar] [CrossRef]
  6. Trinh, T.H.; Wu, Y.; Le, Q.V.; He, H.; Luong, T. Solving Olympiad Geometry without Human Demonstrations. Nature 2024, 625, 476–482. [Google Scholar] [CrossRef] [PubMed]
  7. Lu, P.; Gong, R.; Jiang, S.; Qiu, L.; Huang, S.; Liang, X.; Zhu, S.-C. Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; pp. 6774–6786. [Google Scholar] [CrossRef]
  8. Li, Z.-Z.; Zhang, M.-L.; Yin, F.; Liu, C.-L. LANS: A Layout-Aware Neural Solver for Plane Geometry Problem. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 2596–2608. [Google Scholar] [CrossRef]
  9. Zhang, M.-L.; Li, Z.-Z.; Yin, F.; Lin, L.; Liu, C.-L. Fuse, Reason and Verify: Geometry Problem Solving with Parsed Clauses from Diagram. arXiv 2024, arXiv:2407.07327. [Google Scholar] [CrossRef]
  10. Ping, B.; Luo, M.; Dang, Z.; Wang, C.; Jia, C. AutoGPS: Automated Geometry Problem Solving via Multimodal Formalization and Deductive Reasoning. In Proceedings of the Fourteenth International Conference on Learning Representations, Rio de Janeiro, Brazil, 23–27 April 2026; Available online: https://openreview.net/forum?id=PVtZnUh04m (accessed on 15 May 2026).
  11. Zhang, Z.; Cheng, J.-K.; Deng, J.; Tian, L.; Ma, J.; Qin, Z.; Zhang, X.; Zhu, N.; Leng, T. Diagram Formalization Enhanced Multi-Modal Geometry Problem Solver. arXiv 2024, arXiv:2409.04214. [Google Scholar] [CrossRef]
  12. Murphy, L.; Yang, K.; Sun, J.; Li, Z.; Anandkumar, A.; Si, X. Autoformalizing Euclidean Geometry. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; Proceedings of Machine Learning Research. Volume 235, pp. 36847–36893. [Google Scholar]
  13. Lu, P.; Qiu, L.; Chen, J.; Xia, T.; Zhao, Y.; Zhang, W.; Yu, Z.; Liang, X.; Zhu, S.-C. IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning. arXiv 2021, arXiv:2110.13214. [Google Scholar] [CrossRef]
  14. Chen, J.; Tang, J.; Qin, J.; Liang, X.; Liu, L.; Xing, E.P.; Lin, L. GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 513–523. [Google Scholar] [CrossRef]
  15. Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 5100–5111. [Google Scholar] [CrossRef]
  16. Li, C.; Xu, H.; Tian, J.; Wang, W.; Yan, M.; Bi, B.; Ye, J.; Chen, H.; Xu, G.; Cao, Z.; et al. mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 7241–7259. [Google Scholar] [CrossRef]
  17. Agrawal, A.; Batra, D.; Parikh, D.; Kembhavi, A. Don’t Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4971–4980. [Google Scholar] [CrossRef]
  18. Geirhos, R.; Jacobsen, J.-H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F.A. Shortcut Learning in Deep Neural Networks. Nat. Mach. Intell. 2020, 2, 665–673. [Google Scholar] [CrossRef]
  19. Besold, T.R.; d’Avila Garcez, A.; Bader, S.; Bowman, H.; Domingos, P.; Hitzler, P.; Kühnberger, K.-U.; Lamb, L.C.; Lowd, D.; Lima, P.M.V.; et al. Neural-Symbolic Learning and Reasoning: A Survey and Interpretation. arXiv 2017, arXiv:1711.03902. [Google Scholar] [CrossRef]
  20. Xu, J.; Zhang, Z.; Friedman, T.; Liang, Y.; Van den Broeck, G. A Semantic Loss Function for Deep Learning with Symbolic Knowledge. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Proceedings of Machine Learning Research. Volume 80, pp. 5502–5511. [Google Scholar]
  21. Hu, Z.; Ma, X.; Liu, Z.; Hovy, E.; Xing, E. Harnessing Deep Neural Networks with Logic Rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; pp. 2410–2420. [Google Scholar] [CrossRef]
  22. Diligenti, M.; Gori, M.; Saccà, C. Semantic-Based Regularization for Learning and Inference. Artif. Intell. 2017, 244, 143–165. [Google Scholar] [CrossRef]
  23. Fischer, M.; Balunovic, M.; Drachsler-Cohen, D.; Gehr, T.; Zhang, C.; Vechev, M. DL2: Training and Querying Neural Networks with Logic. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Proceedings of Machine Learning Research. Volume 97, pp. 1931–1941. [Google Scholar]
  24. Bach, S.H.; Broecheler, M.; Huang, B.; Getoor, L. Hinge-Loss Markov Random Fields and Probabilistic Soft Logic. J. Mach. Learn. Res. 2017, 18, 1–67. [Google Scholar]
  25. Dong, H.; Mao, J.; Lin, T.; Wang, C.; Li, L.; Zhou, D. Neural Logic Machines. arXiv 2019, arXiv:1904.11694. [Google Scholar] [CrossRef]
  26. Manhaeve, R.; Dumančić, S.; Kimmig, A.; Demeester, T.; De Raedt, L. DeepProbLog: Neural Probabilistic Logic Programming. arXiv 2018, arXiv:1805.10872. [Google Scholar] [CrossRef]
  27. Serafini, L.; d’Avila Garcez, A. Logic Tensor Networks: Deep Learning and Logical Reasoning from Data and Knowledge. arXiv 2016, arXiv:1606.04422. [Google Scholar] [CrossRef]
  28. Ratner, A.; Bach, S.H.; Ehrenberg, H.; Fries, J.; Wu, S.; Ré, C. Snorkel: Rapid Training Data Creation with Weak Supervision. Proc. VLDB Endow. 2017, 11, 269–282. [Google Scholar] [CrossRef] [PubMed]
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  30. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar] [CrossRef]
  31. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar] [CrossRef]
Figure 1. Motivation of text-conditioned geometric relation parsing.
Figure 2. Overview of the proposed text-guided and logic-regularized parser.
Figure 3. Illustrative qualitative cases under the active four-relation setting. These examples are parser-level cases and do not demonstrate downstream solver-level correctness.
Figure 4. Representative validation-set LVR trajectories for AP-only and AP+Logic during training.
Figure 5. Sensitivity of the full model to the number of feedback rounds.
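Table 1. Statistics of the Ext-PGDP5K protocol.

| Set | N | Pair | I | T | P | ⊥ | B | Avg |
|---|---|---|---|---|---|---|---|---|
| Train | 3500 | 363,872 | 571 | 0 | 176 | 685 | 105 | 0.439 |
| Val. | 500 | 55,912 | 77 | 0 | 29 | 100 | 23 | 0.458 |
| Test | 1000 | 110,086 | 175 | 0 | 47 | 204 | 53 | 0.479 |
| Total | 5000 | 529,870 | 823 | 0 | 252 | 989 | 181 | 0.449 |
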
I = Intersect; T = Tangent; P = Parallel; ⊥ = Perpendicular; B = Bisect; Avg = average active relations per sample. Tangent is retained in the nominal label space but has no positive instances in the current derived protocol.
Table 2. Main comparison and modality controls.

| Method | Text | Edge-F1 (%) | Macro-F1 (%) | FRA (%) | LVR (%) |
|---|---|---|---|---|---|
| Text-only | Orig. | 0.00 | 0.00 | 69.0 | 0.000 |
| Image-only | None | 27.05 | 16.43 | 71.6 | 0.027 |
| Global fusion | Orig. | 30.78 | 16.16 | 72.8 | 0.035 |
| Img. + shuf. text | Shuf. | 3.67 | 2.42 | 70.2 | 0.001 |
| Ours | Paired | 53.63 | 42.56 | 77.8 | 0.244 |

Edge-F1, Macro-F1, FRA, and LVR are reported in %. Orig. = original text; Shuf. = shuffled text; Paired = correctly paired original text. All methods are protocol-aligned parser variants rather than direct comparisons with solver-oriented systems.
Table 3. Relation-wise F1 under the active four-relation setting.

| Rel. | Img. | Fusion | Ours | Δ |
|---|---|---|---|---|
| Int. | 42.32 | 59.97 | 53.60 | −6.37 |
| Par. | 0.00 | 0.00 | 55.70 | +55.70 |
| Perp. | 23.39 | 4.66 | 60.95 | +37.56 |
| Bis. | 0.00 | 0.00 | 0.00 | 0.00 |
| Macro avg. | 16.43 | 16.16 | 42.56 | +26.13 |

Rel. = Relation; Img. = Image-only; Fusion = Global text fusion; Δ = Ours minus the best baseline. Int. = Intersect; Par. = Parallel; Perp. = Perpendicular; Bis. = Bisect.
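As a consistency check on Table 3, the macro average in each column is the unweighted mean of the four active relation-wise F1 scores, and Δ is the score of the proposed parser minus the better of the two baselines. The short sketch below recomputes both from the tabulated values; the function and variable names are illustrative assumptions.

```python
from statistics import mean

# Relation-wise F1 (%) copied from Table 3: (Image-only, Global fusion, Ours).
TABLE3 = {
    "Intersect": (42.32, 59.97, 53.60),
    "Parallel": (0.00, 0.00, 55.70),
    "Perpendicular": (23.39, 4.66, 60.95),
    "Bisect": (0.00, 0.00, 0.00),
}


def macro_f1(column: int) -> float:
    """Unweighted mean of the per-relation F1 scores in one column."""
    return mean(scores[column] for scores in TABLE3.values())


def delta_vs_best_baseline(relation: str) -> float:
    """Ours minus the better of the two baselines for one relation."""
    img, fusion, ours = TABLE3[relation]
    return ours - max(img, fusion)


if __name__ == "__main__":
    for label, col in (("Image-only", 0), ("Global fusion", 1), ("Ours", 2)):
        print(f"{label:14s} macro-F1 = {macro_f1(col):.2f}")
    for relation in TABLE3:
        print(f"{relation:14s} delta = {delta_vs_best_baseline(relation):+.2f}")
```

Because Bisect contributes 0.00 in every column, it pulls each macro average down by a quarter of what the other three relations would otherwise give, which is why Macro-F1 is more sensitive to the unresolved category than Edge-F1.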
Table 4. Revised ablation study of the proposed components.

| Variant | Edge-F1 (%) | Macro-F1 (%) | FRA (%) | LVR (%) | Pos./S |
|---|---|---|---|---|---|
| Base | 27.05 | 16.43 | 71.6 | 0.027 | 0.1125 |
| +AP | 46.45 | 36.82 | 75.6 | 0.116 | 0.2530 |
| +AP+Logic | 52.14 | 29.85 | 77.0 | 0.086 | 0.2690 |
| +AP+Fb | 39.38 | 29.29 | 75.1 | 0.091 | 0.2320 |
| Full | 53.63 | 42.56 | 77.8 | 0.244 | 0.2985 |
| Gold | | | | | 0.4790 |

AP = Atomic probe; Fb = Feedback; Pos./S = average predicted positive relations per sample. Edge-F1, Macro-F1, FRA, and LVR are reported in %. The Gold row reports the average number of ground-truth positive relations per sample. The ablation results show non-monotonic component interactions rather than independent monotonic gains from each component.
Table 5. Efficiency and complexity analysis.

| Method | Rds. | Params (M) | Time (ms) | FRA (%) | LVR (%) |
|---|---|---|---|---|---|
| Image-only | N.A. | 3.89 | 7.82 | 71.6 | 0.027 |
| Global fusion | N.A. | 3.89 | 7.84 | 72.8 | 0.035 |
| Ours-1R | 1 | 3.89 | 7.55 | 76.3 | 0.115 |
| Ours-2R | 2 | 3.89 | 8.02 | 77.8 | 0.244 |
| Ours-3R | 3 | 3.89 | 8.45 | 74.3 | 0.028 |

Rds. = feedback rounds. Params are trainable parser parameters in millions. Time is inference time per sample in ms. FRA and LVR are reported in %.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Jian, P.; Zhang, X.; Wu, L.; Sun, Q. Text-Guided Geometric Relation Parsing with Logic Regularization. Electronics 2026, 15, 2460. https://doi.org/10.3390/electronics15112460
