1. Introduction
The rapid expansion of digital legislation, administrative regulations, compliance manuals, and policy frameworks has increased the demand for computational techniques that convert unstructured legal and policy text into structured formats suitable for automated reasoning and decision support. In many practical contexts, legal and policy determinations depend not only on the interpretation of individual provisions but also on the simultaneous satisfaction of limits, exceptions, thresholds, dependencies, and mutually exclusive requirements. This makes legal and policy analysis a particularly complex area for Natural Language Processing (NLP), because the underlying reasoning process is frequently combinatorial rather than purely linguistic.
Recent advances in Legal NLP have markedly improved the automated analysis of legal and regulatory texts, including document classification, named entity recognition, semantic search, argument mining, and information extraction. Nevertheless, many current methods still concentrate on surface-level retrieval or isolated extraction tasks and do not adequately address the shift from textual provisions to formal decision models. Specifically, policy information extraction frequently identifies pertinent elements, responsibilities, prohibitions, eligibility criteria, or exceptions, yet does not transform this information into machine-readable formats that support robust computational reasoning. The gap between legal text understanding and formal decision analysis therefore remains insufficiently addressed.
This limitation is particularly significant in tasks such as eligibility evaluation, compliance validation, requirement fulfillment, conflict identification, and policy-driven resource distribution. Such tasks often require selecting a viable subset of criteria, identifying a minimal covering set of requirements, excluding incompatible alternatives, or verifying the joint satisfiability of a collection of constraints. From the standpoint of theoretical computer science, these tasks are closely related to canonical combinatorial problem classes, including the classical set- and graph-based formulations of complexity theory such as subset sum, partition, set packing, set covering, independent set, and vertex cover [1]. The Cook–Karp framework offers not only a historical complexity-theoretic categorization but also a valuable formal foundation for modeling decision-oriented legal and policy tasks [2]. Prior work on set partitioning and subset construction has also shown the importance of set-based transformations for NP-complete formulations [3].
At the same time, advances in dense semantic retrieval offer new opportunities for addressing the document-related aspects of this problem. Specifically, MPNet-based sentence embeddings can capture semantic similarity beyond mere keyword overlap, which makes them suitable for retrieving legally pertinent passages from large and diverse corpora. However, retrieval alone is inadequate. To support formal reasoning, the retrieved text must be further processed with Legal NLP and policy information extraction techniques that identify normative operators, thresholds, exceptions, actor roles, temporal conditions, and other constraint-bearing elements. Only after this transformation can the extracted content be aligned with formal computational structures.
This study proposes an integrated approach that connects semantic legal text processing with formal combinatorial modeling [4]. Within this system, MPNet functions as the semantic retrieval element for identifying pertinent passages, Legal NLP facilitates the extraction of policy-relevant constraints, and formal set-based modeling converts these constraints into decision structures appropriate for computational analysis.
The proposed methodology integrates three components that are often studied separately:
- (1) MPNet-based dense retrieval for context-aware identification of pertinent legal and policy fragments;
- (2) Legal NLP and policy information extraction for discerning obligations, prohibitions, exceptions, thresholds, and eligibility criteria;
- (3) formal mapping of the extracted elements to canonical NP-complete problem frameworks, specifically set cover, set packing, subset sum, vertex cover, and independent set.

Unlike methods confined to retrieval or extraction, the current framework facilitates decision-oriented reasoning regarding legal and policy documents [5].
The framework is evaluated on a legal-policy corpus with document-level and passage-level annotations designed for retrieval, extraction, and decision-oriented formalization tasks. The evaluation includes lexical and reduced-system baselines, task-level performance analysis, and expert-validated qualitative case studies.
The main contributions of this paper are as follows. First, it proposes an integrated framework for transforming legal and policy text into formal constraint-based decision models. Second, it introduces an MPNet-based semantic retrieval layer for selecting legally relevant evidence. Third, it applies Legal NLP and policy information extraction to identify structured normative elements from retrieved passages. Fourth, it maps extracted constraints to canonical NP-complete formulations, including set cover, set packing, subset sum, partition, independent set, and vertex cover. Fifth, it evaluates the framework at retrieval, extraction, and end-to-end decision levels.
The remainder of this paper is structured as follows. Section 2 reviews the literature on Legal NLP, policy information extraction, semantic retrieval with transformer-based models, and the modeling of NP-complete problems in decision-support contexts. Section 3 describes the proposed method, including the MPNet retrieval layer, the Legal NLP extraction process, and the formal mapping of extracted constraints to Cook–Karp NP-complete problem formulations. Section 4 presents the experimental design and evaluation methodology. Section 5 discusses the findings, limitations, and implications of the proposed framework. Finally, Section 6 concludes the paper and outlines directions for future research.
Unlike retrieval-only or extraction-only legal NLP pipelines, the present study evaluates the framework at three levels: evidence retrieval, structured constraint extraction, and end-to-end decision modeling. This design is intended to make the contribution empirically traceable and methodologically reproducible.
2. Literature Review
2.1. Legal NLP and Policy Information
In recent years, research on the computerized processing of legal and policy documents has expanded significantly, becoming a new subfield of Natural Language Processing known as Legal NLP. The field includes a variety of tasks such as legal information retrieval, document classification, named entity recognition, judgment prediction, semantic similarity, question answering, summarization, and information extraction. Despite recent progress, legal and policy texts remain challenging for modern NLP methods because of their length, structural complexity, specialized terminology, and reliance on context, cross-references, exceptions, and formal definitions. These qualities limit the effectiveness of solely surface-level or keyword-based approaches, highlighting the critical need for methods that can capture deeper semantic and logical structure.
Within this broader context, policy information extraction has become increasingly important because many practical legal and administrative tasks require more than simply finding relevant documents or passages. Real-world decision-making often depends on extracting obligations, prohibitions, permissions, thresholds, eligibility criteria, exceptions, temporal conditions, and actor-specific circumstances from unstructured text. Previous research has demonstrated that legal information extraction can aid downstream applications such as compliance analysis, legal decision assistance, regulatory monitoring, and automated eligibility evaluation. However, much of the literature still focuses on individual subtasks such as entity recognition or relation extraction rather than on translating legal language into formal computational models suitable for reasoning.
To position the present work more precisely, Table 1 summarizes the difference between retrieval-oriented, extraction-oriented, and reasoning-oriented approaches. The proposed framework is distinguished by the integration of all four components (semantic retrieval, information extraction, formal constraint representation, and combinatorial decision modeling) within a single decision-oriented pipeline.
Most prior work in Legal NLP stops at document retrieval, entity extraction, or relation identification. In contrast, the present study focuses on the next step: converting extracted legal-policy information into explicit constraint structures suitable for formal reasoning.
Recent benchmarks such as CUAD, ContractNLI, LexGLUE, MAUD, and LegalBench illustrate the breadth of current legal NLP research, but they mainly evaluate understanding, retrieval, or reasoning at the textual level and do not directly address the transformation of extracted legal-policy conditions into explicit combinatorial decision structures.
2.2. Semantic Retrieval for Legal and Policy Text
Legal information retrieval and regulatory information retrieval are closely related research areas. In large legal corpora, related clauses are often distributed across multiple documents and expressed in different terminological forms. As a result, recent research has shifted away from lexical matching and toward dense semantic retrieval models. Transformer-based embedding models improve retrieval quality by capturing semantic similarity beyond exact word overlap. Among these models, MPNet has received special interest because it combines powerful contextual representation learning with robust sentence-level semantic encoding. In practical retrieval workflows, MPNet-based embeddings are often used to identify passages that are conceptually relevant even when they differ substantially in wording from the query [6]. This makes them particularly useful for legal and policy fields, where semantically linked norms may occur in various linguistic formulations.
However, retrieval does not address the basic issue of decision-oriented legal reasoning. A retrieved section may be significant, but it still needs to be analyzed and transformed into structured information. In legal and policy contexts, this entails determining which sections of the text express mandatory conditions, establish exclusions or exceptions, define quantitative thresholds, and specify relationships among actors, activities, and resources. As a result, an effective legal reasoning pipeline must combine semantic retrieval with an information extraction layer capable of converting textual norms into machine-readable structures.
This distinguishes the present work from retrieval-centered pipelines, because semantic search is treated here as an evidence selection module rather than the final output of the system.
Similarly, recent retrieval-focused resources such as COLIEE, ECtHR-PCR, and RAG-oriented legal systems emphasize evidence discovery and answer support, whereas the present study treats retrieval as one stage in a broader decision-oriented architecture with an explicit formal intermediate model [7].
2.3. Combinatorial Modeling for Decision-Oriented Legal Tasks
Many legal and policy tasks can be viewed as combinatorial decision problems rather than as problems of language understanding alone, which creates a natural link between Legal NLP and formal decision modeling. Eligibility determination, compliance verification, requirement coverage, conflict-free selection, and resource-constrained policy decisions frequently depend on whether a given set of conditions can be jointly satisfied, whether a minimal set of requirements covers all obligations, or whether incompatible alternatives can be excluded. In terms of theoretical computer science, such tasks are closely related to the canonical NP-complete problems presented by Stephen Cook and Richard Karp [8]. Set cover, set packing, subset sum, partition, independent set, and vertex cover provide useful formal models for requirement satisfaction, conflict resolution, threshold-based selection, and coverage optimization.
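As a minimal illustration of how requirement coverage can be read as a set cover instance, the sketch below applies the standard greedy approximation heuristic. The obligation labels and evidence documents are hypothetical and are not taken from the study's corpus, and the greedy heuristic is a textbook technique used here for illustration, not necessarily the solver used in the framework.

```python
from typing import Dict, List, Set


def greedy_set_cover(universe: Set[str], subsets: Dict[str, Set[str]]) -> List[str]:
    """Return the names of the subsets chosen by the greedy set-cover heuristic."""
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        # Pick the subset that covers the most still-uncovered obligations.
        # Iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(subsets), key=lambda name: len(subsets[name] & uncovered))
        gained = subsets[best] & uncovered
        if not gained:
            raise ValueError(f"obligations cannot be covered: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical obligations and the evidence documents that satisfy them.
obligations = {"residence", "income", "identity"}
evidence = {
    "utility_bill": {"residence"},
    "tax_return": {"income", "residence"},
    "passport": {"identity"},
}
selected = greedy_set_cover(obligations, evidence)
print(selected)
```

In this toy instance two documents suffice to cover all three obligations. The greedy heuristic does not guarantee a minimum cover in general, which is consistent with the NP-completeness of the underlying decision problem.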
This observation is important because it links unstructured legal text with formal computational reasoning. The literature on NP-complete problems has generally concentrated on abstract decision formulations in which the formal input is already known [9]. In contrast, Legal NLP research frequently ends at retrieval or extraction without moving on to formal problem design. As a result, the connection between legal text analytics and classical combinatorial modeling remains underdeveloped. Although some research has focused on computational legal formalization and rule-based compliance systems, there has been little work on integrated frameworks that start with raw legal or policy text, retrieve relevant evidence semantically, extract operational constraints, and then map those constraints to standard NP-complete formulations [10].
This gap is especially visible in policy-oriented intelligent systems. In many applied settings, the task is not only to identify relevant legal passages, but also to construct a formal representation of the decision problem implied by those passages. Such a representation is necessary for clear and reproducible reasoning, especially when multiple rules interact through dependencies, exceptions, and thresholds. Without this intermediate formalization stage, retrieval and extraction remain informative but insufficient for systematic decision support.
The literature therefore points to the need for a unified framework that connects semantic retrieval, legal and policy information extraction, formal constraint representation, and combinatorial decision modeling [11]. Semantic retrieval identifies the most relevant legal or policy fragments; information extraction converts those fragments into structured normative elements; formal modeling represents those elements as explicit constraints; and combinatorial formulations provide a rigorous basis for reasoning over such constraints. The present study is positioned at this intersection and aims to combine these components within a single evaluated framework.
The main methodological difference of the proposed framework is that it connects semantic retrieval, structured legal extraction, and canonical NP-complete modeling within a single evaluated pipeline. The contribution therefore lies not in proposing a new NP-complete formulation, but in operationalizing a bridge between legal text analytics and classical combinatorial problem structure.
3. Materials and Methods
3.1. General Research Design
The present study is designed as an applied computational framework for transforming unstructured legal and policy documents into formal decision models suitable for constraint-aware reasoning. The methodological objective is not limited to text retrieval or isolated information extraction. Instead, the proposed framework integrates semantic retrieval, Legal NLP, structured policy information extraction, and formal combinatorial modeling within a single processing pipeline [12].
The overall workflow is organized as a sequence of five main stages. First, a corpus of legal and policy documents is collected and preprocessed in order to obtain clean textual units suitable for semantic indexing. Second, MPNet-based dense retrieval is applied to identify the most relevant passages for a given legal or policy query. Third, the retrieved passages are processed through a Legal NLP layer that extracts policy-relevant information, including obligations, prohibitions, exceptions, eligibility conditions, thresholds, and actor-specific constraints. Fourth, the extracted information is normalized into a structured representation that can be interpreted as a set of formal constraints. Fifth, these constraints are mapped to canonical NP-complete problem formulations introduced by Stephen Cook and Richard Karp, particularly set cover, set packing, subset sum, partition, independent set, and vertex cover [13].
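To make the fifth stage more concrete, the following sketch verifies a candidate solution for a conflict-free selection task modeled as an independent set problem. It is an illustrative assumption rather than the framework's implementation: the is_conflict_free helper, the option names, and the conflict pairs are hypothetical. The sketch only checks a proposed selection, which is the polynomial-time verification step characteristic of NP problems, and does not search for an optimal one.

```python
from itertools import combinations
from typing import Iterable, Set, Tuple


def is_conflict_free(selection: Iterable[str], conflicts: Set[Tuple[str, str]]) -> bool:
    """Check that no two selected options are joined by a conflict edge."""
    # Store each conflict as an unordered pair so that edge direction is irrelevant.
    edges = {frozenset(pair) for pair in conflicts}
    # An independent set contains no pair of vertices connected by an edge.
    return all(frozenset(pair) not in edges for pair in combinations(set(selection), 2))


# Hypothetical conflict graph: an edge means two options cannot be chosen together.
conflicts = {("grant_a", "grant_b"), ("grant_b", "grant_c")}

ok = is_conflict_free(["grant_a", "grant_c"], conflicts)
bad = is_conflict_free(["grant_a", "grant_b"], conflicts)
print(ok, bad)
```

Here the first selection is accepted because its two options share no conflict edge, while the second is rejected.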
In methodological terms, the framework treats legal and policy reasoning as a transformation problem from unstructured text to machine-interpretable decision structures. This makes it possible to connect the semantic flexibility of transformer-based retrieval with the formal rigor of combinatorial decision modeling.
At the implementation level, the framework consists of a corpus preparation stage, an MPNet-based retrieval stage, a structured extraction stage, a formal mapping stage, and a task-specific decision layer. Each stage produces an intermediate representation that is used as input for the next stage [14].
The overall processing logic of the proposed framework is summarized in Algorithm 1.
Algorithm 1. End-to-End Legal-to-Decision Pipeline

Input: corpus D, query q, top-K value K

Output: final structured decision O(q)

1. Preprocess the corpus D.
2. Segment documents into passages P.
3. Encode q and P using MPNet.
4. Rank passages by cosine similarity.
5. Select top-K passages.
6. Extract structured policy constraints.
7. Normalize extracted constraints into C(q).
8. Map C(q) to a canonical NP-complete formulation.
9. Construct the formal instance I(q).
10. Solve or verify the instance.
11. Return O(q).
This architectural decomposition is methodologically important because it separates semantic matching from formal reasoning while preserving a clear transformation path from unstructured legal text to machine-interpretable decision models. As a result, the framework supports both transparency and reproducibility, since each intermediate representation can be inspected independently [15,16].
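The staged design can be sketched in a few lines of Python. The sketch is illustrative only: the run_pipeline helper, the stage names, and the toy lambda stages are hypothetical stand-ins and do not reproduce the retrieval, extraction, or mapping components of the framework. It shows only how recording each intermediate representation keeps every stage independently inspectable.

```python
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(initial: Any, stages: List[Stage]) -> Dict[str, Any]:
    """Apply the stages in order and record every intermediate result by name."""
    trace: Dict[str, Any] = {"input": initial}
    current = initial
    for name, stage in stages:
        current = stage(current)
        trace[name] = current  # kept so that each representation can be inspected
    return trace


# Toy stand-ins for the real stages (all hypothetical).
stages: List[Stage] = [
    ("passages", lambda text: [s.strip() for s in text.split(".") if s.strip()]),
    ("evidence", lambda passages: [p for p in passages if "must" in p]),
    ("constraints", lambda evidence: [{"operator": "must", "text": p} for p in evidence]),
]
trace = run_pipeline("Applicants must file a form. Fees may vary.", stages)
print(trace["constraints"])
```

Because the trace is keyed by stage name, a reviewer can compare, for example, the retrieved evidence with the constraints derived from it without rerunning earlier stages.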
3.2. Corpus of Legal and Policy Documents
The input data of the proposed framework consist of legal and policy documents expressed in natural language. Depending on the application setting, such documents may include statutes, administrative regulations, policy guidelines, institutional rules, compliance instructions, eligibility criteria, or internal governance documents. Since these sources are typically heterogeneous in structure and writing style, the first methodological step consists of standardizing them into a unified textual corpus [17].
Document preprocessing includes format conversion, text normalization, sentence segmentation, paragraph segmentation, and the removal of non-informative structural elements when necessary. At this stage, each document is transformed into a collection of textual units that can serve as retrieval candidates [18]. The segmentation granularity may be adapted to the legal genre, but in the general case, the framework operates on paragraph-level or passage-level units in order to preserve semantic coherence while avoiding excessively long text fragments.
Formally, let the corpus be denoted as in Equation (1):

$$D = \{d_1, d_2, \ldots, d_N\} \qquad (1)$$

where each $d_i$ is a legal or policy document and $N$ is the number of documents. After segmentation, each document is represented as a set of textual passages, as in Equation (2):

$$d_i = \{p_{i,1}, p_{i,2}, \ldots, p_{i,m_i}\} \qquad (2)$$

where $m_i$ is the number of passages in $d_i$. The full set of retrieval units is therefore given by Equation (3):

$$P = \bigcup_{i=1}^{N} d_i \qquad (3)$$

This representation allows the framework to operate over semantically meaningful passages rather than entire documents, which is particularly important in legal and policy analysis because relevant conditions are often localized in specific fragments rather than uniformly distributed across a full text [19].
For example, a passage may take the following form: “Applicants must provide proof of residence, proof of income, and a valid identification document before the filing deadline. Applicants with incomplete residency status are not eligible.” In the framework, this passage is treated as a single retrieval unit because it expresses a coherent set of related eligibility and exclusion conditions.
3.2.1. Corpus Composition and Sources
The experimental corpus was designed to support retrieval, structured extraction, and decision-oriented formalization over legal and policy text. It consisted of documents collected from four source categories: statutory and constitutional texts, administrative regulations, institutional policy documents, and eligibility- or compliance-oriented guidance materials. After deduplication and preprocessing, the corpus contained 1284 documents, which were segmented into 9736 passages for retrieval and downstream analysis.
The corpus used in the experiments included documents in Kazakh and Russian, with a smaller number of supporting administrative materials in English used for methodological testing and interface illustration.
The corpus was assembled from publicly accessible statutory, regulatory, and policy materials rather than from a previously released benchmark collection.
The source composition of the corpus was as follows: 312 statutory or constitutional documents, 428 administrative and regulatory documents, 267 institutional policy and governance documents, and 277 eligibility, procedural, or compliance guidance documents. The corpus was primarily passage-oriented, since relevant legal conditions were often localized in bounded textual fragments rather than uniformly distributed across full documents.
After segmentation, the average document length was 7.58 passages, and the median passage length was 94 tokens. The interquartile range of passage length was 61–138 tokens, which was considered suitable for semantic retrieval because it preserved local legal context while avoiding excessively long units that could weaken relevance ranking.
For evaluation, an annotated subset of the corpus was constructed at two levels. First, a retrieval benchmark was created from 240 legal-policy queries, linked to 1126 relevant passages and 2984 non-relevant candidate passages. Second, a structured extraction benchmark was prepared over 1540 annotated constraint instances, including eligibility, exclusion, threshold, temporal, dependency, and conflict constraints. These annotations were then used to evaluate retrieval quality, extraction performance, and end-to-end decision construction.
The final corpus design was intended to balance document diversity with structural consistency. In particular, the inclusion of both normative legal texts and applied policy documents made it possible to evaluate the proposed framework not only on formal legal language, but also on operational rule settings in which decision support is practically required.
This corpus composition provides the empirical basis for the retrieval, extraction, and formal modeling stages described in the following sections.
3.2.2. Annotation Protocol
The annotation protocol was designed to support evaluation at both retrieval and extraction levels. Annotation was conducted in two stages. In the first stage, passage-level relevance labels were assigned for retrieval evaluation. In the second stage, constraint-level labels were assigned for structured policy information extraction and downstream formal modeling.
For the retrieval benchmark, each query was paired with a candidate set of passages sampled from the segmented corpus. A passage was labeled as relevant if it contained information necessary for answering the legal-policy query, establishing a required condition, identifying an applicable exclusion or exception, or contributing directly to the construction of the downstream formal decision instance. Passages that were topically related but did not provide decision-relevant evidence were labeled as non-relevant. This protocol yielded 240 annotated queries, 1126 relevant passages, and 2984 non-relevant candidate passages.
For the extraction benchmark, annotation was performed over passages retrieved or selected as policy-relevant. Each annotated instance was assigned to one of six constraint categories: eligibility, exclusion, threshold, temporal, dependency, and conflict. In addition to category labeling, annotators marked the core structural fields required by the framework: the regulated subject, the normative operator or relation, the target object or value, and any contextual qualifier when present. This stage produced 1540 annotated constraint instances distributed across the six categories.
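One possible in-memory representation of such an annotated instance is sketched below. The Constraint class, its field names, and the operator string are illustrative assumptions and not the study's actual annotation schema; only the six category names and the four structural fields come from the protocol described above. The example values paraphrase the exclusion condition from the sample passage in Section 3.2.

```python
from dataclasses import dataclass
from typing import Optional

# The six constraint categories named in the annotation protocol.
CATEGORIES = {"eligibility", "exclusion", "threshold", "temporal", "dependency", "conflict"}


@dataclass(frozen=True)
class Constraint:
    """Hypothetical record for one annotated constraint instance."""
    category: str                    # one of CATEGORIES
    subject: str                     # the regulated subject
    operator: str                    # the normative operator or relation
    target: str                      # the target object or value
    qualifier: Optional[str] = None  # contextual qualifier, when present

    def __post_init__(self) -> None:
        # Reject labels outside the six annotated categories.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown constraint category: {self.category}")


example = Constraint(
    category="exclusion",
    subject="applicant",
    operator="not_eligible_if",  # hypothetical operator label
    target="incomplete residency status",
)
print(example)
```

Making the record immutable and validating the category at construction time keeps malformed annotations from propagating into the formal mapping stage.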
The annotation process involved three annotators with background familiarity in legal-policy document analysis and computational text processing. Annotation guidelines were developed in advance and refined through pilot labeling on a small calibration subset of the corpus. Disagreements were resolved through discussion and adjudication, and the final label set was reviewed for consistency before being used in the reported experiments.
To assess annotation reliability, a randomly selected subset comprising approximately 15% of the annotated material was independently labeled by more than one annotator. On this subset, the inter-annotator agreement reached 0.84 Cohen’s kappa for passage-level relevance and 0.81 Cohen’s kappa for constraint-category assignment. These values indicate substantial agreement and support the reliability of the annotation protocol for retrieval and extraction evaluation. The annotation guidelines specified category definitions, boundary criteria, and the treatment of exceptions, cross-references, and nested conditions.
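For readers unfamiliar with the statistic, the sketch below computes Cohen's kappa for two annotators from its standard definition: observed agreement corrected for the agreement expected by chance. The label sequences are toy values chosen for illustration and are not the study's annotation data, and the cohen_kappa function name is hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy relevance labels from two hypothetical annotators.
annotator_1 = ["rel", "rel", "non", "non"]
annotator_2 = ["rel", "non", "non", "non"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))  # 0.5
```

In this toy case the annotators agree on three of four items (0.75), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.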
The resulting annotations served three purposes within the study. First, they provided the ground truth for evaluating semantic retrieval. Second, they supported category-level and overall extraction scoring. Third, they enabled the validation of end-to-end decision construction by linking textual evidence, structured constraints, and formal NP-complete problem instances within a shared evaluation framework.
The annotated corpus therefore provides a consistent empirical basis for measuring not only evidence retrieval and structured extraction, but also the correctness of the downstream formalization process.
3.2.3. Data Split and Evaluation Partitions
The annotated material was divided into development and evaluation partitions. The retrieval benchmark was split into 60 development queries and 180 evaluation queries. The extraction benchmark was divided into 1210 training and development instances and 330 held-out evaluation instances. The end-to-end task analysis was conducted on 379 formalized task instances, grouped into eligibility assessment, compliance verification, requirement coverage, and conflict-free selection. The same evaluation partitioning was preserved across baseline and full-framework comparisons.
3.3. Query Representation and Task Definition
The framework is intended for decision-oriented legal and policy tasks, such as eligibility assessment, compliance verification, requirement coverage, conflict-free selection, and policy-based resource allocation. Each task is formulated through a query or case description that expresses the information need of the system.
Let the query be denoted as in Equation (4):

$$q \in Q \qquad (4)$$

where $Q$ is the set of all legal-policy queries under consideration. A query may represent a textual question, a case description, a decision request, or a structured policy objective. The purpose of the retrieval stage is to identify those passages $p \in P$ that are semantically relevant to $q$, while the purpose of the extraction stage is to identify the normative content embedded in those passages.
Representative queries included:
- (i) “What is the minimum evidence set required to establish eligibility for the procedure?”
- (ii) “Which conditions exclude an applicant from eligibility?”
- (iii) “What documents must be submitted before the filing deadline?”
- (iv) “Which requirements must be jointly satisfied for compliance approval?”
The methodological assumption is that legal and policy decision support requires not simply retrieving similar text but constructing a formal model of the constraints implied by the relevant passages. Therefore, the query acts as the entry point to the entire pipeline, linking the corpus side of the problem to the subsequent stages of extraction and combinatorial modeling [20].
3.4. Passage Segmentation and Chunking Strategy
Because legal and policy documents are often long and structurally heterogeneous, the segmentation of documents into retrieval units is a critical methodological step. The quality of semantic retrieval depends not only on the embedding model, but also on the granularity of the textual fragments used for indexing [21]. Excessively short segments may lose context, whereas excessively long segments may dilute the relevance signal and reduce retrieval precision. In the proposed framework, document segmentation is performed at the passage level, where a passage may correspond to a paragraph, a logically coherent multi-sentence unit, or a sliding window over adjacent sentences. The choice of passage-level chunking is motivated by two considerations. First, many legal and policy provisions are expressed within bounded textual regions that preserve internal coherence. Second, passage-level indexing allows semantically relevant fragments to be retrieved without requiring the entire source document to be processed as a single unit [22,23].
The passage concept used in this subsection is the same as the one introduced in Section 3.2. At the retrieval stage, a passage is not a markup layer or a fully structured object. It is a bounded textual unit derived from the source document, typically a paragraph or a short multi-sentence segment, which is converted into a structured representation only after the extraction stage.
A simplified example of the preprocessing and passage segmentation procedure is shown below in Listing 1. The code illustrates how raw legal text can be normalized and divided into retrieval units while preserving local semantic coherence.
Listing 1. Example of passage segmentation and preprocessing.

    import re
    from typing import List


    def normalize_text(text: str) -> str:
        # Collapse every run of whitespace (spaces, tabs, line breaks) into one space.
        return re.sub(r"\s+", " ", text).strip()


    def segment_passages(text: str, max_sentences: int = 3) -> List[str]:
        text = normalize_text(text)
        # Split after sentence-final punctuation that is followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        passages = []
        # Group consecutive sentences into non-overlapping passage-sized chunks.
        for i in range(0, len(sentences), max_sentences):
            chunk = " ".join(sentences[i:i + max_sentences]).strip()
            if chunk:
                passages.append(chunk)
        return passages

In Listing 1, re refers to the regular-expression module of the Python standard library (Python 3.12). The sub() function is used to remove or normalize unwanted text patterns, while split() separates the text into sentence-like units. The final join() step combines neighboring sentences into passage-sized chunks suitable for retrieval.
This preprocessing logic reflects the passage-level strategy used in the proposed framework. The goal is to avoid excessively short fragments that lose context and excessively long fragments that weaken the retrieval signal.
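Listing 1 groups sentences into non-overlapping chunks, whereas this section also mentions a sliding window over adjacent sentences as a possible passage definition. The following sketch shows one way such a window could be implemented. The sliding_passages function and its window and stride parameters are assumptions introduced for illustration, and the default values are not the settings used in the experiments.

```python
import re
from typing import List


def sliding_passages(text: str, window: int = 3, stride: int = 2) -> List[str]:
    """Build overlapping passages from adjacent sentences."""
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    # Same sentence-splitting rule as in Listing 1; empty strings are dropped.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    passages: List[str] = []
    for start in range(0, len(sentences), stride):
        passages.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # the current window already reaches the final sentence
    return passages


demo = sliding_passages("A. B. C. D. E.", window=3, stride=2)
print(demo)  # ['A. B. C.', 'C. D. E.']
```

With a stride smaller than the window, neighboring passages share sentences, so a condition and its exception that straddle a chunk boundary can still appear together in at least one retrieval unit, at the cost of a larger index.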
3.5. MPNet-Based Semantic Retrieval
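The retrieval component of the framework is based on MPNet sentence embeddings. Each query $q$ and each passage $p \in P$ are transformed into dense vector representations in a shared semantic embedding space. Let Equation (5):

$$\mathbf{e}_q = \mathrm{MPNet}(q) \in \mathbb{R}^{n}, \qquad \mathbf{e}_p = \mathrm{MPNet}(p) \in \mathbb{R}^{n} \qquad (5)$$

denote the MPNet-based embeddings of the query and the passage, respectively, where $n$ is the dimensionality of the embedding space.

Semantic relevance is computed using the cosine similarity in Equation (6):

$$\mathrm{sim}(q, p) = \frac{\mathbf{e}_q \cdot \mathbf{e}_p}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_p \rVert} \qquad (6)$$

For a given query $q$, the retrieval system ranks all candidate passages $p \in P$ according to the value of $\mathrm{sim}(q, p)$. The top-$K$ passages are then selected as the evidence set in Equation (7):

$$E_K(q) = \{p_{(1)}, p_{(2)}, \ldots, p_{(K)}\} \qquad (7)$$

where the selected passages satisfy the ordering in Equation (8):

$$\mathrm{sim}(q, p_{(1)}) \ge \mathrm{sim}(q, p_{(2)}) \ge \cdots \ge \mathrm{sim}(q, p_{(K)}) \qquad (8)$$

The role of this stage is to identify semantically relevant legal or policy fragments even when exact lexical overlap is limited.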
This is especially important for legal and regulatory corpora, where conceptually similar conditions may be expressed through different terminological forms. The resulting evidence set serves as the input to the Legal NLP and policy information extraction layer.
The passage-ranking procedure used in the retrieval layer is formalized in Algorithm 2.
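Algorithm 2. MPNet-Based Legal Passage Retrieval
1. Compute the query embedding $\mathbf{e}_q$ using MPNet.
2. Compute the embedding $\mathbf{e}_{p_i}$ of each candidate passage $p_i$.
3. Compute the cosine similarity $\mathrm{sim}(q, p_i)$ for each passage.
4. Rank passages by similarity.
5. Return the top-$k$ passages.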
This retrieval algorithm identifies semantically relevant legal passages even when exact lexical overlap is weak. It is particularly suitable for legal and policy corpora, where equivalent obligations and exceptions are often expressed using different wording. This procedure ensures that the top-$k$ evidence set is selected according to semantic similarity in the shared embedding space [24,25].
The following code fragment in Listing 2 illustrates the retrieval logic implemented in the semantic search layer. It encodes the query and candidate passages in a shared embedding space and ranks them by cosine similarity.
Listing 2. Example of MPNet retrieval with cosine similarity.
from sentence_transformers import SentenceTransformer, util

# Load the pretrained MPNet sentence-embedding model once and reuse it.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")


def retrieve_top_k(query: str, passages: list[str], k: int = 5):
    # Encode the query and all passages into the shared embedding space.
    query_emb = model.encode(query, convert_to_tensor=True)
    passage_embs = model.encode(passages, convert_to_tensor=True)
    # cos_sim returns a 1 x len(passages) matrix; take its only row.
    scores = util.cos_sim(query_emb, passage_embs)[0]
    # Sort (passage, score) pairs from most to least similar.
    ranked = sorted(
        zip(passages, scores.tolist()),
        key=lambda x: x[1],
        reverse=True,
    )
    return ranked[:k]
In Listing 2, the encode() function converts a query or passage into a dense numerical vector in the MPNet embedding space. Cosine similarity is then used to compare these vectors and rank passages by semantic relevance. Listings 1 and 2 are simplified illustrations of the processing logic rather than complete implementations, and they are not intended to require programming knowledge from the reader.
This snippet shows the core mechanism of the retrieval stage. The query and legal-policy passages are embedded in the same semantic space, and the top-$k$ passages are selected according to similarity score. In the full framework, this evidence set is passed to the extraction and formalization layers.
Retrieval Configuration
The dense retrieval layer used the sentence-transformers/all-mpnet-base-v2 model to encode both queries and passages into a shared embedding space of 768 dimensions. Passage embeddings were computed offline and stored for retrieval. Candidate ranking was first performed using cosine similarity, after which an optional reranking stage was applied to the top-ranked candidates.
The value of $k$ was selected on the development partition by comparing candidate values. The best trade-off between evidence recall and extraction stability was obtained at $k = 5$, which was therefore used in all reported experiments.
Lexical baselines included keyword matching, TF-IDF cosine similarity, and BM25. In the reranking condition, the top-5 passages were reordered using a lightweight cross-encoder scoring stage. All compared variants used the same passage segmentation and evaluation partitions.
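To make the strongest lexical baseline concrete, the following is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the implementation used in the experiments: the whitespace tokenizer, the function names, and the parameter values k1 = 1.5 and b = 0.75 are assumptions chosen for illustration, and a Kazakh or Russian corpus would require language-aware tokenization.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive lowercase whitespace tokenizer (illustrative assumption).
    return text.lower().split()


def bm25_scores(query: str, passages: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per passage for the given query."""
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always non-negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with passage-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because the score depends only on shared tokens, a passage that paraphrases the query without repeating its words receives a score of zero, which is the limitation that dense retrieval is intended to address.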
3.6. Legal NLP and Policy Information Extraction
After semantic retrieval, the selected passages are processed through a Legal NLP layer whose purpose is to extract structured normative information. In the present framework, extraction is centered on the elements that are most relevant for downstream decision modeling [26,27]. These include obligations, prohibitions, permissions, eligibility criteria, exceptions, quantitative thresholds, temporal constraints, actor-role relations, and document or evidence requirements.
Each retrieved passage $p_i$ is analyzed in order to identify one or more structured policy elements. Let the extracted element be represented as Equation (9):
$$x_j = (t_j, a_j, o_j, v_j, \kappa_j) \tag{9}$$
where $t_j$ is the type of normative element, $a_j$ is the actor or subject, $o_j$ is the operator or relation, $v_j$ is the value or target object, and $\kappa_j$ is the contextual condition under which the element applies. The extracted set of policy elements for a query $q$ is denoted by Equation (10):
$$X(q) = \{x_1, x_2, \ldots, x_m\} \tag{10}$$
In practical terms, this stage transforms semantically relevant but still unstructured text into a structured set of decision-relevant units. For example, the clause “The applicant must submit proof of residence, proof of income, and a valid identification document before the filing deadline, except in cases of emergency processing” would be converted into explicit elements corresponding to obligation, required evidence, temporal condition, and exception [28,29]. This representation is crucial because downstream combinatorial reasoning cannot operate directly on free text. It requires explicit units that can be interpreted as formal constraints.
To make the extracted information suitable for downstream reasoning, textual policy elements must be converted into a structured representation. A simplified normalization example is given below in Listing 3.
Listing 3. Example of constraint extraction and normalization.
from dataclasses import dataclass
from typing import Optional


@dataclass
class PolicyConstraint:
    ctype: str        # constraint category, e.g. "eligibility"
    subject: str      # regulated actor
    operator: str     # normative relation
    value: str        # required object, threshold, or condition
    condition: Optional[str] = None  # contextual qualifier, if any


example_constraint = PolicyConstraint(
    ctype="eligibility",
    subject="applicant",
    operator="must_submit",
    value="proof_of_residence",
    condition="before_deadline",
)
In the example shown in Listing 3, ctype denotes the constraint category, such as eligibility, exclusion, threshold, temporal, dependency, or conflict. The subject field identifies the regulated actor, such as applicant, agency, or institution. The operator field expresses the normative relation, for example, must_submit, not_eligible, before, or requires. The value field stores the required object, threshold, or condition, while condition captures any contextual qualifier when present.
This structured representation makes it possible to preserve the main legal-policy meaning of a passage while transforming it into a machine-interpretable constraint object. Such objects can then be mapped to formal decision models.
3.7. Constraint Taxonomy
A central methodological component of the proposed framework is the classification of extracted legal and policy information into a structured taxonomy of constraints. This step is necessary because legal text contains multiple forms of normative content, and these forms cannot be treated as a single undifferentiated category if the downstream objective is formal decision modeling.
Because several technical categories recur across the framework, Table 2 provides short definitions of the main legal-policy elements used in retrieval, extraction, and formal modeling.
These definitions are intended to make the terminology of the framework more accessible, especially for readers who are not specialists in legal NLP or formal decision modeling.
In the proposed framework, extracted constraints are divided into six main categories. The first category includes eligibility constraints, which specify whether a subject satisfies the conditions required for inclusion, access, or participation. The second category includes exclusion constraints, which define disqualifying conditions or exceptions that prevent eligibility [30]. The third category includes threshold constraints, which impose numerical limits or target values, such as minimum income, maximum budget, or age limits. The fourth category includes temporal constraints, which specify deadlines, validity periods, waiting intervals, or event ordering requirements. The fifth category includes dependency constraints, which indicate that one requirement or action is conditional on another. The sixth category includes conflict constraints, which represent mutual incompatibility between legal options, obligations, or policy conditions.
Formally, each extracted constraint is represented as Equation (11):
$$c_i = (t_i, s_i, r_i, v_i, \kappa_i) \tag{11}$$
where $t_i$ denotes the constraint type, $s_i$ denotes the subject or regulated entity, $r_i$ denotes the operator or relation, $v_i$ denotes the object, value, or target condition, and $\kappa_i$ denotes the contextual qualifier. This representation allows the framework to preserve the semantic diversity of legal norms while converting them into a unified formal structure suitable for combinatorial analysis. The proposed taxonomy also improves interpretability because it makes explicit which type of legal-policy condition is being modeled at each stage of the pipeline. This is particularly important in legal and regulatory settings, where exceptions, dependencies, and temporal conditions often play a decisive role in the final decision.
3.8. Constraint Normalization and Formal Representation
The extracted policy elements are further normalized into a formal constraint representation. Let the normalized constraint set associated with query $q$ be denoted by Equation (12):
$$C(q) = \{c_1, c_2, \ldots, c_n\} \tag{12}$$
Each constraint $c_i$ is represented in predicate-like form, as in Equation (13):
$$c_i = r_i(s_i, v_i) \tag{13}$$
where $s_i$ is the regulated entity or attribute, $r_i$ is the relation or operator, and $v_i$ is the required value, resource, category, threshold, or linked condition.
Depending on the policy text, constraints may encode inclusion, exclusion, threshold satisfaction, coverage requirements, compatibility requirements, or conflict relations. For example, a mandatory requirement may be represented as an inclusion constraint, an exception may be represented as a conditional negation constraint, a threshold rule may be represented as an inequality, and an incompatibility relation may be represented as a conflict edge between two candidate elements [31].
At this stage, the textual meaning of the legal or policy fragments is transferred into a machine-interpretable form. The formal representation is designed to preserve enough structure for downstream mapping to canonical NP-complete formulations while remaining sufficiently general to cover different legal-policy domains.
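The predicate-like form described above can be illustrated with a small evaluator that checks normalized constraints against a set of known facts about a case. This is a sketch under stated assumptions rather than the framework's implementation: the class name NormalizedConstraint, the relation names in RELATIONS, the negated flag used for exception-style constraints, and the rule that missing evidence counts as unsatisfied are all hypothetical choices made for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical relation names mapped to Python comparison functions.
RELATIONS: dict[str, Callable[[Any, Any], bool]] = {
    "eq": operator.eq,
    "ge": operator.ge,   # threshold rule expressed as an inequality
    "le": operator.le,
    "in": lambda value, allowed: value in allowed,
}


@dataclass(frozen=True)
class NormalizedConstraint:
    subject: str            # regulated entity or attribute
    relation: str           # key into RELATIONS
    value: Any              # required value, threshold, or category
    negated: bool = False   # True for exception-style (negated) constraints


def holds(constraint: NormalizedConstraint, facts: dict[str, Any]) -> bool:
    """Evaluate one constraint of the form relation(subject, value)."""
    if constraint.subject not in facts:
        # Assumption: missing evidence is treated as an unsatisfied constraint.
        return False
    result = RELATIONS[constraint.relation](facts[constraint.subject],
                                            constraint.value)
    return not result if constraint.negated else result


def jointly_satisfied(constraints: list[NormalizedConstraint],
                      facts: dict[str, Any]) -> bool:
    """Check whether all constraints hold simultaneously."""
    return all(holds(c, facts) for c in constraints)
```

A threshold rule such as a maximum income becomes NormalizedConstraint("monthly_income", "le", 1000), while a disqualifying status becomes an "eq" constraint with negated=True.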
3.9. Mapping to Canonical NP-Complete Problem Formulations
The central methodological step of the framework is the mapping of the normalized constraint set to one or more canonical NP-complete problem formulations. This step provides the formal computational basis for decision-oriented reasoning over legal and policy texts.
3.9.1. Set Cover Formulation
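The set cover formulation is used when the decision task requires identifying a minimum set of rules, documents, actors, resources, or conditions that collectively satisfy all required policy elements. Let Equation (14):
$$U = \{u_1, u_2, \ldots, u_m\} \tag{14}$$
be the universe of required policy conditions, and let Equation (15):
$$\mathcal{S} = \{S_1, S_2, \ldots, S_n\}, \qquad S_j \subseteq U \tag{15}$$
be a family of subsets such that each $S_j$ covers part of the requirement set. The objective is to identify a subfamily $\mathcal{S}' \subseteq \mathcal{S}$ such that Equation (16):
$$\bigcup_{S_j \in \mathcal{S}'} S_j = U \tag{16}$$
and $|\mathcal{S}'|$ is minimized.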
In legal and policy applications, this formulation is appropriate for requirement coverage, minimal document selection, or minimal evidence satisfaction.
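As an illustration of requirement coverage, the sketch below applies the standard greedy heuristic for set cover to a small document-selection instance. The greedy procedure is an approximation and does not guarantee a minimum cover in general; it is shown here only because it is short and easy to follow, and the paper does not state which solver the framework uses. The function name and the example document names are hypothetical.

```python
def greedy_set_cover(universe: set[str],
                     family: dict[str, set[str]]) -> list[str]:
    """Greedy approximation: repeatedly pick the subset that covers the
    largest number of still-uncovered requirements."""
    uncovered = set(universe)
    chosen: list[str] = []
    while uncovered:
        # Iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(family),
                   key=lambda name: len(family[name] & uncovered),
                   default=None)
        if best is None or not (family[best] & uncovered):
            raise ValueError("universe cannot be covered by the given family")
        chosen.append(best)
        uncovered -= family[best]
    return chosen


# Hypothetical instance: which documents cover all required conditions?
required = {"residence", "income", "identity"}
documents = {
    "passport": {"identity"},
    "utility_bill": {"residence"},
    "tax_return": {"income", "residence"},
    "bank_statement": {"income"},
}
selection = greedy_set_cover(required, documents)
```

In this instance the heuristic first selects the document that covers two conditions and then the one that covers the remaining condition, so two documents suffice.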
3.9.2. Set Packing Formulation
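The set packing formulation is used when the task requires selecting a maximum set of mutually compatible options. Given the same family $\mathcal{S}$, the objective is to find a subfamily using Equation (17):
$$\mathcal{S}' \subseteq \mathcal{S} \tag{17}$$
such that Equation (18):
$$S_i \cap S_j = \emptyset \quad \text{for all distinct } S_i, S_j \in \mathcal{S}' \tag{18}$$
while maximizing $|\mathcal{S}'|$.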
In policy-oriented settings, this is useful for conflict-free admissible selection, where two alternatives cannot be chosen together because they violate mutual exclusivity or incompatibility constraints.
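For small instances, set packing can be illustrated by exhaustive search over subfamilies, from the largest size downward. The sketch below is exponential in the number of options and is meant only to make the definition concrete; the function name and the example options, each described by the resources it consumes, are hypothetical.

```python
from itertools import combinations


def max_set_packing(family: dict[str, set[str]]) -> list[str]:
    """Return a largest collection of pairwise disjoint subsets.

    Exhaustive search, suitable only for small illustrative instances.
    """
    names = sorted(family)
    for size in range(len(names), 0, -1):
        for combo in combinations(names, size):
            # A packing requires every pair of chosen subsets to be disjoint.
            if all(family[a].isdisjoint(family[b])
                   for a, b in combinations(combo, 2)):
                return list(combo)
    return []


# Hypothetical options; two options conflict when they share a resource.
options = {
    "grant_a": {"budget_2024", "lab"},
    "grant_b": {"lab"},
    "grant_c": {"travel"},
    "grant_d": {"budget_2024"},
}
packing = max_set_packing(options)
```

Here grant_a conflicts with both grant_b and grant_d, so the largest conflict-free selection contains the other three options.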
3.9.3. Subset Sum and Partition Formulations
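Threshold-based legal and policy tasks may be modeled through subset sum or partition. Let Equation (19):
$$A = \{a_1, a_2, \ldots, a_n\} \tag{19}$$
be a finite set of weighted items, and let $T$ be a target threshold. The subset sum formulation asks whether there exists a subset, as in Equation (20):
$$A' \subseteq A \tag{20}$$
such that Equation (21):
$$\sum_{a_i \in A'} w(a_i) = T \tag{21}$$
where $w(a_i)$ is the weight associated with item $a_i$.
The partition formulation asks whether $A$ can be divided into two disjoint subsets $A_1$ and $A_2$ such that Equation (22):
$$A_1 \cup A_2 = A, \qquad A_1 \cap A_2 = \emptyset \tag{22}$$
and Equation (23):
$$\sum_{a_i \in A_1} w(a_i) = \sum_{a_i \in A_2} w(a_i) \tag{23}$$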
These formulations are relevant when legal or policy decisions involve balancing, threshold satisfaction, or resource allocation under numeric constraints.
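Both formulations can be checked with the classical pseudo-polynomial dynamic program when the weights are non-negative integers. The sketch below makes that assumption explicitly; it is an illustration of the two decision questions, not a description of the solver used in the framework, and the function names are hypothetical.

```python
def subset_sum(weights: list[int], target: int) -> bool:
    """Decide whether some subset of non-negative integer weights sums
    exactly to target (dynamic programming over reachable sums)."""
    if target < 0:
        return False
    # reachable[t] is True when a subset of the items seen so far sums to t.
    reachable = [True] + [False] * target
    for w in weights:
        # Iterate downward so that each item is used at most once.
        for t in range(target, w - 1, -1):
            if reachable[t - w]:
                reachable[t] = True
    return reachable[target]


def can_partition(weights: list[int]) -> bool:
    """Partition reduces to subset sum with target equal to half the total."""
    total = sum(weights)
    return total % 2 == 0 and subset_sum(weights, total // 2)
```

The reduction in can_partition reflects the structural link between the two formulations: a balanced division exists exactly when one side can reach half of the total weight.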
3.9.4. Independent Set and Vertex Cover Formulations
Graph-based formulations are used when the policy problem includes conflict relations or dependency structures. Let Equation (24):
$$G = (V, E) \tag{24}$$
be a graph in which vertices represent candidate legal-policy elements and edges represent incompatibility or conflict. An independent set is a subset, as in Equation (25):
$$I \subseteq V \tag{25}$$
such that no two vertices in $I$ are adjacent. This formulation is suitable for conflict-free admissible selection. A vertex cover is a subset, as in Equation (26):
$$V' \subseteq V \tag{26}$$
such that every edge in $E$ is incident to at least one vertex in $V'$. This formulation is appropriate when the objective is to resolve or cover all conflicts through a minimum controlling set. Together, these mappings provide a formal bridge between extracted legal-policy constraints and the combinatorial structure of decision problems.
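The two graph formulations are closely related: the complement of an independent set is a vertex cover, and the complement of a maximum independent set is a minimum vertex cover. The sketch below illustrates both definitions on a small conflict graph using exhaustive search, which is feasible only for small instances. The function names and the example procedure names are hypothetical.

```python
from itertools import combinations

Edge = tuple[str, str]


def is_independent_set(subset: set[str], edges: list[Edge]) -> bool:
    # No edge may have both endpoints inside the subset.
    return not any(u in subset and v in subset for u, v in edges)


def is_vertex_cover(subset: set[str], edges: list[Edge]) -> bool:
    # Every edge must have at least one endpoint inside the subset.
    return all(u in subset or v in subset for u, v in edges)


def max_independent_set(vertices: list[str], edges: list[Edge]) -> set[str]:
    """Exhaustive search from the largest candidate size downward."""
    for size in range(len(vertices), -1, -1):
        for combo in combinations(sorted(vertices), size):
            if is_independent_set(set(combo), edges):
                return set(combo)
    return set()


# Hypothetical conflict graph: three mutually exclusive procedures and
# one procedure that conflicts with nothing.
procedures = ["fast_track", "standard", "waiver", "appeal"]
conflicts = [("fast_track", "standard"),
             ("fast_track", "waiver"),
             ("standard", "waiver")]
admissible = max_independent_set(procedures, conflicts)
controlling = set(procedures) - admissible  # complement is a vertex cover
```

In this instance at most one of the three mutually exclusive procedures can be selected alongside the unconstrained one, and the two excluded procedures touch every conflict edge.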
The mapping stage can also be described procedurally. The following simplified implementation in Listing 4 shows how extracted constraints may be assigned to a canonical NP-complete formulation according to task structure.
Listing 4. Example of formal mapping to an NP-complete model.
def select_formal_model(constraints: list[dict]) -> str:
    # Collect the distinct constraint types present in the extracted set.
    types = {c["type"] for c in constraints}
    # The checks are ordered: the first matching type decides the model.
    if "coverage" in types:
        return "set_cover"
    if "conflict" in types:
        return "independent_set"
    if "threshold" in types:
        return "subset_sum"
    if "balance" in types:
        return "partition"
    if "exclusion" in types:
        return "set_packing"
    # Fallback when no recognized structure is present.
    return "generic_constraint_model"
In the full framework, the mapping stage is more detailed and depends on the interaction between constraint categories. However, this simplified example makes clear that model selection is driven by the logical structure of the extracted policy conditions rather than by surface text alone.
3.9.5. Mapping Heuristics
The assignment of extracted constraints to a canonical NP-complete formulation followed a rule-based heuristic. Set cover was selected when the task required satisfying all required policy conditions with a minimum set of evidentiary or decision elements. Set packing and independent set were selected when the dominant structure was conflict-free admissible selection. Subset sum was selected for threshold satisfaction problems involving a target numeric value. Partition was used for balancing or allocation problems with symmetric resource division. Vertex cover was selected when the task required covering or controlling all detected conflict edges.
When multiple formulations were plausible, preference was given to the model that minimized representational ambiguity and preserved the dominant decision structure of the task.
3.9.6. Theoretical Justification of the Mapping
The mapping from extracted legal-policy constraints to canonical NP-complete formulations is heuristic in its implementation, but it is not arbitrary. It is grounded in the structural correspondence between recurring legal-policy task patterns and standard combinatorial decision templates. In particular, requirement coverage tasks correspond to set cover because the objective is to select a minimum family of elements that jointly satisfies all required conditions. Conflict-free admissible selection corresponds to set packing or independent set because mutually incompatible alternatives cannot be chosen simultaneously. Threshold-based admissibility corresponds to subset sum when a target value must be reached through a subset of weighted items. Balanced allocation corresponds to partition when the decision requires dividing weighted elements into comparable groups. Conflict-control tasks correspond to vertex cover when all incompatibility edges must be incident to a selected controlling set.
The theoretical claim of the present study is therefore not that legal reasoning as a whole reduces to NP-complete computation, but that a practically important subclass of policy-oriented decision tasks exhibits structural patterns that can be represented through classical combinatorial models. The framework uses this correspondence as an operational formalization principle.
The current framework, therefore, uses these formulations primarily as interpretable structural models for legal-policy decision tasks rather than as a claim of exact complexity-optimal solving of general NP-complete instances.
3.10. End-to-End Decision Model
After the mapping stage, the framework produces a formal problem instance, as in Equation (27):
$$\Pi = (Z, C, M) \tag{27}$$
where $Z$ is the set of candidate decision elements, $C$ is the normalized constraint set, and $M$ is the selected combinatorial modeling scheme [32]. The decision function is represented as Equation (28):
$$D(\Pi) \in \{0, 1\} \tag{28}$$
The abstract transformation implemented by the framework is summarized in Figure 1.
Figure 1 emphasizes that the system is designed not only to retrieve relevant text, but also to construct a machine-interpretable formal decision instance.
For optimization-oriented scenarios, the framework may additionally seek a minimum, maximum, or threshold-satisfying solution depending on the selected formulation. This end-to-end structure is the key methodological contribution of the proposed framework. Instead of treating legal retrieval, policy extraction, and combinatorial modeling as separate tasks, it connects them within a single computational architecture.
3.11. Reproducibility and Methodological Scope
The proposed methodology is intended as a reproducible framework rather than a domain-specific rule set. Its generality lies in the separation between semantic retrieval, structured extraction, and formal combinatorial modeling. The framework can therefore be adapted to different legal and policy domains provided that the document corpus, query set, and annotation or validation protocol are appropriately defined. At the same time, the methodological scope of the framework should be stated clearly. It does not assume that all legal reasoning can be reduced to NP-complete problem solving, nor does it claim that every legal-policy task has a single exact canonical mapping. Instead, it assumes that a substantial subset of policy-oriented decision tasks can be represented through standard combinatorial formulations once the relevant constraints have been extracted from text. In this sense, the framework is designed as an applied computational bridge between Legal NLP and formal decision modeling [33].
The present study does not claim exact complexity-optimal solving of general NP-complete instances. Instead, the framework uses well-known combinatorial formulations as structured decision models for representing legal-policy task patterns. The current emphasis is therefore on formalization and interpretability rather than on worst-case complexity analysis or exact large-scale optimization.
3.11.1. Experimental Setup
The experiments were conducted in three stages: retrieval evaluation, extraction evaluation, and end-to-end task evaluation. Retrieval performance was measured using Precision@5, Recall@5, MRR, and nDCG@5. Extraction quality was measured using precision, recall, and F1-score over constraint annotations. End-to-end performance was measured using accuracy, precision, recall, and F1-score across four task families: eligibility assessment, compliance verification, requirement coverage, and conflict-free selection.
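For readers less familiar with the ranking metrics named above, the following sketch computes them for a single query under binary relevance. Binary relevance and the function names are assumptions made for illustration; MRR is the mean of the reciprocal rank over all evaluation queries, and the paper's exact evaluation code is not reproduced here.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for p in ranked[:k] if p in relevant) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant passages found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for p in ranked[:k] if p in relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, p in enumerate(ranked, start=1):
        if p in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Normalized discounted cumulative gain with binary gains."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, p in enumerate(ranked[:k], start=1) if p in relevant)
    # Ideal ranking places all relevant passages first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

For example, if two of three relevant passages appear at ranks 2 and 4 of a five-item result list, Precision@5 is 0.4, Recall@5 is 2/3, and the reciprocal rank is 0.5.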
Baseline comparisons were performed against lexical retrieval, dense retrieval without extraction, and retrieval-plus-extraction without formal mapping. The same evaluation partitions were used across comparative variants in order to ensure consistency.
All retrieval baselines operated on the same segmented passage corpus. The lexical baselines used the same query set and the same evaluation partitions as the dense retrieval model. For extraction evaluation, only passages available under the corresponding retrieval condition were used, which ensured that downstream comparisons reflected the contribution of each upstream module.
In addition to lexical baselines, the evaluation also considered a transformer-based semantic baseline without task-specific formalization. This baseline used sentence-level dense embeddings for retrieval but did not include the structured extraction and NP-complete mapping stages. Its role was to distinguish the contribution of dense semantic encoding from the contribution of the full decision-oriented architecture.
To assess score stability, the main retrieval and end-to-end metrics were additionally estimated using nonparametric bootstrap resampling over the evaluation partition. Confidence intervals were computed from 1000 bootstrap samples and are reported at the 95% level for the principal comparative results.
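The resampling procedure can be sketched as follows for a metric that is the mean of per-query scores, such as accuracy. The percentile method is assumed here because the text does not specify which bootstrap interval was used; the function name, the fixed seed, and the example data are likewise illustrative.

```python
import random


def bootstrap_ci(values: list[float], n_resamples: int = 1000,
                 level: float = 0.95, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of values."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(values)
    # Resample n items with replacement and record the mean each time.
    means = sorted(sum(rng.choices(values, k=n)) / n
                   for _ in range(n_resamples))
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


# Hypothetical per-query correctness indicators (1.0 = correct decision).
per_query = [1.0] * 60 + [0.0] * 40
low, high = bootstrap_ci(per_query)
```

The interval width shrinks as the evaluation partition grows, which is why the same partition is used for all compared variants.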
3.11.2. Implementation Details
The system was implemented in Python using transformer-based embedding models for dense retrieval and a modular pipeline for extraction and formalization. Sentence-transformer components were used for MPNet encoding, while sparse baselines were implemented using TF-IDF and BM25 retrieval. The formal mapping and decision layer were implemented as deterministic transformation and verification procedures over extracted constraints.
The implementation used fixed random seeds for all stochastic components and shared preprocessing across all compared variants.
3.11.3. End-to-End Walkthrough of a Single Query
To make the processing logic of the framework more transparent, it is useful to summarize how a single query is handled from input to output. Suppose the system receives a policy question asking whether an applicant is eligible for a regulated administrative procedure and which documents are minimally required. In the first stage, the query is matched against the segmented legal-policy corpus through MPNet-based semantic retrieval. In the second stage, the top-ranked passages are passed to the extraction layer, which identifies obligations, exclusions, temporal requirements, and document-related conditions. In the third stage, the extracted elements are normalized into structured constraints. In the fourth stage, these constraints are assigned to the formal decision model that best matches the task structure. In the present example, the task is interpreted as a requirement-coverage problem and is therefore mapped to a set cover formulation. In the final stage, the resulting formal instance is solved or verified, and the system returns an interpretable decision together with the supporting evidence passages.
This walkthrough shows that the framework is not a loose collection of modules, but a sequential transformation from legal text to a structured decision object.
4. Results
This section reports the measured performance of the proposed framework at three levels: semantic retrieval, legal-policy information extraction, and end-to-end formal decision modeling. All experiments reported in this section were conducted on the corpus and annotated evaluation partitions described in Section 3.2.1, Section 3.2.2 and Section 3.2.3, using legal-policy queries in Kazakh and Russian, with English used only in limited methodological testing and interface illustration.
4.1. Retrieval Performance
The first stage of the evaluation concerns the MPNet-based semantic retrieval module. Its objective is to identify the legal and policy passages most relevant to a given query or case description. The effectiveness of this component is measured using ranking-based retrieval metrics, including Precision@K, Recall@K, Mean Reciprocal Rank (MRR), and normalized Discounted Cumulative Gain (nDCG) [34].
The retrieval results are reported against three baselines: keyword matching, TF-IDF cosine similarity, and BM25. The MPNet-based model retrieves conceptually relevant passages more reliably than all lexical baselines, especially when policy conditions are expressed through paraphrase or structurally different wording. This behavior is important in legal-policy corpora, where semantically equivalent obligations and restrictions are often phrased in different ways [35].
The retrieval results show a clear progression from lexical to dense semantic methods. The MPNet-based model outperformed all lexical baselines, improving Precision@5 by 7.4 percentage points over BM25 and increasing MRR from 0.553 to 0.637. The addition of a reranking stage produced further gains, reaching an nDCG@5 of 0.716. This pattern is consistent with recent statutory retrieval studies showing that dense candidate generation and reranking improve top-rank legal retrieval quality over BM25-only pipelines, and with provision-level legal retrieval evidence indicating that MPNet candidate pools can offer stronger recall ceilings than sparse retrieval alone.
To provide a more intuitive comparison of lexical, sparse, and dense retrieval methods, the ranking metrics are visualized in Figure 2.
As shown in Figure 2, dense semantic retrieval consistently outperforms lexical baselines across all evaluated metrics. The strongest overall results are achieved by the MPNet + reranking configuration, which confirms that semantically informed retrieval provides the most reliable evidence base for downstream legal-policy reasoning.
4.2. Policy Information Extraction Results
The second level of evaluation concerns the Legal NLP and policy information extraction layer. The goal of this stage is to convert retrieved passages into structured normative elements such as obligations, prohibitions, thresholds, eligibility conditions, temporal restrictions, dependency relations, and conflict constraints. The quality of this stage is evaluated using precision, recall, and F1-score over annotated or expert-validated extraction targets.
Table 3 reports both overall extraction quality and category-level extraction results. This is methodologically useful because different types of legal information typically vary in extraction difficulty. Obligations and explicit eligibility conditions may be easier to identify than exceptions, temporal clauses, or cross-referenced exclusions. For that reason, it is preferable to present both aggregate and fine-grained results.
The extraction results indicate that the framework performs best on explicit eligibility and threshold expressions, which are often lexically marked and structurally stable. Lower performance is observed on conflict and dependency constraints, which more often depend on implicit wording, cross-reference interpretation, or multi-clause context. This overall profile is realistic for legal-policy IE and aligns with recent work showing that pretrained language models are effective for extracting legal restrictions and eligibility-related constraints, but that more structurally complex categories remain harder.
4.3. Mapping Results to Canonical NP-Complete Formulations
The third stage of the results section concerns the formal mapping of extracted constraints to canonical NP-complete problem classes. At this level, the main objective is to examine whether the legal-policy tasks identified in the corpus can be represented consistently through selected Cook–Karp formulations, particularly set cover, set packing, subset sum, partition, independent set, and vertex cover. A comparison across the formal model families used in the framework also helps explain why some legal-policy tasks are easier to formalize than others, as summarized in Table 4.
The results in this subsection are reported as counts and percentages of cases mapped to each formal model, as shown in Table 5. This helps demonstrate which NP-complete structures occur most frequently in the target domain. In many policy-oriented corpora, requirement coverage and admissible selection are expected to be dominant, which would correspond to set cover and set packing or independent set formulations. Numeric threshold cases may align with subset sum, while conflict-heavy regulatory cases may naturally produce graph-based formulations.
The depth of the mapping analysis can be illustrated by considering how task structure influences formulation choice. Set cover dominated because many policy queries required the joint satisfaction of all mandatory conditions. Graph-based formulations were frequent in cases with explicit incompatibility relations, especially when mutually exclusive administrative options or conflicting rule applications had to be modeled. Threshold and partition formulations appeared less often, but they were particularly important in eligibility, budgeting, and allocation scenarios where numeric conditions constrained admissible outcomes.
The mapping results show that set cover is the dominant formal model, which is expected because many policy and compliance tasks require covering all required obligations with a minimum set of documents, conditions, or evidentiary elements. Graph-based formulations, including independent set and vertex cover, collectively account for 30.8% of cases, indicating that incompatibility and conflict-resolution structures are also common in legal-policy reasoning. Numeric threshold tasks represented by subset sum and partition account for 21.2% of cases.
4.4. End-to-End Decision Performance
The most important level of the evaluation concerned the end-to-end performance of the full system. At this stage, the framework was assessed as a complete decision-support pipeline, beginning with a legal-policy query and ending with a formal decision output. The evaluation covered four representative task families: eligibility assessment, compliance verification, requirement coverage, and conflict-free policy selection (Table 6).
Table 6 shows that the strongest end-to-end performance was achieved in eligibility assessment, where the framework reached an accuracy of 0.872 and an F1-score of 0.871. This result is consistent with the higher extraction performance observed for explicit eligibility and threshold constraints. Compliance verification also performed strongly, with an F1-score of 0.843. Requirement coverage and conflict-free policy selection were more difficult, achieving F1-scores of 0.826 and 0.802, respectively. These tasks depend more heavily on complete evidence retrieval, correct identification of interacting constraints, and accurate mapping to set cover, set packing, independent set, or vertex cover formulations.
The macro-average accuracy of the full framework was 0.836. This result indicates that the proposed architecture was effective not only in retrieving relevant text and extracting policy elements, but also in producing valid downstream decision structures. The end-to-end results across the four evaluated legal-policy task families are summarized visually in Figure 3.
Figure 3 shows that the framework performs best on eligibility assessment and weakest on conflict-free policy selection. This pattern is consistent with the error analysis, since conflict-oriented tasks depend more heavily on correct incompatibility detection and graph-based formalization.
Overall, the results indicate that the framework remains effective not only as a retrieval system but also as a decision-oriented pipeline.
4.5. Comparative Analysis with Baseline Variants
To estimate the contribution of each major framework component, the full system was compared with reduced variants in
Table 7. This analysis was intended to determine whether the integration of dense retrieval, structured extraction, and formal combinatorial modeling produced measurable gains over simpler alternatives. In addition to lexical baselines, a dense transformer retrieval baseline was included in order to separate the effect of semantic encoding from the contribution of structured extraction and formal decision modeling.
The ablation results presented in Table 6 indicate that each component of the architecture contributes to final performance. The transition from lexical retrieval to MPNet retrieval increased end-to-end accuracy from 0.671 to 0.723, showing that semantic evidence retrieval alone improved downstream reasoning. Adding structured extraction without formal mapping further increased accuracy to 0.781, indicating that transforming retrieved passages into explicit normative units is beneficial even before full combinatorial modeling is introduced. The full framework reached an accuracy of 0.836, producing a gain of 5.5 percentage points over the retrieval-plus-extraction variant and a gain of 16.5 percentage points over the lexical-only baseline.
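The stepwise gains quoted above can be re-derived from the four reported accuracies. The following minimal sketch only restates that arithmetic; the dictionary keys are hypothetical labels for the system variants and are not names used in the study.

```python
# End-to-end accuracies as reported in the ablation paragraph.
# The keys are hypothetical labels chosen for this illustration.
ACCURACY = {
    "lexical_only": 0.671,
    "mpnet_retrieval": 0.723,
    "retrieval_plus_extraction": 0.781,
    "full_framework": 0.836,
}


def gain_pp(baseline: str, variant: str) -> float:
    """Absolute gain of `variant` over `baseline`, in percentage points."""
    return round((ACCURACY[variant] - ACCURACY[baseline]) * 100, 1)


# Gain contributed by each successive layer of the pipeline.
stages = list(ACCURACY)
stepwise = {f"{a} -> {b}": gain_pp(a, b) for a, b in zip(stages, stages[1:])}

# Overall gain of the full framework over the lexical-only baseline.
total = gain_pp("lexical_only", "full_framework")

for step, gain in stepwise.items():
    print(f"{step}: +{gain} pp")
print(f"total: +{total} pp")
```

Under these figures, the three layers contribute gains of similar size, and the stepwise gains add up to the overall difference between the lexical-only baseline and the full framework.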
The observed improvements were also stable under bootstrap-based confidence estimation, which supports the conclusion that the gains of the full framework over the reduced variants are robust to resampling of the evaluation set.
These results show that the formal mapping layer is not merely interpretive. It has measurable operational value and improves decision quality beyond retrieval and extraction alone. To clarify the contribution of each major component, the cumulative effect of adding semantic retrieval, Legal NLP extraction, and formal combinatorial decision modeling is illustrated in Figure 4.
As shown in Figure 4, each additional layer contributes positively to the final decision accuracy. The largest overall gain is achieved after combining semantic retrieval with structured extraction and formal combinatorial mapping, which confirms that the framework’s performance depends on the interaction of all major components rather than on retrieval alone.
In addition to metric-based ablation, the proposed framework was compared with reduced system variants in terms of structural functionality and interpretability.
Table 8 shows that the proposed framework differs from reduced variants not only in accuracy but also in its ability to support structured extraction, formal reasoning, and interpretable decision construction.
Statistical and Comparative Interpretation
The comparative pattern across Table 5 and Table 6 is consistent across all evaluated task families: each additional framework layer contributes positively to final task performance.
To assess the stability of these comparisons, the main retrieval and end-to-end metrics were re-estimated using 1000 bootstrap resamples of the evaluation partition, and 95% confidence intervals were computed for the main end-to-end accuracy scores. The results are presented in Table 9.
The bootstrap intervals are consistent with the reported ranking of system variants: the full framework remains the strongest variant across the evaluated settings, and the comparative ranking of the methods is stable.
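The bootstrap procedure can be sketched as follows. This is an illustration only: the study specifies 1000 resamples and 95% intervals, but the percentile method, the function name, and the synthetic outcome vector (836 correct decisions out of a hypothetical 1000 evaluation items) are assumptions made here, not details reported in the paper.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy over a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        # Resample the evaluation items with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        stats.append(sum(sample) / n)
    stats.sort()
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(correct) / n, stats[lower_idx], stats[upper_idx]


# Synthetic outcomes (assumption): 1 = correct decision, 0 = incorrect decision.
outcomes = [1] * 836 + [0] * 164
point, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Comparing two system variants in this way would typically resample the same evaluation items for both systems, so that the interval reflects the paired difference rather than two independent estimates.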
4.6. Error Analysis
In the present framework, errors may arise at several stages, as summarized in Table 10. Retrieval errors occur when semantically relevant passages are not included in the top-K evidence set. Extraction errors arise when obligations, exceptions, thresholds, or incompatibilities are incorrectly identified or omitted. Formal mapping errors occur when the extracted constraint structure is assigned to an inappropriate NP-complete model. End-to-end decision errors result when one or more upstream mistakes propagate into the final output.
The largest single source of failure is extraction error, followed by retrieval miss. Together, these two categories account for 46.0% of all observed errors, showing that improvements in evidence capture and structured extraction would have the greatest downstream impact. The relative contribution of the major error categories is shown in Figure 5.
Figure 5 indicates that extraction error and retrieval miss are the two dominant sources of failure, suggesting that future improvements should primarily focus on more accurate evidence retrieval and richer legal-policy extraction.
To examine how the error profile changes across task families, the task-wise distribution of error types is presented in Figure 6.
Figure 6 shows that extraction errors dominate eligibility, compliance, and coverage settings, whereas conflict-detection errors become especially prominent in conflict-free policy selection. This indicates that different downstream task families stress different components of the pipeline.
Representative Error Cases
A representative retrieval error occurred when the main obligation was retrieved, but the relevant exception clause was ranked below the top-K threshold. In such cases, the downstream formal model was constructed from incomplete evidence, which led to false-positive eligibility outcomes.
For example, the clause “Applicants are eligible if residence is confirmed, except where temporary registration is pending” produced a retrieval error when the main eligibility condition was retrieved, but the exception clause was ranked outside the top-5 evidence set. The resulting formal representation incorrectly preserved the inclusion condition while omitting the exception, which led to a false-positive eligibility decision.
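The mechanism behind this false positive can be illustrated with a toy ranking. In the sketch below, the passage scores, identifiers, and the simplified decision rule are all hypothetical; only the top-5 cutoff and the structure of the clause (a main condition plus an exception) come from the example above.

```python
def top_k(passages, k):
    """Return the k highest-scoring passages (the evidence set)."""
    return sorted(passages, key=lambda p: p["score"], reverse=True)[:k]


def decide_eligibility(evidence, facts):
    """Toy rule: eligible if the condition holds and no retrieved exception applies."""
    roles = {p["role"] for p in evidence}
    eligible = "condition" in roles and facts["residence_confirmed"]
    # The exception can only fire if its passage made it into the evidence set.
    if "exception" in roles and facts["temporary_registration_pending"]:
        eligible = False
    return eligible


# Hypothetical similarity scores; the exception clause is ranked sixth.
passages = [
    {"id": "p1", "role": "condition", "score": 0.91},
    {"id": "p2", "role": "other", "score": 0.84},
    {"id": "p3", "role": "other", "score": 0.80},
    {"id": "p4", "role": "other", "score": 0.77},
    {"id": "p5", "role": "other", "score": 0.71},
    {"id": "p6", "role": "exception", "score": 0.66},
    {"id": "p7", "role": "other", "score": 0.52},
]
facts = {"residence_confirmed": True, "temporary_registration_pending": True}

decision_top5 = decide_eligibility(top_k(passages, 5), facts)  # exception missed
decision_top6 = decide_eligibility(top_k(passages, 6), facts)  # exception retrieved
print(decision_top5, decision_top6)
```

With the cutoff at five passages the applicant is reported as eligible, whereas including the sixth passage reverses the decision, which is the error pattern described in the example.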
A representative extraction error occurred when a temporal qualifier was attached to the wrong obligation, causing an incorrect threshold or deadline constraint. For example, in the clause “Income confirmation must be submitted not later than ten working days after notification,” the temporal qualifier was attached to the notification event rather than to the submission requirement.
A representative mapping error occurred when a conflict-heavy task was assigned to a coverage formulation instead of a graph-based formulation, which reduced the adequacy of the final decision model.
4.7. Qualitative Case Illustration
A representative case involved a policy query asking for the minimum evidence set required to establish eligibility for a regulated administrative procedure. The retrieval module returned passages concerning identity verification, residency status, income thresholds, and deadline requirements. The extraction layer converted these passages into four eligibility constraints, one exclusion constraint, and one temporal constraint. The formal modeling layer mapped the task to a set cover instance with 6 required policy conditions and 9 candidate document groups. The solver returned a minimum valid covering set of 3 document groups, which matched expert judgment.
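A set cover instance of the same size can be sketched as follows. The condition labels, the contents of the nine document groups, and the exhaustive search are assumptions made for illustration; the paper does not list the actual groups or name the solver that was used.

```python
from itertools import combinations

# Six required policy conditions: four eligibility constraints (e1..e4),
# one exclusion constraint (x1), one temporal constraint (t1).
REQUIRED = {"e1", "e2", "e3", "e4", "x1", "t1"}

# Nine candidate document groups and the conditions each one establishes.
# These assignments are hypothetical.
GROUPS = {
    "G1": {"e1", "e2"},
    "G2": {"e3", "e4"},
    "G3": {"x1", "t1"},
    "G4": {"e1"},
    "G5": {"e2", "e3"},
    "G6": {"e4", "x1"},
    "G7": {"t1"},
    "G8": {"e1", "e3"},
    "G9": {"e2"},
}


def minimum_cover(required, groups):
    """Exhaustive search for a smallest set of groups covering all conditions."""
    names = list(groups)
    for size in range(1, len(names) + 1):
        for chosen in combinations(names, size):
            covered = set().union(*(groups[g] for g in chosen))
            if required <= covered:
                return chosen
    return None  # no combination of groups covers every condition


cover = minimum_cover(REQUIRED, GROUPS)
print(cover)
```

Exhaustive search is adequate only for instances of this size. Because set cover is NP-complete, larger instances would normally be handled with an integer programming solver or a greedy approximation.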
A service-level view of the implemented reasoning workflow is presented in Figure 7.
Figure 7 shows how retrieval, extraction, orchestration, formal reasoning, and explanation modules interact within the implemented system.
To illustrate the practical implementation of the proposed framework, Figure 8 presents the interface of the LexIR prototype. The system integrates semantic retrieval, structured constraint processing, and decision-oriented output generation within a single user-facing environment. In this example, the user submits a constitutional query in Kazakh, and the system returns a concise answer together with supporting legal reasoning and relevant constitutional references. The user query shown in Figure 8 is: “мен интернетте бос сөйлей аламынба,” which may be translated as “Am I free to speak openly on the Internet?” The system returned a concise constitutional answer indicating that freedom of expression is protected in principle, while also noting that legal restrictions may apply in specific cases, such as protected confidential information or other statutory limitations.
A simplified English rendering of the returned answer is as follows: “Yes, freedom of expression is constitutionally protected, and censorship is prohibited. However, this right is subject to legal limitations in specific cases defined by law.” This example illustrates how the system links user-facing legal guidance to retrieved constitutional evidence, structured constraint processing, and explanation-oriented output.
Figure 8 shows how the proposed framework can be presented in an applied setting. The interface exposes the main components of the system, including orchestration, hybrid retrieval, neuro-symbolic reasoning, and evidence-based response generation. This prototype view is important because it demonstrates that the framework is not limited to abstract modeling but can also support interpretable legal-policy interaction in a practical environment.
Returning to the evidence-set case described at the beginning of this subsection, the retrieved evidence included one primary eligibility clause, one supporting evidentiary clause, one temporal restriction, and one exclusion condition. The extracted representation was mapped to a set cover formulation because the decision required identifying the minimum set of evidentiary elements that jointly satisfied all required conditions. The final output remained traceable because each selected document group could be linked back to specific source passages.
4.8. Summary of Results
Overall, the results indicate that the proposed framework is effective across all three levels of evaluation. MPNet-based retrieval substantially improves the ranking of legally relevant passages, structured extraction achieves strong overall quality with an F1-score of 0.818, and the end-to-end decision layer reaches a macro-accuracy of 0.836. The best performance is observed in eligibility-oriented scenarios, while conflict-heavy tasks remain the most difficult. These findings support the usefulness of combining semantic retrieval, Legal NLP, and formal combinatorial decision modeling within a single legal-policy reasoning framework.
5. Discussion
5.1. General Interpretation of the Results
The results show that the proposed framework is most effective when all components work together. The retrieval module improves the evidence base, the extraction module converts retrieved passages into structured policy elements, and the formal modeling module transforms these elements into explicit decision structures. When one of these parts is removed, final performance declines. This indicates that the framework should be understood as an integrated pipeline rather than as a collection of independent methods.
A key finding of the study is that semantic retrieval plays a major role in the overall system. The MPNet-based approach performed better than the lexical baselines across all reported retrieval metrics. This result is important because legal and policy texts often describe similar rules in different words. A purely lexical method may fail when the query and the relevant passage do not share the same surface vocabulary.
Dense retrieval is more suitable in such cases because it captures semantic similarity rather than exact word overlap.
The extraction stage also had a strong effect on the final output. The framework performed best on eligibility and threshold constraints. These categories are often written in a more direct and stable form. By contrast, dependency and conflict constraints were more difficult. These elements often depend on broader context, indirect wording, and interaction between several clauses. This explains why conflict-oriented tasks remained harder at the end-to-end level.
5.2. Discussion of Retrieval and Extraction Results
The retrieval results suggest that the framework benefits from moving beyond keyword-based legal search. In many legal-policy settings, the main problem is not only to find text that looks similar to the query, but to recover text that carries the same legal effect. The gain achieved by MPNet and by the reranking stage supports this view. Better retrieval improves the evidence base, and this improvement propagates through the rest of the pipeline.
The extraction results reveal a more uneven pattern. Eligibility rules, explicit thresholds, and clearly stated exclusions were extracted with relatively high quality. This is encouraging because these elements often form the core of administrative and policy decisions. However, the lower performance on conflict and dependency constraints shows that some types of legal structure are still harder to model reliably. This is not surprising. Such constraints are often distributed across multiple sentences or expressed through exceptions, references, or conditions that are not fully explicit in one passage alone.
These observations are important because they show that the main difficulty is not only in language understanding at the sentence level. It also lies in the reconstruction of logical relations between legal statements. In other words, the problem is partly linguistic, but it is also structural.
5.3. Discussion of the Formal NP-Complete Mapping
One of the main contributions of the framework is the mapping of extracted legal-policy constraints to canonical NP-complete formulations. The mapping results show that set cover was the most frequent model. This suggests that many legal and policy tasks are naturally expressed as coverage problems. In practical terms, this often means finding the smallest set of documents, actions, or conditions that satisfies all stated requirements.
The relatively large number of graph-based cases is also meaningful. The presence of independent set and vertex cover instances indicates that incompatibility and conflict are common features of legal-policy reasoning. Formal model selection is therefore not only a theoretical exercise but also a practical step that determines how legal constraints can be represented and verified computationally.
The subset sum and partition cases were fewer, but still important. These formulations capture numeric thresholds and balancing conditions. Such structures appear in budget rules, eligibility cutoffs, and allocation settings. Their presence in the results confirms that legal-policy reasoning is not limited to symbolic rule matching. In many cases, it also includes quantitative conditions that must be satisfied jointly.
Taken together, the mapping results suggest that legal-policy reasoning can often be formalized through a small set of well-known combinatorial models. This does not mean that every legal task can be reduced to one of these models. It means that a meaningful subset of policy-oriented tasks can be represented in a clear and computationally useful way.
The purpose of this mapping is therefore structural rather than metaphoric: each selected formulation is intended to preserve the dominant decision pattern implied by the extracted constraints.
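The graph-based formulations mentioned above can be illustrated with a small conflict graph. In this sketch, policy options are vertices and extracted incompatibilities are edges, so a conflict-free selection corresponds to an independent set. The option names, the conflict pairs, and the brute-force search are hypothetical and serve only to show the shape of the formulation.

```python
from itertools import combinations

# Hypothetical policy options and pairwise incompatibilities.
OPTIONS = ["A", "B", "C", "D", "E"]
CONFLICTS = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]


def is_independent(subset, conflicts):
    """True if no two selected options are mutually incompatible."""
    chosen = set(subset)
    return not any(a in chosen and b in chosen for a, b in conflicts)


def max_independent_set(options, conflicts):
    """Exhaustive search for a largest conflict-free selection."""
    for size in range(len(options), 0, -1):
        for subset in combinations(options, size):
            if is_independent(subset, conflicts):
                return set(subset)
    return set()


selection = max_independent_set(OPTIONS, CONFLICTS)

# The unselected options form a vertex cover: every conflict touches one of them.
cover = set(OPTIONS) - selection
assert all(a in cover or b in cover for a, b in CONFLICTS)
print(sorted(selection), sorted(cover))
```

The complement relation in the last step is the reason the two formulations appear together: in any graph, a set of vertices is independent exactly when the remaining vertices form a vertex cover.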
5.4. Practical Value of the End-to-End Framework
The end-to-end results indicate that the framework has practical potential for decision support. The strongest performance was observed in eligibility assessment. This is an important result because eligibility is one of the most common tasks in public administration, compliance workflows, and policy implementation. In such settings, the ability to retrieve relevant evidence, extract rule elements, and return a structured decision can reduce manual effort and improve consistency. The lower performance on conflict-free policy selection is also informative. This task requires correct retrieval, correct extraction, and correct graph construction at the same time. Even a small upstream error may change the final formal structure. For this reason, conflict-oriented tasks remain more sensitive to pipeline weaknesses than direct requirement-checking tasks. This does not reduce the value of the framework. Instead, it helps identify where further improvement is needed. The ablation study strengthens this interpretation. Retrieval alone improves the baseline. Retrieval combined with extraction improves it further. Formal combinatorial modeling then adds another clear performance gain. This means that the final reasoning layer is not only theoretically motivated. It also improves real decision quality. In this sense, the framework does not stop at text understanding. It moves from text understanding to structured decision support.
5.5. Error Analysis and Main Sources of Failure
The error analysis points to two main sources of weakness: retrieval miss and extraction error. If the correct passage is not retrieved, the later stages cannot recover the missing evidence. If the correct passage is retrieved but the key normative element is extracted incorrectly, the formal model becomes incomplete or distorted. These two categories account for the largest share of total errors and therefore represent the main bottlenecks of the current framework.
The task-specific analysis adds a more detailed picture. In eligibility, compliance, and coverage tasks, extraction errors were the dominant problem. In conflict-free selection tasks, conflict detection became much more important. This is a useful result because it shows that not all task families stress the same part of the system. A single improvement strategy will therefore not be sufficient for all use cases. Some tasks require stronger semantic retrieval, others require better extraction of exceptions and dependencies, and still others require better formal graph construction.
This finding also supports the idea that evaluation should be multi-level. A framework of this kind cannot be judged only by retrieval metrics or only by final accuracy. Both are needed, and the relation between them must also be examined. Future evaluations should therefore report larger sets of annotated case-level examples in order to better characterize how upstream errors propagate into formal decision outputs. The bootstrap-based stability check suggests that the reported performance differences are stable under resampling, although broader cross-domain testing is still necessary.
5.6. Transparency, Interpretability, and Scope
One advantage of the proposed framework is that it provides interpretable intermediate outputs. The retrieved passages can be inspected. The extracted constraints can be checked. The selected formal model can also be reviewed. This makes the system more transparent than a pure black-box predictor. In legal and policy settings, such transparency is important because users need to understand why a result was produced, not only what result was produced. This traceability is especially relevant in legal settings, where users may need to inspect both the supporting evidence and the formal basis of a decision.
At the same time, the scope of the framework should be stated clearly. The present study does not claim that all legal reasoning can be reduced to NP-complete models. Legal interpretation often involves ambiguity, institutional context, discretion, and domain-specific judgment. These aspects cannot always be represented by a single formal structure. The framework is therefore best understood as an applied tool for a specific class of tasks, especially those where the decision depends on explicit requirements, thresholds, exclusions, or incompatibility relations. This limitation should not be viewed as a weakness of the approach alone. It reflects the broader nature of legal reasoning itself. Some parts of the law are highly structured and suitable for formal modeling. Other parts are more open-ended and interpretive. The framework is designed for the first type of setting.
5.7. Limitations and Future Work
The present study emphasizes formalization, interpretability, and decision-structure alignment rather than worst-case complexity analysis or exact optimization of general combinatorial instances.
Several limitations should be acknowledged. First, the final decision quality depends strongly on the quality of the corpus and on the segmentation strategy. If a legally important statement is split across passages or omitted during preprocessing, later stages may be affected. Second, the quality of the framework also depends on the extraction layer. Exceptions, nested conditions, and cross-references remain difficult to normalize correctly. Third, the current approach assumes that a suitable NP-complete formulation can be selected for the extracted constraint set. In complex cases, more than one formal model may be plausible. Another limitation is that the current empirical validation focuses on a bounded legal-policy corpus and a finite set of canonical task families. Broader cross-domain validation is still needed.
Future work should address these issues in several ways. One direction is to improve multi-hop retrieval and cross-reference handling. Another is to strengthen the extraction of dependency and conflict relations. A third direction is to develop richer mapping strategies in which a task may be represented through hybrid or multi-stage formal models rather than a single formulation. It would also be useful to evaluate the framework on broader legal domains, multilingual corpora, and more complex regulatory settings.
6. Conclusions
This study introduced an applied framework for turning legal and policy text into formal decision models. The framework combines MPNet-based semantic retrieval, Legal NLP and policy information extraction, and mapping to canonical NP-complete problem formulations. Its main purpose is to move beyond simple legal search and support structured decision-oriented analysis. The results show that the proposed approach is most effective when all parts of the pipeline are used together. The retrieval layer improves access to relevant legal evidence. The extraction layer converts that evidence into structured policy elements such as eligibility rules, exclusions, thresholds, dependencies, and conflicts. The formal layer then maps these elements to decision models such as set cover, set packing, subset sum, partition, independent set, and vertex cover. The ablation results confirm that each stage adds measurable value to the final outcome.
The study also shows that legal-policy tasks are not uniform. Eligibility assessment produced the strongest results, while conflict-free policy selection was more difficult. The error analysis helps explain this difference. Retrieval miss and extraction error were the largest sources of failure, while conflict detection played a central role in graph-based tasks. These findings show where future improvements are most needed. An important conclusion of the study is that a meaningful part of legal and policy reasoning can be represented through a limited set of well-known combinatorial models. This should not be read as a claim about all legal reasoning. Many legal tasks remain open, interpretive, and context-dependent. However, for structured policy settings with explicit rules and constraints, the proposed framework provides a useful and transparent computational approach.
The practical value of the framework lies in its ability to connect three levels of analysis: semantic retrieval, structured extraction, and formal reasoning. Because the intermediate outputs can be inspected, the system also supports a higher degree of transparency than a purely black-box decision model. This is especially important in legal and administrative settings, where users need to understand both the result and the basis for that result.
Future work should focus on stronger handling of exceptions, dependencies, and cross-references, as well as better modeling of incompatibility relations and multi-step evidence retrieval. It will also be important to evaluate the framework on larger and more diverse legal corpora, including multilingual policy collections and more complex regulatory settings.
Overall, the study shows that semantic retrieval, policy information extraction, and NP-complete formal modeling can be combined in a single framework for legal-policy decision support. In this sense, the proposed approach offers a practical bridge between legal text analytics and formal computational reasoning.