Exploring Legislative Textual Data in Brazilian Portuguese: Readability Analysis and Knowledge Graph Generation

Oliveira, Gisliany Lillian Alves de; Santos, Breno Santana; Silva, Marianne; Silva, Ivanovitch

doi:10.3390/data10070106

by Gisliany Lillian Alves de Oliveira ¹, Breno Santana Santos ², Marianne Silva ³ and Ivanovitch Silva ¹,*

¹ UFRN-PPgEEC, Postgraduate Program in Electrical and Computer Engineering, Federal University of Rio Grande do Norte, Natal 59078-970, Brazil
² Information System Department, Federal University of Sergipe, Itabaiana 49400-000, Brazil
³ Campus Arapiraca, Federal University of Alagoas, Penedo 57200-000, Brazil
* Author to whom correspondence should be addressed.

Data 2025, 10(7), 106; https://doi.org/10.3390/data10070106

Submission received: 24 April 2025 / Revised: 7 June 2025 / Accepted: 23 June 2025 / Published: 1 July 2025


Abstract

Legislative documents are crucial to democratic societies, defining the legal framework for social life. In Brazil, legislative texts are particularly complex due to extensive technical jargon, intricate sentence structures, and frequent references to prior legislation. The country’s civil law tradition and multicultural context introduce further interpretative and linguistic challenges. Moreover, the study of Brazilian Portuguese legislative texts remains underexplored, lacking legal-specific models and datasets. To address these gaps, this work proposes a data-driven approach utilizing large language models (LLMs) to analyze these documents and extract knowledge graphs (KGs). A case study was conducted using 1869 proposals from the Legislative Assembly of Rio Grande do Norte (ALRN), spanning January 2019 to April 2024. The Llama 3.2 3B Instruct model was employed to extract KGs representing entities and their relationships. The findings support the method’s effectiveness in producing coherent graphs faithful to the original content. Nevertheless, challenges remain in resolving entity ambiguity and achieving full relationship coverage. Additionally, readability analyses using metrics for Brazilian Portuguese revealed that ALRN proposals require superior reading skills due to their technical style. Ultimately, this study advances legal artificial intelligence by providing insights into Brazilian legislative texts and promoting transparency and accessibility through natural language processing techniques.

Keywords: legislative documents; knowledge graphs; large language models; laws; readability analysis; exploratory data analysis; natural language processing

1. Introduction

The legislative branch is a core part of democratic societies, tasked with drafting, debating, and enacting laws that govern a nation [1]. Legislative documents, such as bills, statutes, and amendments, contain the decisions and priorities of policymakers, reflecting social values and addressing significant public concerns [2]. Yet, these documents are often textually complex, as they are written in formal, technical, and precise legal jargon [3]. While indispensable for legal certainty, this precision often makes such texts incomprehensible to the general public, limiting their understanding and engagement with the legislative process [4].

In the context of Brazilian Portuguese, the linguistic and structural traits of the language also contribute to the complexity of legislative documents. Despite the growing interest in applying natural language processing (NLP) techniques to legal texts, research remains predominantly focused on English documents [5]. This leaves legislative texts in underrepresented languages, such as Brazilian Portuguese, relatively unexplored [5]. Consequently, there is a need for tools and methods to analyze these legal documents within their proper linguistic and cultural contexts [5,6].

Hence, bridging this gap reveals opportunities for leveraging legislative texts as data to foster various applications [7]. For example, exploring legal documents can support the evaluation of a law’s readability while also addressing issues of accessibility and transparency [4]. Furthermore, extracting structured knowledge from unstructured legislative texts can support policy interpretation, legal research, and the development of tools to assist legislative activities [8].

Despite these potential benefits, the analysis of legislative texts remains challenging due to the intricate nature of legal language, the need for domain-specific expertise, and the considerable length of the documents [6]. Moreover, research on this topic in Brazilian Portuguese demands greater effort, given the limited availability of annotated datasets and pre-trained models for legal texts in this language [5].

One promising avenue for addressing these limitations is the adoption of knowledge graphs (KGs), which are dynamic graph data structures consisting of a set of typed entities, their attributes, and meaningfully named relationships [9]. KGs consist of nodes representing the entities of interest and the edges defining the relationships between these entities [10]. By capturing hierarchical relationships within complex textual data, KGs provide expressive and structured representations that can benefit a wide range of NLP tasks [11,12].

Complementarily, large language models (LLMs) can play a central role in KG construction, primarily through advanced prompt engineering techniques. By leveraging the ability of LLMs to understand context and generate coherent outputs, these models can assist in extracting entities, relationships, and attributes directly from texts [13]. Hence, combining the strengths of LLMs and KGs may significantly enhance the analysis of legislative texts, creating opportunities for improved legal services and fostering the development of intelligent systems (e.g., summarization, similarity search, legal reasoning) [9].

Considering the aforementioned challenges and opportunities, this study aims to contribute to the field of legislative analysis by providing a data-driven and LLM-supported methodology for extracting insights and generating KGs from legislative documents. Furthermore, the proposed approach was experimentally validated through a case study using a dataset of Brazilian legislative documents sourced from the Legislative Assembly of Rio Grande do Norte (ALRN) [14]. Specifically, this work presents (i) an exploratory data analysis (EDA) to uncover patterns and characteristics in the text; (ii) a readability assessment applying the Flesch Reading Ease index and the Dale–Chall Readability formula; and (iii) an LLM-supported methodology to extract structured information as knowledge graphs from legislative texts.

Therefore, this paper contributes to legal artificial intelligence by focusing on legislative texts in Brazilian Portuguese, a relatively under-resourced language in this field. It introduces a curated dataset of Brazilian legislative documents and presents insights into legislative trends, including temporal productivity, thematic priorities, textual intelligibility, and knowledge representation. These resources and findings open pathways for future research in areas such as similarity search, information retrieval systems for policy investigation, automatic summarization, and knowledge graph construction to visualize relationships within the legislative corpus.

The remainder of this article is organized as follows. Section 2 presents a literature review, whereas Section 3 details the proposed approach. Section 4 explains the empirical evaluation of our solution, and Section 5 discusses the main findings. Section 6 then details the threats to the validity of our study. Finally, Section 7 presents the concluding remarks and outlines directions for future studies.

2. Related Works

This section focuses on identifying the existing literature related to exploring legal texts (mainly within the legislative domain), especially that in Brazilian Portuguese. Considering this type of data, the related studies are discussed from two different perspectives: (i) NLP and readability assessment; and (ii) the use of LLMs to extract knowledge graphs.

2.1. NLP and Readability Analysis of Legal Texts

With the digitization of legal sources, particularly legislative ones, some projects have begun exploring the concept of “Law as Data” to extract data from legal texts and improve information retrieval using semantic and legal ontologies [15].

In Brazil, the Chamber of Deputies and several universities have developed the Ulysses Project to increase transparency, improve the Chamber’s engagement with citizens, and support legislative activities through complex analyses [16]. The project has already delivered various solutions, including named entity recognition (NER) [17], document and information retrieval based on relevance [16,18,19,20,21], and the analysis of popular comments on legislation [22,23]. These solutions rely on legislative project data and public opinion comments, both of which are provided by the Chamber. Generally, the authors employ well-established approaches in the literature, comparing statistical methods, classical models (e.g., n-gram models, logistic regression, naïve Bayes), neural networks, and transformer-based models and tools (e.g., BERT, BERTimbau, BERTopic, multilingual BERT), depending on the task at hand.

Regarding the readability analysis, a few studies have focused on legal documents. One example is a study on simplifying Portuguese legal texts using documents produced by the Brazilian judiciary [4]. This study evaluated four approaches to automatic text simplification (ATS): the unsupervised models MUSS(EN) and MUSS(PT) and two supervised methods using transformers and neural machine translation. The dataset comprised 200 summaries from two prominent Brazilian federal courts. To measure the effectiveness of these methods, the Flesch Reading Ease index was applied, revealing that the simplified texts achieved higher readability scores compared to the original versions, thus making them more readable.

In another study [6], the authors developed the RegBR framework to enhance the classification and analysis of industry-specific regulations within the Brazilian government. The authors developed a centralized database of Brazilian federal legislation by employing automated routines of extract, transform, and load (ETL), alongside data mining and machine learning techniques. They utilized state-of-the-art NLP methods to classify regulatory texts according to their economic sectors, including statistical models (e.g., logistic regression, support vector machine), word embedding models (e.g., word2vec, Glove), neural networks, and transformer-based models. The evaluation metrics included linguistic complexity, restrictiveness, industry citation relevance, and a measure of interest, which were validated against historical regulatory changes in Brazil. Specifically considering the linguistic complexity, they applied median sentence length, Shannon’s entropy, and the frequency of conditional words in the text.

These studies demonstrate the growing interest in leveraging NLP techniques to analyze legal texts and create solutions to assist activities in this area. They also highlight the potential of readability metrics for evaluating text simplification and classification, addressing the challenges of interpreting complex legislative and regulatory documents.

2.2. Knowledge Graphs and LLMs Applied to Legal Documents

Knowledge graphs have emerged as a tool for structuring unstructured legal texts, enabling semantic search, relationship discovery, and advanced analytics. Existing studies have focused on extracting entities and relationships from legal documents, often relying on rule-based systems or traditional machine learning models [24,25,26]. However, in recent years, numerous works have leveraged the synergistic relationship between large language models and knowledge graph-based solutions, highlighting their mutual benefits and complementary strengths [27,28,29,30,31].

In one study [30], the authors proposed a joint knowledge enhancement model (JKEM) to construct a Chinese legal knowledge graph (CLKG). By embedding prior legal knowledge into an LLM and fine-tuning with a prefix-tuning approach, they preserved most model parameters while optimizing prefix embeddings for enhanced knowledge extraction. Using a legal knowledge corpus (The Criminal Law of the People’s Republic of China: Annotated Code), JKEM achieved an accuracy of 90.76%, recall of 91.05%, and F1 score of 90.90%, outperforming CRF, BiLSTM, BERT, and ChatGLM-6B. However, limitations were noted in recognizing case entities.

Another study introduced LeGalFormer [29], a model for legal similar case retrieval (LSCR) that combines graph representation learning with a transformer architecture. By encoding cases into a legal case embedding graph (LCEG) and employing a graph transformer, the model addressed pre-trained language models’ challenges with lengthy, complex legal texts. Evaluated on 3113 annotated legal cases, LeGalFormer achieved approximately 5% improvement in Precision and NDCG (normalized discounted cumulative gain) metrics over traditional and fine-tuned models. Nonetheless, the study identified limitations, such as over-smoothing in graph neural networks, and persistent challenges with the complexity of legal text.

A microservices-based architecture designed for complex linguistic tasks in the legal domain was recently proposed [27], integrating LLMs with KGs. The approach combined document retrieval, KG integration, and specialized LLM refinement to enhance legislative reference extraction from legal texts. Using unstructured legal documents, the system constructed domain-specific and constraint KGs, leveraging zero-shot prompt engineering for effective information extraction. Results showed improved coherence and correctness, validated through a microservice that ensured compliance with the constraint KG, reducing hallucinations. However, the KG creation microservice requires human validation to develop the constraint KG fully, limiting scalability and efficiency in practical applications.

Regarding legislative documents, there is a solution named Legis AI Platform to enhance legislative processes, mainly focusing on Italian legislation [28]. The methods employed included the creation of a legislative property graph, which organized laws and their interconnections, and the integration of LLMs such as Llama-3 for report generation and analysis. The data utilized consisted of laws extracted from Normattiva, the official source of Italian laws, which were enriched through an ETL pipeline to capture relevant properties of nodes and edges in the graph. The results indicated that the platform effectively supported lawmakers by providing tools for comparing the quality of new laws against existing legislation, thereby facilitating informed decision-making. However, limitations were noted, particularly concerning the potential for LLMs to produce hallucinations, which could undermine the reliability of generated legal documents. The study emphasized the importance of ensuring that the models remain neutral and do not provide recommendations, thus maintaining the integrity of the legislative drafting process. Future work would aim to validate the platform’s effectiveness through user studies and empirical evaluations, focusing on continuous improvement and user engagement.

Considering Brazilian Portuguese, only one study proposed an RDF-based graph to represent and query sections of Brazilian legal norms [26]. The proposed graph was grounded in an ontological view, enabling the description of a legal system’s general structure and the organization of legal documents. The authors formulated several SPARQL queries to retrieve sections of legal documents related to specified sets of words, and the results were significant.

The reviewed literature highlights the advancements in analyzing legal texts through readability metrics, NLP techniques, and knowledge graph-based approaches. However, significant gaps remain, as no studies specifically target the legislative domain in Brazilian Portuguese and leverage state-of-the-art LLMs for knowledge graph construction and reasoning. Given the promising results of existing research on NLP for legal text analysis and the potential of LLMs for knowledge representation, this presents a clear opportunity for further in-depth study in this area.

3. Proposed Methodology

This section outlines the proposed approach for extracting knowledge graphs from legislative texts. Figure 1 provides an overview of the methodology. The pipeline begins with text sanitization—a step designed to ensure data consistency and quality—which generates a cleaned and normalized corpus from multiple legislative documents. Subsequently, in the KG extraction step, a large language model analyzes the corpus to extract entities and relationships relevant to the legislative context. Next, the postprocessing phase performs tasks aiming to ensure the consistency of the extracted entities and generates the necessary code for storing the KGs in a graph database. Once constructed, the KGs serve as a robust foundation for developing innovative solutions within the legislative domain. In addition to this main pathway, the methodology supports exploratory data analysis to identify key patterns and distributions within the corpus. This can be further enriched through readability analysis, which provides insights into the complexity of legislative texts.

The following sections provide a more detailed explanation of each step in the methodology, beginning with textual sanitization and progressing through KG extraction and the postprocessing. Other analysis steps, including exploratory data analysis and readability evaluation, will be thoroughly discussed to highlight their role in uncovering patterns and assessing text complexity.

3.1. Text Sanitization

The pipeline begins by sanitizing the document text to address inconsistencies and prepare the data for textual analysis and KG extraction. Encoding errors are corrected using the ftfy Python library [32], which restores improperly encoded characters. Subsequently, HTML tags are stripped from the text using regular expressions (regex), which ensures cleaner formatting. Isolated diacritics and misplaced special characters are likewise removed through regex patterns. Next, leading, trailing, and multiple consecutive spaces are removed, producing a standardized text version. For the computation of textual statistics, punctuation is also eliminated and stop words are removed using the NLTK package [33]. These preprocessing routines are necessary for reducing noise and artifacts that can compromise the reliability of the analysis and KG extraction.
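The sanitization steps can be illustrated with a minimal sketch that relies only on the Python standard library. It is not the code used in the study: the ftfy encoding repair and the NLTK stop-word list are not called here, and the function names, the regex patterns, and the tiny stop-word set are hypothetical choices made for illustration.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")      # naive HTML tag pattern (assumption)
SPACE_RE = re.compile(r"\s+")        # runs of any Unicode whitespace
PUNCT_RE = re.compile(r"[^\w\s]")    # anything that is neither word char nor space
# Spacing accents that stand alone between whitespace (illustrative pattern)
ISOLATED_DIACRITICS_RE = re.compile(r"(?<!\S)[´`^~¨]+(?!\S)")

# Tiny illustrative stop-word set; the study uses the NLTK list instead.
STOP_WORDS = frozenset({"a", "o", "e", "de", "do", "da", "em", "para"})


def sanitize(text: str) -> str:
    """Strip HTML remnants, isolated diacritics, and redundant spaces."""
    text = html.unescape(text)                    # decode entities such as &nbsp;
    text = TAG_RE.sub(" ", text)                  # drop HTML tags
    text = unicodedata.normalize("NFC", text)     # compose letter + accent pairs
    text = ISOLATED_DIACRITICS_RE.sub(" ", text)  # drop stand-alone accents
    return SPACE_RE.sub(" ", text).strip()        # collapse and trim whitespace


def tokens_for_statistics(text: str, stop_words=STOP_WORDS) -> list:
    """Tokens used for textual statistics: no punctuation, no stop words."""
    words = PUNCT_RE.sub(" ", sanitize(text)).split()
    return [w for w in words if w.lower() not in stop_words]


print(sanitize("<p>Projeto&nbsp;de   Lei</p> ´ n. 10 "))
print(tokens_for_statistics("Fica instituído o Programa, de Saúde."))
```

Keeping `sanitize` separate from `tokens_for_statistics` mirrors the text: the cleaned corpus keeps its punctuation for KG extraction, whereas punctuation and stop words are dropped only when computing statistics.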

3.2. KG Extraction

Once the corpus has been adequately preprocessed, the KG generation begins, aiming to build a knowledge graph that captures entities and relationships encapsulated in the legislative corpus. As LLMs typically exhibit outstanding performance in natural language understanding and generation [13], a model can be selected and utilized to extract relevant concepts and connections from unstructured textual data. This process involves designing a targeted prompt, in which the task is formulated in natural language (e.g., Portuguese) to guide the model’s output effectively.

KG extraction frequently involves several NLP techniques, such as NER and relation extraction (RE), which form the foundation of this process [9]. NER is an information extraction sub-task of NLP that aims to identify and classify entities in text, such as names, locations, and organizations [34]; here, it is used to detect the entities and concepts of the KGs. Relation extraction, in turn, is the NLP task of extracting connections between entities within a text [35].

Other approaches, namely, entity linking, coreference resolution, entity disambiguation, or event extraction, further enrich the final KG [36,37]. In this study, besides the core tasks (NER and RE), coreference resolution—to identify all the references to the same entity, such as pronouns [36]—and entity disambiguation—to solve ambiguities in entity mentions [9]—are also considered when formulating the prompt.

Next, to guide the model in extracting knowledge, a custom prompt must be designed to elicit named entity recognition and disambiguation, relation extraction, and coreference resolution. Depending on the legislative documents and their language, the prompt may be iteratively refined to ensure domain specificity and alignment with the legislative language.

Thus, the final prompt directs the LLM to produce a structured list of relations in the following format: [ENTITY 1, ENTITY TYPE 1, RELATION, ENTITY 2, ENTITY TYPE 2]. To maximize the extraction quality and completeness, the following instructions are provided:

The texts provided include laws, decrees, resolutions, amendments, motions, messages, and indications. Information should only be captured once in cases of redundancy within the text.
Entities must be categorized into specific types, including Person, Organization, Law, Location, Theme, Occupation, Event, Target Audience, Products/Services, Program, and Time.
Entities referred to by pronouns or variations in naming (e.g., “José da Silva”, “José”, “he”) must be resolved to their most complete and consistent form (e.g., “José da Silva”).
Relations must be explicit, concise, and directionally meaningful, typically derived from verbs or expressions within the text. Relationships include legislative actions such as “aprovar” (to approve), “propor” (to propose), “revogar” (to revoke), and “decretar” (to decree).
The model is explicitly instructed not to infer or add information beyond what is present in the text.

Additionally, for complex tasks, it is also helpful to provide a representative example of input and expected outputs as part of the prompt [9]. Accordingly, the prompt includes an example of a fictional law recognizing an environmental institute as a public utility, along with the expected output list. A full version of the prompt in Portuguese, designed according to established guidelines [38], is provided in Appendix A.
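To make the output format concrete, the following sketch parses a model response into five-element relations and discards malformed rows. It assumes the response is serialized as a JSON list of lists and that entity types are emitted in English; both are assumptions, since the actual prompt is in Portuguese and its exact serialization is given in Appendix A. The names `Relation` and `parse_relations` are hypothetical.

```python
import json
from typing import NamedTuple

# Entity types listed in the prompt instructions (English labels assumed).
ENTITY_TYPES = {
    "Person", "Organization", "Law", "Location", "Theme", "Occupation",
    "Event", "Target Audience", "Products/Services", "Program", "Time",
}


class Relation(NamedTuple):
    head: str
    head_type: str
    relation: str
    tail: str
    tail_type: str


def parse_relations(raw: str) -> list:
    """Keep well-formed, type-valid, non-duplicated relations."""
    seen = set()
    relations = []
    for row in json.loads(raw):
        is_quintuple = (
            isinstance(row, list)
            and len(row) == 5
            and all(isinstance(item, str) for item in row)
        )
        if not is_quintuple:
            continue  # skip rows that do not follow the requested format
        rel = Relation(*(item.strip() for item in row))
        if rel.head_type not in ENTITY_TYPES or rel.tail_type not in ENTITY_TYPES:
            continue  # skip entity types outside the prompt's list
        if rel in seen:
            continue  # redundant information is captured only once
        seen.add(rel)
        relations.append(rel)
    return relations


# Fictional response in the spirit of the prompt's example (not real data).
raw_output = json.dumps([
    ["Lei 123", "Law", "reconhecer", "Instituto Verde", "Organization"],
    ["Lei 123", "Law", "reconhecer", "Instituto Verde", "Organization"],
    ["Lei 123", "Unknown", "citar", "Natal", "Location"],
    ["incomplete", "row"],
])
print(parse_relations(raw_output))
```

Validating the structure before postprocessing is a design choice of this sketch: small instruction-tuned models do not always respect the requested format, so rows that cannot be interpreted are dropped rather than repaired.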

3.3. Postprocessing

After executing the model inference, the resulting array (the structured list previously mentioned) is processed to ensure consistency in the names of the identified entities. To address inconsistencies, the Levenshtein distance [39] is used as a similarity metric. Specifically, if an entity’s name contains another entity’s name and their Levenshtein distance is below a threshold of five (Levenshtein distance < 5), the longer name is retained for both entities. This threshold value was empirically determined to merge minor spelling variations in entity names while avoiding merging distinct entities, thus partially addressing the entity disambiguation challenge.
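The merging rule can be sketched as follows. This is an illustrative reading of the rule, not the study's implementation: `levenshtein` is a standard dynamic-programming edit distance, and `canonical_names` is a hypothetical helper that maps each name to the longest name that contains it within the threshold.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def canonical_names(names, threshold: int = 5) -> dict:
    """Map each name to the longer name retained for it, if any."""
    unique = sorted(set(names))
    mapping = {}
    for name in unique:
        candidates = [
            other for other in unique
            if other != name
            and name in other                         # containment condition
            and levenshtein(name, other) < threshold  # distance condition
        ]
        # Single pass only: chains (A in B in C) are not resolved transitively.
        mapping[name] = max(candidates, key=len) if candidates else name
    return mapping


names = ["ALRN", "ALRN/RN", "Governo", "Governo do Estado"]
print(canonical_names(names))
```

Because one name is contained in the other, the distance reduces to the difference in length, so the rule merges pairs that differ by fewer than five characters. In the example, "ALRN" is mapped to "ALRN/RN" (distance 3), while "Governo" and "Governo do Estado" (distance 10) remain distinct.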

Finally, the last step utilizes a graph database to store the extracted entities and relations, allowing for easy querying and visualization of the information. The output array generated from the model inference can be programmatically converted into the appropriate code (e.g., Cypher) for database integration.
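One possible conversion is sketched below. The graph schema is an assumption: each entity becomes a node labeled with its type and carrying a `name` property, and each relation becomes a typed edge. The helper names are hypothetical. Entity names are passed as query parameters, whereas labels and relationship types, which Cypher does not allow as parameters, are reduced to safe identifiers and quoted with backticks.

```python
import re


def to_identifier(text: str) -> str:
    """Reduce free text to letters, digits, and underscores."""
    cleaned = re.sub(r"\W+", "_", text.strip()).strip("_")
    return cleaned or "Unknown"


def relation_to_cypher(head, head_type, relation, tail, tail_type):
    """Return a Cypher MERGE statement and its parameters for one relation."""
    head_label = to_identifier(head_type)
    tail_label = to_identifier(tail_type)
    rel_type = to_identifier(relation).upper()
    query = (
        f"MERGE (h:`{head_label}` {{name: $head}}) "
        f"MERGE (t:`{tail_label}` {{name: $tail}}) "
        f"MERGE (h)-[:`{rel_type}`]->(t)"
    )
    return query, {"head": head, "tail": tail}


query, params = relation_to_cypher(
    "Lei 123", "Law", "reconhecer", "Instituto Verde", "Organization"
)
print(query)
print(params)
```

Using `MERGE` rather than `CREATE` means that running the statement again for the same relation, or for another relation that mentions the same entity, reuses the existing node instead of duplicating it.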

3.4. Exploratory Data and Readability Analyses

With the sanitized corpus, it is feasible to perform an exploratory textual analysis to extract insights and valuable information from the legislative texts for decision-making and/or improving any step of the proposed methodology.

These analyses can provide an overview of the activities of the selected legislative entity and enable the analysis of its corpus from several perspectives, such as thematic distributions, statistical characterization of the textual information, temporal trends, and linguistic complexity analysis using readability indicators.

Readability analysis, in turn, provides empirical evidence about the intrinsic complexity of legislative texts. To accurately assess this, it is essential to choose language-specific metrics that account for unique linguistic features, such as morphology and syntax. Moreover, readability analysis may guide future developments aimed at simplifying legal language or enhancing the comprehensibility of AI-generated representations of legal information.

4. Materials and Methods

This section presents a case study evaluating our approach, based on an experimental process consistent with methodological frameworks developed in previous research [40,41]. The following subsections focus on the definition and planning of this empirical evaluation. The last subsection presents its operational process.

4.1. Goal Definition

The main goal of this study is to evaluate the suitability and feasibility of the proposed methodology for data analysis, textual readability assessment, and LLM-supported knowledge graph extraction in the corpus of proposals from a legislative assembly.

4.2. Planning

This subsection details the case study design, comprising participant and artifact selection, research questions, instrumentation, and operation.

4.2.1. Participant and Artifact Selection

After defining this empirical evaluation goal, the process of selecting participants and objects began. Firstly, for convenience, ALRN was chosen as the provider of legislative documents. This decision is based on its prominent role as the primary legislative entity in the state of Rio Grande do Norte. Additionally, ALRN is recognized for its innovative contributions to developing and implementing pioneering AI-based solutions, which have significantly enhanced transparency, citizen engagement, and data accessibility, setting it apart from other legislative assemblies in Brazil [42].

Once the legislative entity was defined, ALRN provided a representative data sample. The data used in this study consist of 1869 entries of legislative proposals registered from 1 January 2019 to 21 April 2024. Each proposal is associated with a single document, capturing its most recent textual version. Consequently, each document’s class reflects its proposal’s status within the legislative process at the time of data collection. Moreover, the dataset predominantly includes proposals that require analogous matters analysis, a procedure that examines content similarities among documents to prevent conflicts or redundancies with existing legislation. Therefore, the dataset excludes routine legislative documents, such as minutes, reports, or orders.

In addition to the textual content, the dataset includes metadata associated with the proposals and their respective documents, comprising 15 columns (see Table 1 for a detailed description of all features).

Next, for the KG Extraction step, the Llama 3.2 3B Instruct model [43] was chosen for the following reasons: (i) its state-of-the-art performance in instruction-following; (ii) its context length of 128K tokens; (iii) its strong multilingual text generation capabilities despite its lightweight size [44]; and (iv) its suitability for the computational constraints of the target environment at ALRN.

Finally, for KG storage, the Neo4j database was adopted for the following reasons: (i) its graph-oriented database architecture, which naturally reflects the structure of KG entities and relationships [45]; (ii) its high performance in complex queries through the Cypher language [46], which is suitable for navigating and exploring interconnected data; and (iii) its active community, comprehensive documentation, and a robust ecosystem to perform graph analytics and graph data science tasks [47].

4.2.2. Research Questions

After defining the participants and artifacts, the research questions are structured as follows. The study is guided primarily by the goal of exploring LLM-assisted KG extraction in legal texts written in Brazilian Portuguese. Accordingly, RQ1 and RQ2 directly address the core methodological contribution of this work.

In turn, RQ3 to RQ7 refer to exploratory and readability analyses of the dataset itself. They are not outcomes of the LLM pipeline, but rather serve as complementary investigations (the alternative pathways indicated by dashed lines in Figure 1) that offer essential insights into the legislative corpus. These questions provide context to the data, support initial claims about its characteristics (e.g., verbosity, complexity), and help inform future improvements to legal NLP tasks.

RQ1: Can language models and prompt engineering effectively extract knowledge graphs from unstructured Brazilian Portuguese legislative texts?
RQ2: Is the proposed methodology suitable for supporting corpus characterization, textual readability evaluation, and KG generation?
RQ3: What is the distribution of legislative proposals by type?
RQ4: What are the most prevalent thematic areas in the legislative proposals?
RQ5: Based on the distribution of proposals, is it possible to detect patterns in legislative activities over time?
RQ6: Based on textual statistics, can it be determined how verbose legislative proposals are?
RQ7: Based on textual readability indicators, how readable are the texts of ALRN’s legislative proposals?

The metrics to evaluate these questions are (1) number of proposals; (2) character count; (3) word count; (4) average word length; (5) Flesch Reading Ease index, customized for the Brazilian Portuguese language [48]; and (6) an adapted and specialized version of the Dale–Chall Readability statistic for Portuguese [49]. The Flesch Reading Ease index and the adapted Dale–Chall Readability formula were chosen for this study because they were effectively adjusted and validated for Brazilian Portuguese, as supported by the NILC-Metrix system [49]. These adaptations ensure that the metrics capture the linguistic nuances of Brazilian Portuguese, which are critical for accurately assessing the readability of legislative texts in this language.

The Flesch Reading Ease index (F) assesses textual complexity based on average word and sentence lengths (see Equation (1)). The premise of this metric is that the shorter the words and sentences, the easier the text is to read [49]. In other words, the F index indicates the relative ease with which a reader can comprehend a given document. That is, the higher the metric value, the lower the textual complexity [49].

F = 248.835 - 1.015 \times \left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6 \times \left(\frac{\text{total syllables}}{\text{total words}}\right) \qquad (1)
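As a worked illustration of Equation (1), the sketch below evaluates the index from precomputed counts. The function name is hypothetical and the counts are invented for the example; in the study, the metric is computed by the NILC-Metrix system, which also handles sentence splitting and Portuguese syllable counting.

```python
def flesch_reading_ease_pt(total_words: int, total_sentences: int,
                           total_syllables: int) -> float:
    """Flesch Reading Ease index adapted to Brazilian Portuguese, Equation (1)."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 248.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Invented counts: 100 words, 5 sentences, 220 syllables.
# 248.835 - 1.015 * 20 - 84.6 * 2.2 = 248.835 - 20.3 - 186.12 = 42.415
print(round(flesch_reading_ease_pt(100, 5, 220), 3))
```

Doubling the average sentence length to 40 words while keeping 2.2 syllables per word lowers the score by a further 20.3 points, which shows how long sentences reduce the index.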

Furthermore, the Dale–Chall Readability metric (D) combines the number of unfamiliar words with the average number of words per sentence [49]. For clarity, unfamiliar words are those not included in the Dictionary of Simple Words [50]. In this case, the higher the D metric value, the higher the textual complexity. Equation (2) describes the Dale–Chall Readability metric.

D = 0.1579 \times \left(\frac{\text{unfamiliar words}}{\text{total words}} \times 100\right) + 0.0496 \times \left(\frac{\text{total words}}{\text{total sentences}}\right) + 3.6365 \qquad (2)
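Equation (2) can be evaluated in the same way. The sketch below is illustrative only: the function name is hypothetical, the counts are invented, and identifying unfamiliar words against the Dictionary of Simple Words [50] is left to NILC-Metrix, as in the study.

```python
def dale_chall_pt(unfamiliar_words: int, total_words: int,
                  total_sentences: int) -> float:
    """Adapted Dale-Chall Readability metric for Portuguese, Equation (2)."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    unfamiliar_pct = unfamiliar_words / total_words * 100
    words_per_sentence = total_words / total_sentences
    return 0.1579 * unfamiliar_pct + 0.0496 * words_per_sentence + 3.6365


# Invented counts: 30 unfamiliar words out of 100, spread over 5 sentences.
# 0.1579 * 30 + 0.0496 * 20 + 3.6365 = 4.737 + 0.992 + 3.6365 = 9.3655
print(round(dale_chall_pt(30, 100, 5), 4))
```

Unlike the Flesch index, this metric grows with complexity: each additional percentage point of unfamiliar words adds about 0.16 to D, whereas each additional word per sentence adds about 0.05.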

It is worth mentioning that both metrics, the Flesch Reading Ease index and the adapted Dale–Chall Readability formula, are calculated using the NILC-Metrix system [49].

4.2.3. Instrumentation

The materials and resources used in this work are as follows:

Python Data Science ecosystem (pandas [51], NumPy [51], Matplotlib [52], seaborn [52] and others), provided by Anaconda platform [53] or Google Colab [54];
Anaconda’s JupyterLab;
Llama 3.2 3B Instruct model [43] and Hugging Face Transformers library [55], which supports prompt-based interaction with a model;
Neo4j [45,47], for modeling, visualization, analysis, manipulation, and persistence of KGs;
The corpus related to the legislative proposals of ALRN, previously discussed in Section 4.2.1 and available at Mendeley Data [14];
The Jupyter Notebooks and Python Scripts that contain all source code to perform the data analysis, which are available in the GitHub repository [56].

4.3. Operation

This subsection describes the preparation and execution of the empirical evaluation. The operational process began with configuring the computational environment on a local machine equipped with an NVIDIA RTX 3060 GPU with 6 GB of VRAM, which was used in the case study.

The institution provided a representative sample of ALRN’s legislative proposals. Subsequently, the analysis pipeline was defined and executed as detailed in Section 3, using the artifacts outlined in Section 4.2.3.

After that, the analysis results were obtained based on the previously established metrics (see Section 4.2.2). It is worth noting that these findings serve to address the research questions posed in this study.

The results related to this empirical evaluation will be presented in the next section.

5. Results and Discussion

This section presents the results of the empirical evaluation previously described. It is divided into three parts: (i) exploratory data analysis; (ii) readability assessment; and (iii) KGs generation.

5.1. Exploratory Data Analysis

The EDA begins by analyzing the proposals’ metadata. The purpose of this analysis is to provide a detailed understanding of the structure of the dataset, such as the yearly distribution of proposals, thematic areas, proposal types, character count, word length, and word count. It is worth mentioning that the dataset observations are in Brazilian Portuguese, but they were translated into English for clarity. The legislative terms were standardized using the Brazilian National Congress Glossary [57].

As previously mentioned, the dataset includes proposals that require analysis of analogous matters, which can be observed by grouping them by type, as illustrated in Figure 2. Among these, the Bill of Law (“Projeto de Lei”) category dominates the dataset, with 1396 occurrences. Since the Bill of Law represents the typical legislative act, characterized by its generality and abstraction [58], it frequently involves the analysis of similar matters. It is expected to constitute the majority of legislative proposals.

In contrast, other proposal types—such as Constitutional Amendment Bills (“Projetos de Emenda Constitucional”), Bills of Supplementary Law (“Projetos de Lei Complementar”), or Bills of Legislative Decree (“Projetos de Decreto Legislativo”)—are more specialized and arise only when needed. Requests (“Requerimentos”) rank as the second most common type of proposal in Figure 2, indicating their procedural nature and frequent use in initiating legislative actions. Although Requests are the most common type of proposal internally at ALRN, this is not reflected in the chart because the dataset focuses on proposals with substantial legislative content, which require the analysis of textual similarity. As a result, only Requests related to solemn sessions or parliamentary fronts were included in the dataset. Finally, the Bills of Resolution (“Projetos de Resolução”) rank third, primarily addressing internal legislative matters.

Another perspective is the distribution of legislative proposals across various thematic fields, as shown in Figure 3. Education leads with 544 proposals, followed by Work (435) and Public Administration (236), reflecting their strategic positions on the legislative agenda. Other notable areas include Tribute (158), Health (145), and Human Rights (52), reflecting diverse legislative priorities. On the other hand, themes such as Tourism, Animal Cause, and Sport show minimal legislative focus, with only two proposals each. This distribution suggests that fundamental societal issues, such as education and labor, are prioritized, while areas with narrower scopes or less immediate legislative impact receive relatively less attention.

Considering a temporal dimension, Figure 4 presents a time series analysis of legislative proposals from 2019 to 2024, highlighting variations in legislative activity over the years. Overall, activity grew from 212 proposals in 2019 to 416 in 2021 and peaked at 477 in 2023, although the growth was not uniform. The slight decrease in 2020 aligns with the disruptions caused by the COVID-19 pandemic, which likely diverted legislative processes and priorities. The decline in 2022, in turn, may reflect parliamentarians’ focus on electoral campaigns during the deputies’ election year. Legislative activity rebounded in 2023, implying renewed momentum following the elections. The notable drop to 118 in 2024 is due to the dataset covering only the period up to April, whereas previous years encompass the entire calendar year. This analysis highlights the impact of external events, electoral cycles, and other factors on legislative dynamics over time.

Additionally, Figure 5 presents a time series analysis of the top three thematic areas of legislative proposals from 2019 to 2024. Although Education was the leading thematic area in most years, Work took over in 2023, indicating a shift in legislative focus. The higher number of Work proposals in that year suggests more vigorous legislative attention to labor-related topics, probably because of the socioeconomic context or political priorities after the 2022 elections. Health emerged in 2020, likely in response to the COVID-19 pandemic, which placed health concerns at the forefront of debates among lawmakers. Public Administration appears in 2020, 2021, and 2023 with the same volume of proposals in each of these years, reflecting sustained attention to matters of governance and administration. Tributes have appeared since 2022, highlighting an increasing legislative priority on ceremonial and recognition-related discussions.

Turning to the textual statistics of the legislative proposals, Figure 6 presents the boxplot and histogram of character counts for all texts after the sanitization procedure. The histogram reveals right-skewness, with 97% of entries below 10,000 characters. However, the tail of the distribution shows documents with much higher character counts, indicating that some verbose proposals deviate significantly from the majority. An example of an outlier is Bill of Law 200/2023, which establishes the State Solid Waste Policy and contains 60,094 characters. The boxplot complements this analysis by summarizing the tendencies and variability of the data. The mean character count (2675), shown by the green triangle, exceeds the median (1552), suggesting outlier influence. The boxplot omits these outliers to focus on the interquartile range (IQR), but the elevated mean relative to the median and its proximity to the upper quartile (2805) suggest that some proposals have very high character counts.
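To make the relationship between quartiles, mean, and outliers concrete, the following sketch flags outliers with the common 1.5 × IQR rule. The rule itself is an assumption here (it is the usual boxplot convention, but the whisker definition used in Figure 6 is not stated), and the sample is a toy list that merely reuses a few figures quoted above rather than the actual dataset.

```python
import statistics


def iqr_outliers(values: list[float]) -> list[float]:
    """Return values outside the 1.5 * IQR fences (assumed boxplot convention)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Toy character counts; 60094 mirrors the length reported for Bill of Law 200/2023.
char_counts = [500, 900, 1200, 1552, 1900, 2805, 60094]

outliers = iqr_outliers(char_counts)
mean_count = statistics.mean(char_counts)
median_count = statistics.median(char_counts)
print(outliers, round(mean_count, 2), median_count)
```

A single very long document is enough to pull the mean well above the median, which is the pattern described for the full corpus.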

Figure 7 compares the character counts across proposal types, using boxplots to show their distributions. Bills of Supplementary Law stand out with considerable variability in character count, ranging from 695 to 76,455 characters. This wide range suggests that while many proposals are concise (the median is around 3390 characters), a significant subset is exceptionally long, apparently reflecting the diverse nature of legislative content within this proposal type. This category’s mean character count (green triangle) is noticeably high (9482 characters), further emphasizing the presence of outliers. For Bills of Resolution, the mean exceeds the 75th percentile, indicating a subset of significantly lengthy proposals. Similarly, Bills of Law show a mean close to, but still below, the 75th percentile, suggesting occasional outliers. Requests, as expected, are usually succinct, but the distribution still shows a moderate skew towards texts longer than the mean of 1930 characters. Constitutional Amendment Bills exhibit the least variability in character counts. Finally, Bills of Legislative Decree show moderate variability, with their mean character count aligned with the median value.

Considering a word-level analysis, Figure 8 presents the boxplot and histogram of word counts for all texts after the sanitization procedure. The distribution mirrors character-count trends, showing positive skewness and a heavy tail. The boxplot shows the same pattern, with the mean (427 words) close to the upper quartile value (457 words), revealing the significant presence of outliers.

In addition, from another perspective, Figure 9 indicates that the average word length (after removing stop words) spans from 5.5 to 9 characters, with both the median and mean being around seven characters long. This central tendency suggests that, in general, the words in these texts have moderate length, reflecting a possible degree of formality, technicality, or lexical richness. Such features are consistent with the linguistic characteristics of Brazilian Portuguese, which often utilizes compound words, descriptive expressions, and technical terminology, particularly in formal or structured documents.
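The average word length statistic can be sketched as follows. The tokenizer and the small stop-word set are assumptions made for illustration; the stop-word list actually used in the study is not specified in this section.

```python
import re

# Hypothetical, deliberately tiny stop-word set for Brazilian Portuguese.
STOP_WORDS = {"o", "a", "de", "do", "da", "e", "nas"}


def average_word_length(text: str, stop_words: set[str]) -> float:
    """Mean length (in characters) of alphabetic tokens that are not stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # letters only, accents included
    content_tokens = [t for t in tokens if t not in stop_words]
    if not content_tokens:
        return 0.0
    return sum(len(t) for t in content_tokens) / len(content_tokens)


example = "Institui o Programa de Odontologia Preventiva nas escolas"
avg_length = average_word_length(example, STOP_WORDS)
print(avg_length)
```

Removing stop words matters because short function words such as "o" and "de" would otherwise pull the average down.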

In summary, this subsection provided a multi-dimensional exploratory analysis of the legislative corpus. It revealed that Bills of Law are the predominant proposal type (RQ3), with Education and Work as the most legislated themes (RQ4). Temporal trends reflected the impacts of external events, such as the COVID-19 pandemic and election cycles, and shifts in focus on specific policy areas (RQ5). Textual statistics showed high variability and frequent verbosity in proposals, with notable outliers across types (RQ6). These findings offer a foundational understanding of the corpus and support later analyses on readability and knowledge graph extraction.

5.2. Readability Assessment

Building on the ALRN proposal characterization, this section presents the results of the readability assessment conducted to evaluate legislative texts. The purpose is to analyze the textual complexity of the legislative texts using the Flesch Reading Ease and Dale–Chall formulas, thus providing insights into their readability and public comprehension.

Figure 10 illustrates the distribution of Flesch Reading Ease scores using a boxplot and a histogram. As mentioned in Section 4.2.2, the Flesch Reading Ease index measures text readability, with higher values generally suggesting easier-to-read texts. The boxplot indicates a median readability score of approximately 35, with an interquartile range (IQR) ranging from 23 to 43.

The histogram (right) further supports this trend, with most scores clustered between 20 and 50, reinforcing the conclusion that the texts are generally complex to read. Additionally, negative scores indicate texts of high complexity, often marked by dense content, complex sentence structures, and a high share of polysyllabic words.

From another perspective of readability analysis—the use of the adapted Dale–Chall index—Figure 11 presents a boxplot and a histogram illustrating the distribution of scores for this indicator. The boxplot reveals a median score of approximately 12.3, with an IQR ranging from 11.9 to 12.7, indicating that most texts fall within a relatively narrow range of difficulty levels. Notably, the mean score is slightly higher than the median, suggesting a right-skewed distribution.

In addition, this observation is further corroborated by the histogram, which shows that most scores are clustered between 10 and 15, with a few outliers extending beyond this range. As explained in Section 4.2.2, the adapted Dale–Chall metric assesses readability by evaluating the proportion of difficult words and the complexity of sentence structures. It is also aligned with grade-level equivalencies, and the scores within this scale suggest that the texts in the dataset require the reading proficiency of a graduate student.

Both indicators converge on a key finding: Legislative texts exhibit challenging readability, consistent with their formal, technical nature (RQ7). While the Flesch scores underscore the texts’ complexity, with median values indicating a high difficulty level and negative scores in some cases, the Dale–Chall index further emphasizes the necessity for advanced reading skills, with median scores corresponding to graduate-level proficiency.

In addition, these results are consistent with the formal and technical nature of legislative documents, characterized by dense structures, specialized terminology, and lengthy sentences. Together, these metrics highlight the barriers to intelligibility in legislative texts, underscoring their specialized audience and the necessity of targeted efforts to enhance their comprehensibility for broader public engagement. This point also underscores the potential need for future applications that not only extract structured knowledge but also make it understandable for non-experts, in line with recent proposals for understandable AI (uAI) [59]. Thus, this assessment offers a foundation for exploring future applications that bridge the gap between technical legal language and human-centered AI interfaces.

5.3. Generated KGs

This section presents two examples of KGs generated from legislative texts using the LLM. The goal is to demonstrate the viability of the approach in extracting structured knowledge from legislative texts for visualizing entities and their relations.

The first KG derives from a Bill of Resolution conferring an honorary citizenship title; the second derives from a Bill of Law creating a preventive dentistry program in public schools. These graphs illustrate the model’s ability to extract structured representations from unstructured legal documents, allowing for a visual depiction of entities and the relationships between them.

As previously mentioned, the first example refers to Bill of Resolution 31/2019, which grants an Honorary Title of Citizen of Rio Grande do Norte to the public figure and psychiatrist Dr. Antônio Geraldo. Figure 12 illustrates the knowledge graph generated for this resolution, which comprises 28 nodes (entities) and 36 edges (relationships). Moreover, the language model identified eight node labels (entity types): Person, Honorary Title, Organization, Law, Location, Occupation, Event, and Theme. Regarding the edges, the chosen model captured the relationships relevant to the legal context, such as “approved” (“aprovou”) and “to enact” (“promulgar”), as well as connections to the honoree’s background, such as “born_in” (“nascido_em”) and “graduated_in” (“formado_em”).

In more detail, the graph represents relevant biographical information about Dr. Antônio Geraldo, originally provided in the justification section of the Bill of Resolution. The model identified entities such as his place of birth, Grão-Mongol/MG (relationship: “born_in”/“nascido_em”), and the city where he graduated, Montes Claros (relationship: “graduated_in”/“formado_em”). It also includes his occupation as a psychiatrist (relationship: “is”/“é”) and the institution he presided over, the Brazilian Association of Psychiatry (relationship: “was_presided_by”/“ser_presidido_by”). Additionally, the graph captures events in which he participated, such as the I Symposium on Chemical Dependency and the XXX Brazilian Congress of Psychiatry (relationship: “participated_in”/“participou_de”). These relationships illustrate how the language model effectively extracted semantically rich information beyond the legal scope, providing a structured representation of the honoree’s background.
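As a rough sketch of how such triples could be persisted in Neo4j, the snippet below turns (entity, relationship, entity) tuples into Cypher `MERGE` statements. This is an assumed illustration, not the code used in the study: the helper names, the `name` property, and the choice of one statement per triple are hypothetical, and only entities and labels quoted above are used.

```python
def cypher_escape(value: str) -> str:
    """Escape backslashes and single quotes for a single-quoted Cypher string."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def triple_to_cypher(subject: str, subject_label: str, relation: str,
                     obj: str, object_label: str) -> str:
    """Build an idempotent Cypher statement for one extracted triple."""
    # Backticks allow labels and relationship types with accents or spaces.
    return (
        f"MERGE (s:`{subject_label}` {{name: '{cypher_escape(subject)}'}}) "
        f"MERGE (o:`{object_label}` {{name: '{cypher_escape(obj)}'}}) "
        f"MERGE (s)-[:`{relation}`]->(o)"
    )


triples = [
    ("Dr. Antônio Geraldo", "Person", "nascido_em", "Grão-Mongol/MG", "Location"),
    ("Dr. Antônio Geraldo", "Person", "formado_em", "Montes Claros", "Location"),
    ("Dr. Antônio Geraldo", "Person", "participou_de",
     "XXX Brazilian Congress of Psychiatry", "Event"),
]

statements = [triple_to_cypher(*triple) for triple in triples]
for statement in statements:
    print(statement)
```

Using `MERGE` rather than `CREATE` means that re-running the statements does not duplicate nodes that share the same label and name.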

Another point to consider is that the highest-degree node (27 relations) corresponds to the central entity addressed in the Bill of Resolution, Dr. Antônio Geraldo, highlighting the graph’s alignment with the text’s primary focus. Additionally, the LLM demonstrated an ability to infer other entity types and relationships that were not explicitly specified in the prompt but were contextually consistent with the text. This indicates that the graph provides a coherent representation of the source material.
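The notion of a highest-degree node can be illustrated with a few lines of Python. The edge list below is only a small subset of the relationships quoted above (not the full 36-edge graph), and the function name is hypothetical.

```python
from collections import Counter


def node_degrees(edges: list[tuple[str, str, str]]) -> Counter:
    """Count how many edges touch each node, ignoring edge direction."""
    degrees: Counter = Counter()
    for source, _relation, target in edges:
        degrees[source] += 1
        degrees[target] += 1
    return degrees


edges = [
    ("Dr. Antônio Geraldo", "nascido_em", "Grão-Mongol/MG"),
    ("Dr. Antônio Geraldo", "formado_em", "Montes Claros"),
    ("Dr. Antônio Geraldo", "participou_de", "I Symposium on Chemical Dependency"),
    ("Dr. Antônio Geraldo", "participou_de", "XXX Brazilian Congress of Psychiatry"),
]

central_node, central_degree = node_degrees(edges).most_common(1)[0]
print(central_node, central_degree)
```

In a graph centered on a single honoree, nearly every edge touches that node, so its degree approaches the total number of edges.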

However, limitations include occasional misspellings in Portuguese relationship labels and unintended code-switching to other languages. Furthermore, the model failed to identify all entities and relationships in the text, highlighting areas where its performance could be improved.

The second example is related to the Bill of Law 94/2024, which establishes the Preventive Dentistry Program (“Programa Sorriso POPE”) in State Schools. Its corresponding KG consists of 31 nodes and 29 edges (see Figure 13). The Llama model again identified eight entity types: Person, Program, Location, Organization, Target Audience, Event, Resource, and Time. In terms of edges, the model also captured relations from the legal scenario, such as “to_decree” (“decretar”) and “establish” (“institui”), as well as those specifically related to the law itself, such as “develop_activities” (“desenvolver_ações”) and “collaborate” (“colaborar”).

In more detail, this KG includes several interconnected elements that describe how the “Programa Sorriso POPE” will be structured and implemented, based on the provisions of the bill’s articles. For instance, the graph identifies the State Schools of Rio Grande do Norte (relationship: “establish”/“institui”) as the primary locations where the program will be executed. It also depicts the multifaceted relationships between the program and the involved organizations, such as universities, public and private entities, and the Regional College of Dentistry, represented by terms like “collaborate”/“colaborar”, “partnership”/“parceria”, and “agreement”/“convênio”. Furthermore, the model captured the program’s key activities, including lectures, debates, distribution of oral hygiene kits, and topical fluoride application (relationship: “develop_activities”/“desenvolver_ações”). Finally, the graph identifies the students from the first year of elementary school to the third year of high school as the target audience (relationship: “to_target”/“ter_público_alvo”). This representation shows how the model effectively extracts and structures complex normative content, making explicit both the program’s stakeholders and its operational actions.

Once again, the central node of the graph, representing the “Programa Sorriso POPE”, has the highest degree (28 relations), indicating its alignment with the main idea of the legislation. Similarly, the model was also able to generate new entity types and relationships not included in the prompt and did not produce spelling errors. Nevertheless, the language model did not fully disambiguate semantically similar entities, leading to a duplication in which students (the target audience of the law) were represented as two distinct entities: (1) the general term “students” (“aluno(s)”) and (2) a more specific label describing their grade levels, “students from the 1st year of elementary school to the 3rd year of high school” (“Alunos do 1º ano do ensino fundamental até o 3º ano do ensino médio”). This duplication led to a portion of the KG becoming disconnected, creating isolated nodes.

Though advanced entity linking/clustering was beyond this study’s scope, the LLM’s prompt included explicit disambiguation instructions, and a method based on Levenshtein distance was applied during the postprocessing phase to merge entities with lexical matches. However, these preliminary efforts were insufficient to resolve more complex cases of semantic duplication, as illustrated in the previous example. Addressing this limitation by applying more advanced disambiguation methods (e.g., those based on semantic similarity measures, contextual embeddings, or rule-based systems) represents a fertile ground for future research, especially in domain-specific legal texts where precision is critical.
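The behavior described above can be reproduced with a small sketch of Levenshtein-based merging. The normalization, the greedy merging strategy, and the 0.8 similarity threshold are assumptions chosen for illustration; the postprocessing parameters actually used in the study are not given in this section.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1], normalized by the longer string."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def merge_entities(names: list[str], threshold: float = 0.8) -> dict[str, str]:
    """Greedily map each name to the first earlier name that is similar enough."""
    canonical: list[str] = []
    mapping: dict[str, str] = {}
    for name in names:
        match = next((c for c in canonical if similarity(name, c) >= threshold), None)
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping


long_label = "Alunos do 1º ano do ensino fundamental até o 3º ano do ensino médio"
mapping = merge_entities(["aluno", "alunos", long_label])
print(mapping)
```

The singular and plural forms are merged because they differ by one character, whereas the long descriptive label stays separate. This is the kind of semantic duplicate that purely lexical matching cannot resolve.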

In summary, the presented examples demonstrate the feasibility of using LLMs to extract structured knowledge from Brazilian Portuguese legislative texts (RQ1). The generated knowledge graphs captured relevant entities and relationships, aligned with the texts’ central themes, and illustrated the model’s capacity to structure legal and contextual information. While some challenges remain, particularly in entity disambiguation, these results reinforce the potential of prompt-based LLM methods for semantic extraction. Together with the EDA and readability assessments, this section showed how the proposed methodology supports corpus understanding and structured knowledge generation in underrepresented legal contexts (RQ2).

6. Threats to Validity

The threats to the validity of this work are discussed in more detail in the following subsections, which cover sampling bias, threats to internal, construct, and external validity, as well as AI biases and hallucinations.

6.1. Sampling Bias

A risk associated with the sample provided by ALRN is the possibility that relevant documents may not be included in the data. To mitigate this problem, the researchers curated a dataset with documents spanning diverse legislative topics relevant to public interest.

6.2. Internal Validity

Although we used a state-of-the-art LLM to extract entities and their relationships from legislative documents, the chosen model may struggle with nuanced legal tasks (e.g., disambiguating semantically similar entities). Thus, to attenuate this threat, a postprocessing step was added to correct inconsistencies in the KG generation process.

6.3. Construct Validity

Some readability metrics may overlook legal language nuances (e.g., domain-specific terminology), leading to inaccurate assessments. Therefore, an extensive analysis of state-of-the-art readability metrics was conducted to mitigate this threat and to determine the most appropriate metrics for the Brazilian Portuguese language.

6.4. External Validity

An EDA and readability evaluation were carried out on the corpus of a single legislative institution. Therefore, findings may not generalize to all legislative texts or contexts beyond the studied dataset. Additionally, due to cultural and linguistic differences, these findings may not apply to legislative texts in other languages. Consequently, it is not possible to generalize the conclusions in order to fully validate the proposed methodology’s effectiveness.

However, the results are relevant to future investigations on LLM-supported KG extraction. Even though the findings may not generalize, the methodology’s core framework, which combines NLP-driven analysis and knowledge graph extraction, has potential for adaptation to other Romance languages, such as Spanish or Italian. These languages share grammatical and syntactic features with Brazilian Portuguese, which could facilitate the application of similar pre-processing steps and language model prompts. Furthermore, many Romance language countries adopt civil law traditions with comparable legislative document structures, which may facilitate the adaptation of knowledge graph extraction workflows. Nevertheless, specific adjustments would be required to address differences in legal terminology, institutional roles, and discourse conventions in each legal system. This opens an avenue for future cross-linguistic and cross-jurisdictional studies that apply and evaluate the proposed methodology in broader contexts.

6.5. AI Biases and Hallucinations

This research highlights the potential for using artificial intelligence in analyzing legislative texts; however, it is essential to acknowledge the inherent biases and limitations associated with the use of pre-trained AI models. These include the biases inherent in the model’s training data and the potential for model hallucinations, both of which can impact the reliability and fairness of outcomes generated by AI [3,8].

The Llama 3.2 3B Instruct model used in the current study, while state-of-the-art, is an LLM trained on massive amounts of data, which may be biased [3]. Since legal concepts evolve over different periods and geographies, legal datasets, which are typically limited in size and biased towards more common document types, may overlook subtle changes in legislation [8]. If the training data consists predominantly of legislation from certain geographical regions, legal traditions, or periods, the model may struggle to adapt to other settings, including Brazilian legislative materials. This might lead to a biased extraction of entities or relationships, especially within linguistically or culturally challenging environments such as Brazilian Portuguese. The model was not specifically fine-tuned for the current task, so the biases inherited from pre-training could still affect the results.

In addition, LLMs are prone to generating inaccurate information, a phenomenon known as hallucination [38]. In the context of legislative document analysis, this can manifest as the creation of non-existent entities, relations, or legal interpretations that have no basis in the source material [3]. As a result, hallucinations continue to pose a significant challenge to the reliability of AI-generated results.

While many techniques exist to mitigate biases and hallucinations, addressing these problems is beyond the scope of this study. The limitations in controlling these variables may have affected the outcomes, particularly in terms of the coherence and relevance of the extracted entities and relationships. Therefore, the strategies to resolve these issues will be addressed in future studies associated with this work.

7. Conclusions and Future Work

This paper proposes a data-oriented and LLM-supported methodology for extracting insights and generating KGs from Brazilian legislative texts. It aims to analyze legal documents within their proper linguistic context and leverage them as valuable research data. The proposed approach starts with a text sanitization step to create a cleaned and normalized corpus. Then, in the KG extraction phase, an LLM analyzes the corpus to identify entities and relationships. The information is refined in a postprocessing step for consistency, and code is generated to store the KG in a graph database. Complementarily, exploratory data analysis and readability assessment can be performed to provide insights by revealing patterns, distributions, and complexity in the text.

The methodology was validated through a case study using a dataset of 1869 proposals from the Legislative Assembly of Rio Grande do Norte (ALRN). EDA revealed significant patterns, including the predominance of specific proposal categories and topics over time, as well as changes in textual lengths and statistical distributions. Furthermore, the readability assessment showed inherent complexity in these legislative documents, indicating that they require advanced reading skills due to their formal nature, technical vocabulary, and dense structure. Finally, applying the Llama 3.2 3B Instruct model to extract knowledge graphs demonstrated its potential to transform legislative texts into a flexible, actionable data format. This structured representation of data enables several applications, such as analyzing similarities between legislative proposals, applying GraphRAG (graph retrieval-augmented generation) to information retrieval, simplifying texts, and designing advanced support tools for legislation.

In this respect, the primary contribution of this research is to enrich the structure of legislative data to enable more sophisticated analyses and applications in legal artificial intelligence for underrepresented languages, such as Brazilian Portuguese. Accordingly, the three main contributions of this paper are as follows: first, the presentation of a curated dataset; second, the application of exploratory data analysis and readability metrics tailored for legislative texts; and third, an LLM-supported methodology to extract structured information as knowledge graphs from legislative texts.

This research acknowledges its limitations, particularly in entity disambiguation and the completeness of relationships during knowledge graph extraction. These issues call for more advanced methods in subsequent studies, which can include data annotation for label acquisition, enabling quantitative analysis, human validation, and domain-specific fine-tuning to mitigate biases and reduce hallucinations. Comparing different prompt engineering strategies, such as few-shot learning and self-reflection, could further refine entity and relationship extractions by enabling in-context learning. Additionally, exploring alternative models may also yield further performance improvements, resulting in even more contextually sound and coherent outputs. Extending the experiments to new legislative categories and evolving domains over time is also a path that can provide deeper insights into how legal frameworks develop and interconnect. This means that future research in this area can build on such pilot efforts and continue to develop graph-based reasoning more profoundly to enhance the overall quality of legislative analysis.

Author Contributions

Conceptualization, G.L.A.d.O., B.S.S. and I.S.; methodology, G.L.A.d.O.; software, G.L.A.d.O.; validation, G.L.A.d.O. and I.S.; formal analysis, G.L.A.d.O.; investigation, G.L.A.d.O.; resources, G.L.A.d.O., I.S. and B.S.S.; data curation, G.L.A.d.O.; writing—original draft preparation, G.L.A.d.O., B.S.S., M.S. and I.S.; writing—review and editing, G.L.A.d.O., B.S.S., M.S. and I.S.; visualization, G.L.A.d.O.; supervision, I.S.; project administration, I.S.; funding acquisition, I.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are available on Mendeley Data [14].

Acknowledgments

The authors gratefully acknowledge the Legislative Assembly of Rio Grande do Norte (ALRN) for providing the legislative data used in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

NLP	Natural Language Processing
KG	Knowledge Graph
LLMs	Large Language Models
ALRN	Legislative Assembly of Rio Grande do Norte
EDA	Exploratory Data Analysis
NER	Named Entity Recognition
RE	Relation Extraction
ETL	Extract, Transform, and Load

Appendix A

References

Chamber of Deputies, Brazil. The Legislative Branch. 2024. Available online: https://www2.camara.leg.br/english/papellegislativo.html (accessed on 2 January 2025).
Federal Senate, Brazil. Legislative Documents and Public Access. 2024. Available online: https://www12.senado.leg.br/institucional/carta-de-servicos/en/carta-de-servicos (accessed on 2 January 2025).
Anh, D.H.; Do, D.T.; Tran, V.; Minh, N.L. The Impact of Large Language Modeling on Natural Language Processing in Legal Texts: A Comprehensive Survey. In Proceedings of the 15th International Conference on Knowledge and Systems Engineering (KSE), Hanoi, Vietnam, 18–20 October 2023; pp. 1–7.
Alves, A.; Miranda, P.; Mello, R.; Nascimento, A. Automatic Simplification of Legal Texts in Portuguese Using Machine Learning. In Legal Knowledge and Information Systems; IOS Press: Amsterdam, The Netherlands, 2023; pp. 281–286.
Albuquerque, H.O.; Souza, E.; Gomes, C.; Pinto, M.H.d.C.; Ricardo Filho, P.; Costa, R.; Lopes, V.T.d.M.; da Silva, N.F.; de Carvalho, A.C.; Oliveira, A.L. Named Entity Recognition: A Survey for the Portuguese Language. Proces. Leng. Nat. 2023, 70, 171–185.
Moreira Valle, L.; Giacomazzi Dantas, S.; Guerreiro e Silva, D.; Silva Dias, U.; Monteiro Monasterio, L. RegBR: A novel Brazilian government framework to classify and analyze industry-specific regulations. PLoS ONE 2022, 17, e0275282.
Fitsilis, F.; Mikros, G. Smart Parliaments: Data-Driven Democracy; European Liberal Forum: Ixelles, Belgium, 2022.
Lai, J.; Gan, W.; Wu, J.; Qi, Z.; Yu, P.S. Large language models in law: A survey. AI Open 2024, 5, 181–196.
Negro, A. Graph-Powered Machine Learning; Manning Publications Co.: Shelter Island, NY, USA, 2021.
Schneider, P.; Schopf, T.; Vladika, J.; Galkin, M.; Simperl, E.; Matthes, F. A Decade of Knowledge Graphs in Natural Language Processing: A Survey. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online only, 20–23 November 2022; pp. 601–614.
Wu, L.; Chen, Y.; Shen, K.; Guo, X.; Gao, H.; Li, S.; Pei, J.; Long, B. Graph Neural Networks for Natural Language Processing: A Survey. Found. Trends Mach. Learn. 2023, 16, 119–328.
Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; Yu, P.S. A Survey on Knowledge Graphs: Representation, Acquisition, and Applications. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 494–514.
Liang, X.; Wang, Z.; Li, M.; Yan, Z. A survey of LLM-augmented knowledge graph construction and application in complex product design. Procedia CIRP 2024, 128, 870–875.
Alves, G.; Santos, B.S.; Silva, M.; Silva, I. Brazilian Portuguese Legislative Documents: A Dataset from the Legislative Assembly of Rio Grande do Norte; Mendeley Data, Version 1; Universidade Federal do Rio Grande do Norte: Natal, Brazil, 2025.
Palmirani, M.; Vitali, F.; Van Puymbroeck, W.; Nubla Durango, F. Legal Drafting in the Era of Artificial Intelligence and Digitisation; European Commission: Brussels, Belgium, 2022.
Souza, E.; Moriyama, G.; Vitório, D.; Carvalho, A.C.P.L.F.d.; Félix, N.; Albuquerque, H.O.; Oliveira, A.L.I. Assessing the Impact of Stemming Algorithms Applied to Brazilian Legislative Documents Retrieval. In Proceedings of the Anais do Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana (STIL), Online, 29 November–3 December 2021; SBC: Brisbane, QLD, USA, 2021; pp. 227–236. [Google Scholar] [CrossRef]
Albuquerque, H.O.; Costa, R.; Silvestre, G.; Souza, E.; da Silva, N.F.F.; Vitório, D.; Moriyama, G.; Martins, L.; Soezima, L.; Nunes, A.; et al. UlyssesNER-Br: A Corpus of Brazilian Legislative Documents for Named Entity Recognition. In Proceedings of the Computational Processing of the Portuguese Language, Fortaleza, Brazil, 21–23 March 2022; pp. 3–14. [Google Scholar] [CrossRef]
Rocha, F.C.; Souza, E.; Vitório, D.; Silva, N.F.F.d.; Carvalho, A.C.P.L.F.d.; Oliveira, A.L.I. Avaliação de frameworks para Recuperação de Documentos Legislativos: Um Estudo de Caso na Câmara dos Deputados Brasileira. In Proceedings of the Anais do Workshop de Computação Aplicada em Governo Eletrônico (WCGE), João Pessoa, Brazil, 6–11 August 2023; SBC: Brisbane, QLD, USA, 2023; pp. 224–231. [Google Scholar] [CrossRef]
Souza, E.; Vitório, D.; Moriyama, G.; Santos, L.; Martins, L.; Souza, M.; Fonseca, M.; Félix, N.; Carvalho, A.C.; Albuquerque, H.O.; et al. An Information Retrieval Pipeline for Legislative Documents from the Brazilian Chamber of Deputies. In Legal Knowledge and Information Systems; Schweighofer, E., Ed.; Frontiers in Artificial Intelligence and Applications; IOS Press: Amsterdam, The Netherlands, 2021. [Google Scholar] [CrossRef]
Vitório, D.; Souza, E.; Martins, L.; da Silva, N.F.F.; de Leon Ferreira de Carvalho, A.C.P.; Oliveira, A.L.I. Ulysses-RFSQ: A Novel Method to Improve Legal Information Retrieval Based on Relevance Feedback. In Intelligent Systems, Proceedings of the 11th Brazilian Conference, Campinas, Brazil, 28 November–1 December 2022; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2022; pp. 77–91. [Google Scholar] [CrossRef]
Vitório, D.; Souza, E.; Martins, L.; da Silva, N.F.F.; de Carvalho, A.P.d.L.; Oliveira, A.L.I.; de Andrade, F.E. Building a Relevance Feedback Corpus for Legal Information Retrieval in the Real-Case Scenario of the Brazilian Chamber of Deputies. Lang. Resour. Eval. 2024, 59, 1257–1277. [Google Scholar] [CrossRef]
Maia, D.F.; Silva, N.F.F.; Souza, E.P.R.; Nunes, A.S.; Procópio, L.C.; Sampaio, G.d.S.; Dias, M.d.S.; Alves, A.O.; Maia, D.F.; Ribeiro, I.A.; et al. UlyssesSD-Br: Stance Detection in Brazilian Political Polls. In Progress in Artificial Intelligence, Proceedings of the 21st EPIA Conference on Artificial Intelligence, Lisbon, Portugal, 31 August–2 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 85–95. [Google Scholar] [CrossRef]
Silva, N.F.F.d.; Silva, M.C.R.; Pereira, F.S.F.; Tarrega, J.P.M.; Beinotti, J.V.P.; Fonseca, M.; Andrade, F.E.d.; de Carvalho, A.C.P.d.L.F. Evaluating Topic Models in Portuguese Political Comments About Bills from Brazil’s Chamber of Deputies. In Proceedings of the Intelligent Systems: 10th Brazilian Conference, BRACIS 2021, Virtual Event, 29 November–3 December 2021; pp. 104–120. [Google Scholar] [CrossRef]
Cifuentes-Silva, F.; Labra Gayo, J.E. Legislative Document Content Extraction Based on Semantic Web Technologies. In Proceedings of the Semantic Web (ESWC 2019), 16th International Conference, Portorož, Slovenia, 2–6 June 2019; pp. 558–573. [Google Scholar] [CrossRef]
Colombo, A.; Bernasconi, A.; Ceri, S. Modelling Legislative Systems into Property Graphs to Enable Advanced Pattern Detection. arXiv 2024, arXiv:2406.14935. [Google Scholar] [CrossRef]
Oliveira, F.d.; Oliveird, J.M.P.d. A RDF-based graph to representing and searching parts of legal documents. Artif. Intell. Law 2023, 32, 667–695. [Google Scholar] [CrossRef]
Bianchini, F.; Calamo, M.; De Luzi, F.; Macrì, M.; Mecella, M. A Service-Based Pipeline for Complex Linguistic Tasks Adopting LLMs and Knowledge Graphs. In Service-Oriented Computing, Proceedings of the 18th Symposium and Summer School, SummerSOC 2024, Crete, Greece, 24–29 June 2024; Springer: Berlin/Heidelberg, Germany, 2025; pp. 145–161. [Google Scholar] [CrossRef]
Colombo, A. Leveraging Knowledge Graphs and LLMs to Support and Monitor Legislative Systems. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 5443–5446. [Google Scholar] [CrossRef]
Gao, S.; Li, Y.; Ge, F.; Lin, M.; Yu, H.; Wang, S.; Miao, Z. LeGalFormer: A Graph Representation Learning and Transformer-based Approach for Legal Similar Case Retrieval. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–9. [Google Scholar] [CrossRef]
Li, J.; Qian, L.; Liu, P.; Liu, T. Construction of Legal Knowledge Graph Based on Knowledge-Enhanced Large Language Models. Information 2024, 15, 666. [Google Scholar] [CrossRef]
Shi, J.; Guo, Q.; Liao, Y.; Wang, Y.; Chen, S.; Liang, S. Legal-LM: Knowledge Graph Enhanced Large Language Models for Law Consulting. In Proceedings of the Advanced Intelligent Computing Technology and Applications. Springer Nature Singapore, Tianjin, China, 5–8 August 2024; pp. 175–186. [Google Scholar] [CrossRef]
Speer, R. ftfy, version 5.5; Zenodo: Geneva, Switzerland, 2019. [Google Scholar] [CrossRef]
Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2009. [Google Scholar]
Keraghel, I.; Morbieu, S.; Nadif, M. Recent Advances in Named Entity Recognition: A Comprehensive Survey and Comparative Study. arXiv 2024, arXiv:2401.10825. [Google Scholar] [CrossRef]
Zhang, L.; Sun, X.; Ma, X.; Hu, K. A New Entity Relationship Extraction Method for Semi-Structured Patent Documents. Electronics 2024, 13, 3144. [Google Scholar] [CrossRef]
Bratanič, T. Graph Algorithms for Data Science; Manning Publications Co.: Shelter Island, NY, USA, 2023. [Google Scholar]
Zhu, Y.; Wang, X.; Chen, J.; Qiao, S.; Ou, Y.; Yao, Y.; Deng, S.; Chen, H.; Zhang, N. LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities. World Wide Web 2024, 27, 58. [Google Scholar] [CrossRef]
Negro, A.; Kus, V.; Futia, G.; Montagna, F. Knowledge Graphs and LLMs in Action; Manning Publications Co.: Shelter Island, NY, USA, 2025. [Google Scholar]
Rao, P.J.; Rao, K.N.; Gokuruboyina, S.; Neeraja, K. An Efficient Methodology for Identifying the Similarity Between Languages with Levenshtein Distance. In Proceedings of the 6th International Conference on Communications and Cyber Physical Engineering, Hyderabad, India, 28–29 April 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 161–174. [Google Scholar] [CrossRef]
Santos, B.S.; Silva, I.; Melo, E. Metodologia orientada a ciência de dados em grafos para avaliação de PPGs. In Proceedings of the XV Simpósio Brasileiro de Automação Inteligente (SBAI 2021), Rio Grande, Rio Grande do Sul, Brazil, 17–19 October 2021; pp. 1998–2005. [Google Scholar] [CrossRef]
Santos, B.S.; Silva, I.; Costa, D.G. Symmetry in Scientific Collaboration Networks: A Study Using Temporal Graph Data Science and Scientometrics. Symmetry 2023, 15, 601. [Google Scholar] [CrossRef]
Legislative Assembly of Rio Grande do Norte - ALRN. Unale 2024: Director of Technology Management Presents Advances in Artificial Intelligence. 2024. Available online: https://www.al.rn.leg.br/noticia/31558/unale-2024-diretor-de-gestao-tecnologica-apresenta-avancos-em-inteligencia-artificial (accessed on 20 December 2024).
Meta AI. Llama 3.2 Model Card. 2024. Available online: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct (accessed on 16 December 2024).
Meta AI. Llama 3.2: Advancing AI for Vision and Language at the Edge and Beyond. 2024. Available online: https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/ (accessed on 16 December 2024).
Robinson, I.; Webber, J.; Eifrem, E. Graph Databases: New Opportunities for Connected Data, 2nd ed.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2015. [Google Scholar]
Anthapu, R. Graph Data Processing with Cypher; Packt Publishing: Birmingham, UK, 2022. [Google Scholar]
Scifo, E. Graph Data Science with Neo4j; Packt Publishing: Birmingham, UK, 2023. [Google Scholar]
Martins, T.B.F.; Ghiraldelo, C.M.; Nunes, M.d.G.V.; Oliveira Júnior, O.N.d. Readability formulas applied to textbooks in brazilian portuguese. In Notas do ICMSC; Série Computação; ICMSC-USP: São Carlos, Brazil, 1996; pp. 1–15. [Google Scholar]
Leal, S.E.; Duran, M.S.; Scarton, C.E.; Hartmann, N.S.; Aluísio, S.M. NILC-Metrix: Assessing the complexity of written and spoken language in Brazilian Portuguese. Lang. Resour. Eval. 2024, 58, 73–110. [Google Scholar] [CrossRef]
Biderman, M.T.C. Dicionário Didático de Português; Editora ática: São Paulo, Brazil, 1998. [Google Scholar]
McKinney, W. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and Jupyter; O’Reilly Media: Sebastopol, CA, USA, 2022. [Google Scholar]
Döbler, M.; Groβmann, T. Data Visualization with Python; Packt Publishing Ltd.: Birmingham, UK, 2019. [Google Scholar]
Anaconda Inc. Anaconda: The Data Science Platform. 2024. Available online: https://www.anaconda.com (accessed on 10 December 2024).
Google Inc. Google Colab: Hi, This Is the Colaboratory. 2024. Available online: https://colab.research.google.com (accessed on 10 December 2024).
Tunstall, L.; von Werra, L.; Wolf, T. Natural Language Processing with Transformers; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2022. [Google Scholar]
Alves, G.; Silva, I. GitHub Repository of This Study. 2025. Available online: https://github.com/conect2ai/legislative-texts-rn (accessed on 5 January 2025).
Brazilian National Congress. Glossary of Legislative Terms, 2nd ed.; Brazilian National Congress: Brasília, Brazil, 2020; Available online: https://www.congressonacional.leg.br/legislacao-e-publicacoes/glossario-legislativo (accessed on 24 November 2024).
Legislative Assembly of the State of São Paulo. Legislative Process Manual; Legislative Assembly of the State of São Paulo: São Paulo, Brazil, 2023. Available online: https://www.al.sp.gov.br/arquivos/documentacao/estudos-e-manuais/manual-processo-legislativo/manual_proclegis_2.pdf (accessed on 19 December 2024).
Abbass, H.; Crockett, K.; Garibaldi, J.; Gegov, A.; Kaymak, U.; Sousa, J.M.C. Editorial: From Explainable Artificial Intelligence (xAI) to Understandable Artificial Intelligence (uAI). IEEE Trans. Artif. Intell. 2024, 5, 4310–4314. [Google Scholar] [CrossRef]

Figure 1. Overview of the methodology. The solid line represents the main methodology pathway, while the dashed lines indicate complementary analyses that provide insights into the data.

Figure 2. Number of proposals by type from January 2019 to April 2024. The Bill of Law comprises 74.7% of all proposals (1396 out of 1869), reflecting its role as the primary legislative instrument.

Figure 3. Number of proposals by thematic area from January 2019 to April 2024. Education dominates with 29.1% (544 proposals), followed by Work (23.3%) and Public Administration (12.6%), while areas like Animal cause, Tourism, and Sport show minimal focus with two proposals each.

Figure 4. Number of proposals by year of registration from January 2019 to April 2024. Legislative activity peaked in 2023 with 477 proposals, reflecting post-election momentum, while decreases in 2020 and 2022 align with the COVID-19 pandemic and electoral campaigns, respectively.

Figure 5. Top 3 thematic areas with the highest number of proposals per year from January 2019 to April 2024. Education leads most years, but Work overtook it in 2023, reflecting shifting legislative priorities.

Figure 6. Character count of the sanitized document texts, considering the proposals from January 2019 to April 2024. Most texts (97%) have fewer than 10,000 characters, but a right-skewed distribution reveals verbose outliers. An example of an outlier is the Bill of Law 200/2023, which establishes the State Solid Waste Policy and contains 60,094 characters, far exceeding the mean character count.

Figure 7. Character count of the sanitized document texts across proposal types, dating from January 2019 to April 2024. Bills of Supplementary Law show the greatest variability and the longest texts (mean: 9482 characters), while Requests are usually concise (mean: 1930 characters).

Figure 8. Word count of the sanitized document texts, considering the proposals from January 2019 to April 2024. The distribution is positively skewed, with the mean (427) close to the upper quartile (457), highlighting the presence of outliers.

Figure 9. Average word length of the sanitized document texts, considering the proposals from January 2019 to April 2024. The average length ranges from 5.5 to 9 characters, with both the mean and median around seven.

Figure 10. Flesch Reading Ease indices of the sanitized document texts, considering the proposals from January 2019 to April 2024. The median score is approximately 35, with most texts falling between 20 and 50, indicating a generally complex readability level.
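The Flesch Reading Ease scores summarized in Figure 10 can be illustrated with a minimal sketch. It assumes the commonly used Brazilian Portuguese adaptation of the Flesch formula, which raises the original English constant by 42 points (248.835 instead of 206.835); whether the study used exactly this variant is an assumption. The syllable counter is a crude vowel-group heuristic and the sentence splitter is naive (abbreviations such as "Art." count as sentence ends), so the scores are only indicative. All function names are hypothetical.

```python
import re

# Heuristic patterns (assumptions, not the tooling used in the study).
VOWEL_GROUP = re.compile(r"[aeiouyáàâãéêíóôõúü]+", re.IGNORECASE)
WORD = re.compile(r"[^\W\d_]+")          # runs of letters, accents included
SENTENCE_END = re.compile(r"[.!?]+")


def count_syllables(word: str) -> int:
    """Approximate syllables as the number of vowel groups (at least 1)."""
    return max(1, len(VOWEL_GROUP.findall(word)))


def flesch_pt(text: str) -> float:
    """Flesch Reading Ease with the constant adapted for Portuguese."""
    words = WORD.findall(text)
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text needs at least one word and one sentence")
    avg_sentence_len = len(words) / len(sentences)
    avg_syllables = sum(count_syllables(w) for w in words) / len(words)
    return 248.835 - 1.015 * avg_sentence_len - 84.6 * avg_syllables


simple_text = "O sol é bom. A casa é azul."
legal_text = "A regulamentação administrativa estabelece procedimentos obrigatórios."
print(flesch_pt(simple_text) > flesch_pt(legal_text))
```

Lower scores indicate harder texts, so long sentences built from polysyllabic legal vocabulary drive the index down, which is consistent with the distribution reported in the caption.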

Figure 11. Adapted Dale–Chall metric of the sanitized document texts, considering the proposals from January 2019 to April 2024. The median score of 12.3 indicates graduate-level reading difficulty.

Figure 12. Knowledge graph of the Bill of Resolution 31/2019, granting an Honorary Title of Citizen of Rio Grande do Norte. Comprising 28 nodes and 36 edges, it captures key entity types such as Person, Honorary Title, and Organization, with Dr. Antônio Geraldo, the honoree, as the central entity. The graph also encodes relevant biographical details, including his place of birth, Grão-Mongol/MG (relationship: “born_in”/“nascido_em”), and his occupation as a psychiatrist (relationship: “is”/“é”). Furthermore, it shows the institution he presided over, the Brazilian Association of Psychiatry (relationship: “was_presided_by”/“ser_presidido_by”), and events in which he participated, such as the I Symposium on Chemical Dependency (relationship: “participated_in”/“participou_de”). This figure exemplifies how the model extracts and represents both legal and biographical information within a structured semantic network.
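As a rough illustration of the structure described in Figure 12, the sketch below stores a handful of the relationships named in the caption as (subject, relation, object) triples and identifies the central entity by counting incident edges. The triple list is a small hand-written subset, the edge directions and the English node labels are assumptions, and the helper names are hypothetical; it is not the extraction pipeline itself.

```python
from collections import Counter

# Illustrative subset of the graph; directions and labels are assumed.
triples = [
    ("Dr. Antônio Geraldo", "nascido_em", "Grão-Mongol/MG"),
    ("Dr. Antônio Geraldo", "é", "psychiatrist"),
    ("Dr. Antônio Geraldo", "participou_de", "I Symposium on Chemical Dependency"),
    ("Brazilian Association of Psychiatry", "ser_presidido_by", "Dr. Antônio Geraldo"),
]


def degree(edges):
    """Count how many edges touch each node, ignoring direction."""
    counts = Counter()
    for subject, _relation, obj in edges:
        counts[subject] += 1
        counts[obj] += 1
    return counts


def central_entity(edges):
    """Return the node with the highest degree."""
    return degree(edges).most_common(1)[0][0]


print(central_entity(triples))
```

Degree is only one possible notion of centrality; it is used here because it needs no external graph library and matches the intuition of a hub node surrounded by its attributes.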

Figure 13. Knowledge graph of the Bill of Law 94/2024, which establishes the “Programa Sorriso POPE” in State Schools. The graph includes 31 nodes and 29 edges, with the program as the central entity, capturing key relationships to Target Audience, Events, and Organizations involved. It identifies key relationships such as establish (“institui”) linking the program to State Schools, and to_target (“ter_público_alvo”) connecting it to students from the 1st year of elementary school to the 3rd year of high school. Additionally, it illustrates partnerships with universities, public and private entities, and the Regional College of Dentistry through relationships like collaborate (“colaborar”), partnership (“parceria”), and agreement (“convênio”). The model also detected the program’s main activities—lectures, debates, distribution of oral hygiene kits, and topical fluoride application—under the relationship “develop_activities” (“desenvolver_ações”). This figure exemplifies how the model captures and structures the normative provisions, making explicit both the operational and institutional aspects of the legislative initiative.

Table 1. Dataset columns and their descriptions.

Feature	Description
process_number	An internal control number assigned to the legislative proposal during its processing within the legislative system.
process_year	The year when the legislative proposal was registered in the legislative process. It complements the `process_number` to ensure the proposal’s uniqueness and traceability during internal processing.
proposal_number	A unique identifier assigned to a proposal once it is officially publicized in a legislative session.
proposal_year	The year in which the proposal was officially publicized.
proposal_summary	A brief description of the proposal, summarizing its main content and providing an immediate understanding of the legislated matter.
proposal_register_date	The date when the proposal was formally registered in the legislative system.
proposal_initiative_description	The name of the entity or individual responsible for initiating the proposal.
proposal_initiative_type	The type of initiative associated with the proposal, indicating whether a parliamentarian or an internal/external entity authored it.
proposal_subject_description	A textual description of the thematic subject or area the proposal addresses (e.g., education, healthcare, infrastructure).
proposal_type	The classification of the proposal by type of normative instrument (e.g., Bill of Law, Bill of Supplementary Law, Bill of Legislative Decree, Bill of Resolution, Constitutional Amendment Bill, or Request).
doc_id	A unique identifier for the document within the legislative system.
doc_subject	Usually the document name, reflecting either the specific topic the document addresses or its type.
doc_type	The type of the document based on its purpose (e.g., Engrossed Bill, Recommendation, Memo, Communication, or descriptions similar to the `proposal_type` examples).
doc_text	Full textual content of the document with HTML formatting tags.
doc_inclusion_date_proposal	The date when the document was included within the proposal.
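The columns in Table 1 suggest two routine preparation steps: combining `process_number` and `process_year` into a composite key, and removing the HTML tags from `doc_text`. The sketch below shows one way to do both with the Python standard library only. The sample record is invented for illustration, the helper names are hypothetical, and the tag stripping is a simplification rather than the sanitization procedure used in the study.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw: str) -> str:
    """Drop tags and collapse whitespace.

    Text nodes are joined with a space so block elements stay separated;
    a word split by inline tags would gain a spurious space.
    """
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def process_key(row: dict) -> tuple:
    """Composite identifier built from process_number and process_year."""
    return (int(row["process_number"]), int(row["process_year"]))


# Invented record using the column names from Table 1.
row = {
    "process_number": "123",
    "process_year": "2023",
    "proposal_type": "Bill of Law",
    "doc_text": "<p>Art. 1º Fica <b>instituído</b> o programa.</p>",
}

print(process_key(row), strip_html(row["doc_text"]))
```

Keeping the key as a tuple rather than a concatenated string avoids ambiguity between, for example, process 12 of 2023 and process 122 of 023 in malformed inputs.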

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Oliveira, G.L.A.d.; Santos, B.S.; Silva, M.; Silva, I. Exploring Legislative Textual Data in Brazilian Portuguese: Readability Analysis and Knowledge Graph Generation. Data 2025, 10, 106. https://doi.org/10.3390/data10070106

