Exploring Legislative Textual Data in Brazilian Portuguese: Readability Analysis and Knowledge Graph Generation
Abstract
1. Introduction
2. Related Works
2.1. NLP and Readability Analysis of Legal Texts
2.2. Knowledge Graphs and LLMs Applied to Legal Documents
3. Proposed Methodology
3.1. Text Sanitization
3.2. KG Extraction
- The texts provided include laws, decrees, resolutions, amendments, motions, messages, and indications. Information should only be captured once in cases of redundancy within the text.
- Entities must be categorized into specific types, including Person, Organization, Law, Location, Theme, Occupation, Event, Target Audience, Products/Services, Program, and Time.
- Entities referred to by pronouns or variations in naming (e.g., “José da Silva”, “José”, “he”) must be resolved to their most complete and consistent form (e.g., “José da Silva”).
- Relations must be explicit, concise, and directionally meaningful, typically derived from verbs or expressions within the text. Relations include legislative actions such as “aprovar” (to approve), “propor” (to propose), “revogar” (to revoke), and “decretar” (to decree).
- The model is explicitly instructed not to infer or add information beyond what is present in the text.
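The guidelines above can be illustrated with a small validation step applied to triples returned by a language model. This is a minimal sketch, not the authors' implementation: the `Triple` class and the `resolve_alias` and `clean_triples` functions are hypothetical names, and the alias rule (map a mention to the longest known name containing all of its words) is a simplifying assumption that does not resolve pronouns such as “he”.

```python
from dataclasses import dataclass

# Entity types listed in the extraction guidelines.
ENTITY_TYPES = {
    "Person", "Organization", "Law", "Location", "Theme", "Occupation",
    "Event", "Target Audience", "Products/Services", "Program", "Time",
}


@dataclass(frozen=True)
class Triple:
    """Hypothetical container for one directed (head, relation, tail) fact."""
    head: str
    head_type: str
    relation: str
    tail: str
    tail_type: str


def resolve_alias(mention: str, known_names: list[str]) -> str:
    """Return the longest known name whose words include every word of the mention.

    Assumption: a shorter mention such as "José" refers to the fuller name
    "José da Silva" when that fuller name is known. Unmatched mentions
    (including pronouns) are returned unchanged.
    """
    mention_tokens = set(mention.lower().split())
    candidates = [
        name for name in known_names
        if mention_tokens <= set(name.lower().split())
    ]
    return max(candidates, key=len) if candidates else mention


def clean_triples(raw: list[Triple], known_names: list[str]) -> list[Triple]:
    """Drop triples with unknown entity types or empty relations, resolve
    aliases, normalize relation case, and keep each fact only once."""
    seen = set()
    result = []
    for t in raw:
        if t.head_type not in ENTITY_TYPES or t.tail_type not in ENTITY_TYPES:
            continue
        if not t.relation.strip():
            continue
        resolved = Triple(
            resolve_alias(t.head, known_names), t.head_type,
            t.relation.strip().lower(),
            resolve_alias(t.tail, known_names), t.tail_type,
        )
        if resolved not in seen:  # redundancy is captured only once
            seen.add(resolved)
            result.append(resolved)
    return result


# Illustrative input only; these triples are invented for the example.
known = ["José da Silva"]
raw = [
    Triple("José", "Person", "propor", "Lei 123", "Law"),
    Triple("José da Silva", "Person", "Propor", "Lei 123", "Law"),
    Triple("José", "Animal", "propor", "Lei 123", "Law"),  # invalid type
]
clean = clean_triples(raw, known)
print(clean)
```

In this sketch, the first two triples collapse into one after alias resolution and case normalization, and the third is discarded because “Animal” is not among the allowed entity types.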
3.3. Postprocessing
3.4. Exploratory Data and Readability Analyses
4. Materials and Methods
4.1. Goal Definition
4.2. Planning
4.2.1. Participant and Artifact Selection
4.2.2. Research Questions
- RQ1: Can language models and prompt engineering effectively extract knowledge graphs from unstructured Brazilian Portuguese legislative texts?
- RQ2: Is the proposed methodology suitable for supporting corpus characterization, textual readability evaluation, and KG generation?
- RQ3: What is the distribution of legislative proposals by type?
- RQ4: What are the most prevalent thematic areas in the legislative proposals?
- RQ5: Based on the distribution of proposals, is it possible to detect patterns in legislative activities over time?
- RQ6: Based on textual statistics, can it be determined how verbose legislative proposals are?
- RQ7: Based on textual readability indicators, how readable are the texts of ALRN’s legislative proposals?
4.2.3. Instrumentation
- Anaconda’s JupyterLab;
- The corpus related to the legislative proposals of ALRN, previously discussed in Section 4.2.1 and available at Mendeley Data [14];
- The Jupyter notebooks and Python scripts containing all source code for the data analysis, available in the GitHub repository [56].
4.3. Operation
5. Results and Discussion
5.1. Exploratory Data Analysis
5.2. Readability Assessment
5.3. Generated KGs
6. Threats to Validity
6.1. Sampling Bias
6.2. Internal Validity
6.3. Construct Validity
6.4. External Validity
6.5. AI Biases and Hallucinations
7. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
NLP | Natural Language Processing |
KG | Knowledge Graph |
LLMs | Large Language Models |
ALRN | Legislative Assembly of Rio Grande do Norte |
EDA | Exploratory Data Analysis |
NER | Named Entity Recognition |
RE | Relation Extraction |
ETL | Extract, Transform, and Load |
Appendix A
References
- Chamber of Deputies, Brazil. The Legislative Branch. 2024. Available online: https://www2.camara.leg.br/english/papellegislativo.html (accessed on 2 January 2025).
- Federal Senate, Brazil. Legislative Documents and Public Access. 2024. Available online: https://www12.senado.leg.br/institucional/carta-de-servicos/en/carta-de-servicos (accessed on 2 January 2025).
- Anh, D.H.; Do, D.T.; Tran, V.; Minh, N.L. The Impact of Large Language Modeling on Natural Language Processing in Legal Texts: A Comprehensive Survey. In Proceedings of the 15th International Conference on Knowledge and Systems Engineering (KSE), Hanoi, Vietnam, 18–20 October 2023; pp. 1–7.
- Alves, A.; Miranda, P.; Mello, R.; Nascimento, A. Automatic Simplification of Legal Texts in Portuguese Using Machine Learning. In Legal Knowledge and Information Systems; IOS Press: Amsterdam, The Netherlands, 2023; pp. 281–286.
- Albuquerque, H.O.; Souza, E.; Gomes, C.; Pinto, M.H.d.C.; Ricardo Filho, P.; Costa, R.; Lopes, V.T.d.M.; da Silva, N.F.; de Carvalho, A.C.; Oliveira, A.L. Named Entity Recognition: A Survey for the Portuguese Language. Proces. Leng. Nat. 2023, 70, 171–185.
- Moreira Valle, L.; Giacomazzi Dantas, S.; Guerreiro e Silva, D.; Silva Dias, U.; Monteiro Monasterio, L. RegBR: A novel Brazilian government framework to classify and analyze industry-specific regulations. PLoS ONE 2022, 17, e0275282.
- Fitsilis, F.; Mikros, G. Smart Parliaments: Data-Driven Democracy; European Liberal Forum: Ixelles, Belgium, 2022.
- Lai, J.; Gan, W.; Wu, J.; Qi, Z.; Yu, P.S. Large language models in law: A survey. AI Open 2024, 5, 181–196.
- Negro, A. Graph-Powered Machine Learning; Manning Publications Co.: Shelter Island, NY, USA, 2021.
- Schneider, P.; Schopf, T.; Vladika, J.; Galkin, M.; Simperl, E.; Matthes, F. A Decade of Knowledge Graphs in Natural Language Processing: A Survey. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online only, 20–23 November 2022; pp. 601–614.
- Wu, L.; Chen, Y.; Shen, K.; Guo, X.; Gao, H.; Li, S.; Pei, J.; Long, B. Graph Neural Networks for Natural Language Processing: A Survey. Found. Trends Mach. Learn. 2023, 16, 119–328.
- Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; Yu, P.S. A Survey on Knowledge Graphs: Representation, Acquisition, and Applications. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 494–514.
- Liang, X.; Wang, Z.; Li, M.; Yan, Z. A survey of LLM-augmented knowledge graph construction and application in complex product design. Procedia CIRP 2024, 128, 870–875.
- Alves, G.; Santos, B.S.; Silva, M.; Silva, I. Brazilian Portuguese Legislative Documents: A Dataset from the Legislative Assembly of Rio Grande do Norte; Mendeley Data, Version 1; Universidade Federal do Rio Grande do Norte: Natal, Brazil, 2025.
- Palmirani, M.; Vitali, F.; Van Puymbroeck, W.; Nubla Durango, F. Legal Drafting in the Era of Artificial Intelligence and Digitisation; European Commission: Brussels, Belgium, 2022.
- Souza, E.; Moriyama, G.; Vitório, D.; Carvalho, A.C.P.L.F.d.; Félix, N.; Albuquerque, H.O.; Oliveira, A.L.I. Assessing the Impact of Stemming Algorithms Applied to Brazilian Legislative Documents Retrieval. In Proceedings of the Anais do Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana (STIL), Online, 29 November–3 December 2021; SBC: Porto Alegre, Brazil, 2021; pp. 227–236.
- Albuquerque, H.O.; Costa, R.; Silvestre, G.; Souza, E.; da Silva, N.F.F.; Vitório, D.; Moriyama, G.; Martins, L.; Soezima, L.; Nunes, A.; et al. UlyssesNER-Br: A Corpus of Brazilian Legislative Documents for Named Entity Recognition. In Proceedings of the Computational Processing of the Portuguese Language, Fortaleza, Brazil, 21–23 March 2022; pp. 3–14.
- Rocha, F.C.; Souza, E.; Vitório, D.; Silva, N.F.F.d.; Carvalho, A.C.P.L.F.d.; Oliveira, A.L.I. Avaliação de frameworks para Recuperação de Documentos Legislativos: Um Estudo de Caso na Câmara dos Deputados Brasileira. In Proceedings of the Anais do Workshop de Computação Aplicada em Governo Eletrônico (WCGE), João Pessoa, Brazil, 6–11 August 2023; SBC: Porto Alegre, Brazil, 2023; pp. 224–231.
- Souza, E.; Vitório, D.; Moriyama, G.; Santos, L.; Martins, L.; Souza, M.; Fonseca, M.; Félix, N.; Carvalho, A.C.; Albuquerque, H.O.; et al. An Information Retrieval Pipeline for Legislative Documents from the Brazilian Chamber of Deputies. In Legal Knowledge and Information Systems; Schweighofer, E., Ed.; Frontiers in Artificial Intelligence and Applications; IOS Press: Amsterdam, The Netherlands, 2021.
- Vitório, D.; Souza, E.; Martins, L.; da Silva, N.F.F.; de Leon Ferreira de Carvalho, A.C.P.; Oliveira, A.L.I. Ulysses-RFSQ: A Novel Method to Improve Legal Information Retrieval Based on Relevance Feedback. In Intelligent Systems, Proceedings of the 11th Brazilian Conference, Campinas, Brazil, 28 November–1 December 2022; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2022; pp. 77–91.
- Vitório, D.; Souza, E.; Martins, L.; da Silva, N.F.F.; de Carvalho, A.P.d.L.; Oliveira, A.L.I.; de Andrade, F.E. Building a Relevance Feedback Corpus for Legal Information Retrieval in the Real-Case Scenario of the Brazilian Chamber of Deputies. Lang. Resour. Eval. 2024, 59, 1257–1277.
- Maia, D.F.; Silva, N.F.F.; Souza, E.P.R.; Nunes, A.S.; Procópio, L.C.; Sampaio, G.d.S.; Dias, M.d.S.; Alves, A.O.; Maia, D.F.; Ribeiro, I.A.; et al. UlyssesSD-Br: Stance Detection in Brazilian Political Polls. In Progress in Artificial Intelligence, Proceedings of the 21st EPIA Conference on Artificial Intelligence, Lisbon, Portugal, 31 August–2 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 85–95.
- Silva, N.F.F.d.; Silva, M.C.R.; Pereira, F.S.F.; Tarrega, J.P.M.; Beinotti, J.V.P.; Fonseca, M.; Andrade, F.E.d.; de Carvalho, A.C.P.d.L.F. Evaluating Topic Models in Portuguese Political Comments About Bills from Brazil’s Chamber of Deputies. In Proceedings of the Intelligent Systems: 10th Brazilian Conference, BRACIS 2021, Virtual Event, 29 November–3 December 2021; pp. 104–120.
- Cifuentes-Silva, F.; Labra Gayo, J.E. Legislative Document Content Extraction Based on Semantic Web Technologies. In Proceedings of the Semantic Web (ESWC 2019), 16th International Conference, Portorož, Slovenia, 2–6 June 2019; pp. 558–573.
- Colombo, A.; Bernasconi, A.; Ceri, S. Modelling Legislative Systems into Property Graphs to Enable Advanced Pattern Detection. arXiv 2024, arXiv:2406.14935.
- Oliveira, F.d.; Oliveira, J.M.P.d. A RDF-based graph to representing and searching parts of legal documents. Artif. Intell. Law 2023, 32, 667–695.
- Bianchini, F.; Calamo, M.; De Luzi, F.; Macrì, M.; Mecella, M. A Service-Based Pipeline for Complex Linguistic Tasks Adopting LLMs and Knowledge Graphs. In Service-Oriented Computing, Proceedings of the 18th Symposium and Summer School, SummerSOC 2024, Crete, Greece, 24–29 June 2024; Springer: Berlin/Heidelberg, Germany, 2025; pp. 145–161.
- Colombo, A. Leveraging Knowledge Graphs and LLMs to Support and Monitor Legislative Systems. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 5443–5446.
- Gao, S.; Li, Y.; Ge, F.; Lin, M.; Yu, H.; Wang, S.; Miao, Z. LeGalFormer: A Graph Representation Learning and Transformer-based Approach for Legal Similar Case Retrieval. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–9.
- Li, J.; Qian, L.; Liu, P.; Liu, T. Construction of Legal Knowledge Graph Based on Knowledge-Enhanced Large Language Models. Information 2024, 15, 666.
- Shi, J.; Guo, Q.; Liao, Y.; Wang, Y.; Chen, S.; Liang, S. Legal-LM: Knowledge Graph Enhanced Large Language Models for Law Consulting. In Proceedings of the Advanced Intelligent Computing Technology and Applications. Springer Nature Singapore, Tianjin, China, 5–8 August 2024; pp. 175–186.
- Speer, R. ftfy, version 5.5; Zenodo: Geneva, Switzerland, 2019.
- Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2009.
- Keraghel, I.; Morbieu, S.; Nadif, M. Recent Advances in Named Entity Recognition: A Comprehensive Survey and Comparative Study. arXiv 2024, arXiv:2401.10825.
- Zhang, L.; Sun, X.; Ma, X.; Hu, K. A New Entity Relationship Extraction Method for Semi-Structured Patent Documents. Electronics 2024, 13, 3144.
- Bratanič, T. Graph Algorithms for Data Science; Manning Publications Co.: Shelter Island, NY, USA, 2023.
- Zhu, Y.; Wang, X.; Chen, J.; Qiao, S.; Ou, Y.; Yao, Y.; Deng, S.; Chen, H.; Zhang, N. LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities. World Wide Web 2024, 27, 58.
- Negro, A.; Kus, V.; Futia, G.; Montagna, F. Knowledge Graphs and LLMs in Action; Manning Publications Co.: Shelter Island, NY, USA, 2025.
- Rao, P.J.; Rao, K.N.; Gokuruboyina, S.; Neeraja, K. An Efficient Methodology for Identifying the Similarity Between Languages with Levenshtein Distance. In Proceedings of the 6th International Conference on Communications and Cyber Physical Engineering, Hyderabad, India, 28–29 April 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 161–174.
- Santos, B.S.; Silva, I.; Melo, E. Metodologia orientada a ciência de dados em grafos para avaliação de PPGs. In Proceedings of the XV Simpósio Brasileiro de Automação Inteligente (SBAI 2021), Rio Grande, Rio Grande do Sul, Brazil, 17–19 October 2021; pp. 1998–2005.
- Santos, B.S.; Silva, I.; Costa, D.G. Symmetry in Scientific Collaboration Networks: A Study Using Temporal Graph Data Science and Scientometrics. Symmetry 2023, 15, 601.
- Legislative Assembly of Rio Grande do Norte - ALRN. Unale 2024: Director of Technology Management Presents Advances in Artificial Intelligence. 2024. Available online: https://www.al.rn.leg.br/noticia/31558/unale-2024-diretor-de-gestao-tecnologica-apresenta-avancos-em-inteligencia-artificial (accessed on 20 December 2024).
- Meta AI. Llama 3.2 Model Card. 2024. Available online: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct (accessed on 16 December 2024).
- Meta AI. Llama 3.2: Advancing AI for Vision and Language at the Edge and Beyond. 2024. Available online: https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/ (accessed on 16 December 2024).
- Robinson, I.; Webber, J.; Eifrem, E. Graph Databases: New Opportunities for Connected Data, 2nd ed.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2015.
- Anthapu, R. Graph Data Processing with Cypher; Packt Publishing: Birmingham, UK, 2022.
- Scifo, E. Graph Data Science with Neo4j; Packt Publishing: Birmingham, UK, 2023.
- Martins, T.B.F.; Ghiraldelo, C.M.; Nunes, M.d.G.V.; Oliveira Júnior, O.N.d. Readability formulas applied to textbooks in brazilian portuguese. In Notas do ICMSC; Série Computação; ICMSC-USP: São Carlos, Brazil, 1996; pp. 1–15.
- Leal, S.E.; Duran, M.S.; Scarton, C.E.; Hartmann, N.S.; Aluísio, S.M. NILC-Metrix: Assessing the complexity of written and spoken language in Brazilian Portuguese. Lang. Resour. Eval. 2024, 58, 73–110.
- Biderman, M.T.C. Dicionário Didático de Português; Editora Ática: São Paulo, Brazil, 1998.
- McKinney, W. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and Jupyter; O’Reilly Media: Sebastopol, CA, USA, 2022.
- Döbler, M.; Großmann, T. Data Visualization with Python; Packt Publishing Ltd.: Birmingham, UK, 2019.
- Anaconda Inc. Anaconda: The Data Science Platform. 2024. Available online: https://www.anaconda.com (accessed on 10 December 2024).
- Google Inc. Google Colab: Hi, This Is the Colaboratory. 2024. Available online: https://colab.research.google.com (accessed on 10 December 2024).
- Tunstall, L.; von Werra, L.; Wolf, T. Natural Language Processing with Transformers; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2022.
- Alves, G.; Silva, I. GitHub Repository of This Study. 2025. Available online: https://github.com/conect2ai/legislative-texts-rn (accessed on 5 January 2025).
- Brazilian National Congress. Glossary of Legislative Terms, 2nd ed.; Brazilian National Congress: Brasília, Brazil, 2020; Available online: https://www.congressonacional.leg.br/legislacao-e-publicacoes/glossario-legislativo (accessed on 24 November 2024).
- Legislative Assembly of the State of São Paulo. Legislative Process Manual; Legislative Assembly of the State of São Paulo: São Paulo, Brazil, 2023. Available online: https://www.al.sp.gov.br/arquivos/documentacao/estudos-e-manuais/manual-processo-legislativo/manual_proclegis_2.pdf (accessed on 19 December 2024).
- Abbass, H.; Crockett, K.; Garibaldi, J.; Gegov, A.; Kaymak, U.; Sousa, J.M.C. Editorial: From Explainable Artificial Intelligence (xAI) to Understandable Artificial Intelligence (uAI). IEEE Trans. Artif. Intell. 2024, 5, 4310–4314.
Feature | Description |
---|---|
process_number | An internal control number assigned to the legislative proposal during its processing within the legislative system. |
process_year | The year when the legislative proposal was registered in the legislative process. It complements the process_number to ensure the proposal’s uniqueness and traceability during internal processing. |
proposal_number | A unique identifier assigned to a proposal once it is officially publicized in a legislative session. |
proposal_year | The year in which the proposal was officially publicized. |
proposal_summary | A brief description of the proposal, summarizing its main content and providing an immediate understanding of the legislated matter. |
proposal_register_date | The date when the proposal was formally registered in the legislative system. |
proposal_initiative_description | The name of the entity or individual responsible for initiating the proposal. |
proposal_initiative_type | The type of initiative associated with the proposal, indicating whether a parliamentarian or an internal/external entity authored it. |
proposal_subject_description | A textual description of the thematic subject or area the proposal addresses (e.g., education, healthcare, infrastructure). |
proposal_type | The classification of the proposal by type of normative instrument (e.g., Bill of Law, Bill of Supplementary Law, Bill of Legislative Decree, Bill of Resolution, Constitutional Amendment Bill, or Request). |
doc_id | A unique identifier for the document within the legislative system. |
doc_subject | Usually the document name, which reflects either a specific topic addressed in the document or the document type. |
doc_type | The type of the document based on its purpose (e.g., Engrossed Bill, Recommendation, Memo, Communication, or descriptions similar to the proposal_type examples). |
doc_text | Full textual content of the document with HTML formatting tags. |
doc_inclusion_date_proposal | The date when the document was included within the proposal. |
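To make the schema above concrete, the following minimal sketch shows how a single record might be handled in Python. It is an illustration under stated assumptions, not the authors' pipeline: the record values are invented, the field types (integers for numbers and years) are assumed, and `strip_html` and `process_key` are hypothetical helpers. The sketch uses only the standard library to remove the HTML tags that `doc_text` is described as containing and to combine `process_number` with `process_year` into a single identifier.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities like &iacute;
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(doc_text: str) -> str:
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(doc_text)
    parser.close()
    # Text nodes are joined with a space so adjacent paragraphs do not fuse.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def process_key(record: dict) -> str:
    """Combine process_number and process_year into one identifier
    (the "number/year" format is an assumption made for this example)."""
    return f"{record['process_number']}/{record['process_year']}"


# Invented example record using a subset of the column names from the table.
record = {
    "process_number": 123,
    "process_year": 2023,
    "proposal_type": "Bill of Law",
    "doc_id": 1,
    "doc_text": (
        "<p>Art. 1&ordm; Fica institu&iacute;do o programa.</p>"
        "<p>Art. 2&ordm; Esta lei entra em vigor.</p>"
    ),
}

plain_text = strip_html(record["doc_text"])
print(process_key(record), "->", plain_text)
```

Joining text nodes with a space keeps paragraphs apart but can leave a stray space around inline tags; a production sanitization step would likely need more careful handling.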
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Oliveira, G.L.A.d.; Santos, B.S.; Silva, M.; Silva, I. Exploring Legislative Textual Data in Brazilian Portuguese: Readability Analysis and Knowledge Graph Generation. Data 2025, 10, 106. https://doi.org/10.3390/data10070106