Development and Evaluation of an Intelligence and Learning System in Jurisprudence Text Mining in the Field of Competition Defense

Abstract: A jurisprudence search system is a solution that makes available to its users a set of decisions made by public bodies, i.e., the recurring understanding of those bodies, as a way of understanding the law. Through the similarity of legal decisions, jurisprudence provides stability, uniformity, and some predictability in the analysis of a decided case. This paper presents a proposed solution architecture for the jurisprudence search system of the Brazilian Administrative Council for Economic Defense (CADE), with a view to building and expanding the knowledge generated regarding the economic defense of competition to support the agency's final procedural business activities. We conducted a literature review and a survey to investigate the characteristics and functionalities of the jurisprudence search systems used by Brazilian public administration agencies. Our findings revealed that the prevailing technologies among Brazilian agencies for developing jurisprudence search systems are the Java programming language and Apache Solr as the main indexing engine. Around 87% of the jurisprudence search systems use machine learning classification. On the other hand, the systems make little use of artificial intelligence and morphological analysis techniques. No agency participating in the survey claimed to use an ontology to treat structured and unstructured data from different sources and formats.


Introduction
Jurisprudence is a set of interpretations of laws and decisions made by courts regarding similar cases. Thus, it indicates a common and preeminent understanding of the judges of a particular court about a set of events, assisting them in deciding similar future cases and thus reducing the effort involved in the elaboration of new legal documents.
Our main findings were: (1) most agencies use a textual database processing system; (2) Apache Solr is the most used indexing engine, and Java is the most used programming language for developing textual database processing systems in Brazil; (3) most Brazilian agencies do not use Artificial Intelligence in their solutions; and (4) less than 15% of the public administration agencies in Brazil comply with the Brazilian General Data Protection Law (LGPD).
We organized the rest of the work as follows: Section 2 introduces the background, related works compared to this study, and the case study. Section 3 presents the methods we have employed to conduct the research. Section 4 provides answers to each research question and discusses some of our findings and the implications of this research. Section 5 discusses some limitations and threats to validity. Finally, Section 6 concludes our work and presents directions for future works.

Background
A Jurisprudence system is a solution that makes available to its users a set of collegiate or court decisions, i.e., the recurring understanding of those decisions [5], as a way of understanding the law. Jurisprudence consists of the similarity of legal decisions, which provides stability, uniformity, and some predictability in the analysis of a decided case. A Jurisprudence system enables the search for documents related to a topic in reference collections and databases internal and external to a given organization. Resources and technologies generally used in the development of a search system include facets [6], indexing [7], ontologies [8], Text Mining [9,10], and natural language processing (NLP) [10].
An ontology is a taxonomy-based knowledge representation model used to present, describe, and express a specific domain. Collecting the terms of a domain, as well as specifying its structure, is of great importance and one of the essential parts of an ontology [11]. Ontologies organize and structure the information that describes a domain to make it understandable by all interested parties. Ontologies can also establish interconnections between Information Systems when these systems share or make parts or all of an ontology available for any purpose [12].
For purposes of definition and conceptual limits in the scope of knowledge management, Martins [13] defined a taxonomy as a structuring and hierarchical element, which classifies and characterizes the classes and subclasses used in the construction of an ontology. Thus, taxonomies work towards organizing information, while ontologies seek to establish semantic relationships between concepts (classes), assigning characteristics (properties) to the terms (attributes). The essential components of an ontology are [13]:
• Classes: Sets, collections, concepts, programmable classes, types of objects or things, organized in a taxonomy;
• Relationships: Represent the type of interaction between concepts or describe adjectives or qualities of classes;
• Axioms: Used to model always-true sentences (constraints);
• Instances or individuals: Used to represent specific elements of the classes, that is, the data itself.
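As a minimal illustration, the four components above can be sketched in Python as plain data structures; all class names, relations, and instances below are hypothetical examples, not taken from CADE's actual ontology:

```python
# Minimal illustration of the four ontology components (hypothetical data).

# Classes: concepts organized in a taxonomy (subclass -> superclass)
taxonomy = {
    "AdministrativeProcess": "LegalDocument",
    "Judgment": "LegalDocument",
}

# Relationships: typed interactions between classes
relationships = [
    ("Judgment", "decides", "AdministrativeProcess"),
]

# Axioms: sentences that must always hold (constraints)
def axiom_every_judgment_decides_a_process(instances, relations):
    """Every Judgment instance must be linked to some AdministrativeProcess."""
    judgments = [i for i, cls in instances.items() if cls == "Judgment"]
    decided = {s for s, p, o in relations if p == "decides"}
    return all(j in decided for j in judgments)

# Instances: the data itself
instances = {"judgment-42": "Judgment", "process-7": "AdministrativeProcess"}
instance_relations = [("judgment-42", "decides", "process-7")]

assert axiom_every_judgment_decides_a_process(instances, instance_relations)
```

In a production system these structures would typically be expressed in a dedicated ontology language such as OWL rather than in application code.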
An ontology supports knowledge sharing and reuse by providing semantics for the various subject areas. Due to the structural and formal support of domain schema representations, ontologies enable the automation of structured and unstructured data processing [8] and are thus at the core of the Semantic Web. Ontologies are considered an alternative for solving data heterogeneity problems.
Indexing is the process of selecting keywords and concepts from content for document retrieval. In indexing systems, the automation of this process uses methods that perform word or n-gram extraction as an alternative to keyword indexing, where the resulting index points to the documents that contain them [7]. For conceptual purposes, n-grams are fragments of selected words that bring good search results when used in indexes [7].
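A simple sketch of word n-gram extraction feeding an inverted index, assuming bigrams and toy documents (illustrative only, not the paper's implementation):

```python
# Word n-gram extraction and a simple inverted index (illustrative sketch).

def ngrams(text, n=2):
    """Return the word n-grams of a text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_index(docs, n=2):
    """Map each n-gram to the set of documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for gram in ngrams(text, n):
            index.setdefault(gram, set()).add(doc_id)
    return index

docs = {
    "d1": "cartel conduct in the fuel market",
    "d2": "merger review in the fuel market",
}
index = build_index(docs)
# the index formed points to the documents that contain each n-gram:
# "fuel market" appears in both documents, "cartel conduct" only in d1
```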
In jurisprudence search systems, the indexing and data search tool builds classificatory, textually indexed knowledge bases, which allows consolidating, on the same platform, the cases judged and decisions taken by the courts, as well as other document collections of interest. This consolidation can support the formation of specific jurisprudence knowledge and favor the homogeneity and predominance of trends in decision-making in processes with the same content.
Information extraction automatically derives structured data from unstructured text, using techniques such as facets. Facets are terms classified and selected from a previously indexed text in order to facilitate the search process, capable of covering different ranges of values and reflecting some identity of the document [14]; i.e., they are textual elements classified to build composite subjects. Therefore, faceted search presents itself as an efficient technique that can significantly reduce the information overload for the user [6].
Faceted search allows the user to explore a data collection by applying filters in an arbitrary order [15]. The information elements are organized by a classification system using facets, enabling the user to elaborate their search progressively, in a refined way, presented with the different choice options and with accurate results [16].
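The two defining properties of faceted search, order-independent filtering and facet counts over the remaining results, can be sketched as follows (field names and documents are invented):

```python
# Faceted filtering sketch: filters apply in any order; facet counts
# summarize the collection by field (illustrative data).

documents = [
    {"id": 1, "year": 2020, "collection": "jurisprudence", "subject": "cartel"},
    {"id": 2, "year": 2020, "collection": "legislation",   "subject": "merger"},
    {"id": 3, "year": 2021, "collection": "jurisprudence", "subject": "cartel"},
]

def apply_filters(docs, **filters):
    """Keep only documents matching every facet filter."""
    return [d for d in docs if all(d[k] == v for k, v in filters.items())]

def facet_counts(docs, field):
    """Count documents per value of a facet field."""
    counts = {}
    for d in docs:
        counts[d[field]] = counts.get(d[field], 0) + 1
    return counts

# The order of filter application does not change the result set
a = apply_filters(documents, year=2020, subject="cartel")
b = apply_filters(apply_filters(documents, subject="cartel"), year=2020)
assert a == b
```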
Artificial Intelligence techniques for information retrieval are an essential component of legal science [17]. Artificial Intelligence in such a system is applied using Text Mining and Machine Learning techniques [9]. Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed. First, the model is trained by constantly feeding it data; this is followed by cross-validation, which estimates the training error and validates the selected dataset on the test set. In addition, machine learning can be used to extract the parts of a legal document, identify correlations, and generate a document structure file based on a legal ontology [10].
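The cross-validation step mentioned above can be sketched with a plain k-fold split; the labels and the trivial majority-vote "classifier" below are stand-ins for illustration only:

```python
# k-fold cross-validation sketch: each fold serves once as the validation
# set while the remaining folds are used for training (illustrative data).

def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k folds."""
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n_samples
        val = list(range(start, end))
        train = [j for j in range(n_samples) if j not in val]
        folds.append((train, val))
    return folds

labels = [1, 1, 1, 0, 1, 1]  # toy binary labels for six documents

errors = []
for train_idx, val_idx in k_fold_indices(len(labels), k=3):
    train_labels = [labels[i] for i in train_idx]
    prediction = max(set(train_labels), key=train_labels.count)  # majority vote
    err = sum(labels[i] != prediction for i in val_idx) / len(val_idx)
    errors.append(err)

estimated_error = sum(errors) / len(errors)  # averaged validation error
```

In practice the majority vote would be replaced by a real learner (e.g., an SVM), but the fold mechanics are the same.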
Text Mining is a resource for organizing and structuring data extracted from collections, or for discovering textual knowledge in databases, through natural language processing (NLP) tasks. It generally refers to the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [10]. Information is extracted from the document and converted into structured data, and knowledge is then extracted from parts or fragments of text by combining patterns. Textual structure in NLP is a directional relationship between text fragments, which methods handle in order to recognize, generate, or extract parts of textual expressions and infer the relationship of the parts to the whole [17].
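As a concrete miniature of converting unstructured decision text into structured data by combining patterns, consider the regex sketch below (the process number format, date, and names are invented for illustration):

```python
import re

# Pattern-based information extraction from a fragment of decision text
# (text and patterns are invented; real systems use richer NLP pipelines).

text = ("Administrative Process No. 08700.001234/2020-55, judged on "
        "12/05/2021, Reporting Commissioner Jane Doe.")

patterns = {
    "process_number": r"No\.\s*([\d./-]+)",
    "judgment_date":  r"judged on\s*(\d{2}/\d{2}/\d{4})",
    "commissioner":   r"Commissioner\s+([A-Z][a-z]+\s+[A-Z][a-z]+)",
}

# Each matched pattern becomes a field in a structured record
record = {field: (m.group(1) if (m := re.search(rx, text)) else None)
          for field, rx in patterns.items()}
```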

Related Works
Barros et al. [18] presented a study in which supervised machine learning techniques classified documents related to judicial decisions in order to ascertain the opinion trends of Brazilian courts. The authors applied a methodology to process the judicial decisions from the Regional Labor Court (TRT) of the 3rd Region, located in the Brazilian state of Minas Gerais, using data mining to extract and process the information present in the judicial documents and natural language processing to perform the automatic classification of the documents, reaching more than 90% accuracy when indicating the tendency of each judge in a sentence.
Gomes and Ladeira [19] presented a study related to the use of a text search tool of the Brazilian Superior Court of Justice (STJ) and evaluated the performance of searches based on Boolean queries with logical and proximity operators. The authors concluded that the court's system could be improved to facilitate the search for decisions already made by the STJ, optimize access to jurisprudence, and follow the evolution of the court's understanding on several themes. This improvement was made possible by techniques from the Text Retrieval Conference (TREC) for comparing textual similarities. In addition, the authors found that the Best Match 25 (BM25) and Term Frequency-Inverse Document Frequency (TF-IDF) models improved search performance, obtaining better results than prediction-based semantic models such as Word2Vec and Bidirectional Encoder Representations from Transformers (BERT).
Bueno et al. [20] used Artificial Intelligence to assist legal professionals in searching for jurisprudence in quality databases of judicial decisions. The authors' text base was fed with relevant legal cases and the identification of appropriate jurisprudence for retention, with automatic extraction of information from the documents into the database, integrated with a thesaurus based on standard legal terms and with retrieval based on similar terms.
Ordoñez et al. [21] presented the PROJLAW application, with support for Natural Language Processing (NLP), to analyze the texts that make up a court judgment. NLP and linked data were used for document identification, indexing, and recommendation. After validating the system through user experience, the application efficiently produced answers for the searches performed, with keyword insertion during the search. The authors concluded that the more keywords used, the greater the search accuracy.
Aletras et al. [22] addressed the use of Artificial Intelligence with natural language processing for the analysis of judicial decisions, building predictive models that reveal the patterns guiding judicial decisions in order to predict possible future decisions. The authors proposed building a tool to predict patterns from the European Convention on Human Rights (ECHR) using the Support Vector Machine (SVM) supervised machine learning algorithm [23].
Silva et al. [24] presented their research and development project, called VICTOR, aimed at solving pattern recognition problems in texts from court cases belonging to the Brazilian Supreme Federal Court (STF). Unlike previous research, the authors proposed a solution to speed up the analysis of judicial decisions directed to the STF and identify which cases are linked to particular subjects of general repercussion, such as competition, price taking, etc., using Convolutional Neural Networks (CNN) [24] and Natural Language Processing.
The main difference between existing Jurisprudence Search systems and the system proposed in this research is that the developed system applied evaluation techniques and iterative redefinitions in the verification and validation of all the functionalities of the proposed solution and used the accessibility and usability guidelines proposed in the literature during the system development process. Thus, we can infer that the developed system follows the best practices used in existing Jurisprudence Search systems. Moreover, one of the differentials of the developed Jurisprudence Search system was that it was submitted to a usability evaluation by four experienced usability experts [5]. Canedo et al. [5] performed a heuristic usability evaluation of the Jurisprudence Search system, using a set of 13 usability heuristics and their respective sub-heuristics, considering the system user, the context of use, the task, and the cognitive load as usability factors [25]. Finally, the Jurisprudence Search system development team incorporated all the improvements suggested by the usability experts into the final version of the system made available to the end-users.
Regarding the technological aspects, modern early data processing resources were used in the collection and loading of structured data, implemented with respect to existing documentary resources and external data environments. We can highlight the introduction of statistical concepts in the inference of natural language understanding and discourse analysis to form a supplementary knowledge base about the methodologies and techniques used. In addition, we performed data preprocessing, transformation, and cleaning [26].
Weber et al. [27] defined the concept of Intelligent Jurisprudence Research (IJR) as the activity of performing a jurisprudence search using a computational tool with Case-Based Reasoning (CBR) systems. According to the authors, data retrieval systems that use statistical methods have low accuracy. Thus, the authors consider that the knowledge-based indexing process is more efficient by applying case-based reasoning, an artificial intelligence technique that models aspects of human cognition to solve expert problems. Court cases are described in natural language, and this makes systematic reading difficult. Therefore, it requires case engineering efforts. The model proposed by the authors converts textual decisions into cases by defining the attributes comprising the issues that best represent the experiences described in the judicial decisions and employing mining methods to extract values for the attributes automatically.
Giacalone et al. [28] proposed a statistical model for text mining on a web database to verify the duration of a trial, the solution adopted by the judge, and its correspondence with other stored decisions. The model was based on a knowledge base and used a hybrid approach to search for text similarities and semantic relations between two concepts. The authors tested the proposed model on a repository containing more than 100 sentences.
Houy et al. [29] developed a system called ARGUMENTUM to search for arguments, justifications, and refutations of statements in order to analyze judicial decisions. The authors used techniques from argumentation mining, Support Vector Machines (SVM), the Argument Markup Language (AML), and Natural Language Processing (NLP). Pasquale and Cashwell [30] criticized the indiscriminate use of prediction techniques in the judicial system and their impact on civil law, questioning the social utility of prediction models when applied to the legal system. The authors stated that using algorithms to perform predictive analysis in judicial contexts constitutes an emergent jurisprudence of behaviorism, since it relies on a model of a fundamentally mental process as a black box transforming inputs into outputs. Furthermore, since it deals with a system created by humans, the authors stated that predictive analytics could be biased instead of supporting informed decision-making, since the people affected by automated classification and categorization cannot understand the reason for the decisions that affected them.
These techniques were selected to build and expand the knowledge generated regarding the economic defense of competition and to support the procedural business activities of the organization. As sub-products of this process, we have the automated revision of controlled vocabularies and resources for structuring semantic and ontological databases.

Case Study
To assist the legal process managers of the Administrative Council for Economic Defense (CADE), a technical solution named the Jurisprudence Search System was developed to retrieve information stored in the Electronic Information System (SEI) and other databases; it can index and search the information requested by the user within the scope of processes already defined by CADE. Furthermore, the data indexing and search system uses the concept of forming textually indexed, classificatory knowledge bases, which allows the consolidation, on a single platform, of judged cases and decisions made by CADE, as well as other collections of documents of interest, so that the search supports the formation of specific jurisprudence knowledge. As a result, the system favors the homogeneity and predominance of trends in decision-making by CADE's Commissioners and supports managers in competition matters. During the development of the Jurisprudence System, the project researchers analyzed and determined five important evolutionary axes in the understanding and treatment of the research and development (R&D) problem, which are:
• Infrastructure, APIs, and Interfaces: Deals with the infrastructure requirements (deployment and configuration), interface, navigation, and Apache Solr [40] handling in its searches;
• Collection, Retrieval, and Indexing: Deals with the specific configuration for the "Jurisprudence" collection;
• Information Structure: Development of a suitable ontology for the formation of shared or unshared classificatory data, in order to incorporate new data collections in the short and medium term and increase the investigative capacity of the intended system;
• Analysis and Morphology: Informs CADE of the results of the statistical analysis of the incorporated documents;
• Machine Learning, Research, and Investigation: Supports models, mechanisms, and techniques of adaptive and evolutionary analysis of classifications and searches, through the automated use of the results obtained in the Analysis and Morphology treatment.
We used machine learning classification because we considered it adequate for: (a) performing more restrictive filters through more qualified aggregations than simple text search; (b) forming clusters to identify groups of documents of interest from a given group of reserved words; (c) expanding visual interpretation capabilities through document relation graphs and word cloud formation techniques; (d) applying advanced automatic summarization techniques; (e) interpreting named entities and their relations across several documents; and (f) supporting the formation of controlled vocabularies through n-gram validation.

We chose Apache Solr [15,40,41] as the system development platform. Solr provides scalable indexing and search, facets for managing searches, occurrence highlighting, and advanced analysis capabilities. Through this tool, the search system can provide advanced search filters, which can be conditional (adding specific fields to return an exact answer from the system), search with specific characters/terms, by proximity or Boolean operators, search by relevance, and phonetic search with spell checking and autosuggestion. As search results, the system features word highlighting, pagination and sorting, controlled vocabulary synonyms, the definition of stopword terms, and document standardization. In addition, the Jurisprudence system allows the indexing of various file extensions, such as PDF with OCR, Word documents, and Excel spreadsheets. Figure 1 shows an example of a search using the developed system. Collections can be selected according to the end user's needs to perform the search in the system's database, i.e., judgments from the Federal Audit Court, guides and publications, jurisprudence, legislation, news, and technical opinions. The search results show all documents with the searched word ("Cartel") highlighted.
In addition, the company names returned by the search are protected and displayed as <blind name>. The system returns ten search results per page, and for each of them we have the options of process data, related documents, summaries (summary of a decision), verbatim (a sequence of keywords or expressions indicating the subject discussed in the text), device (rule resulting from the judgment), and conclusion (final decision), depending on the document type searched and the collection it belongs to. Moreover, the user can add the search to a knowledge basket, available for future searches. Algorithm 1 presents the code for this search.
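The kind of request behind such a search can be sketched by assembling the parameters of a standard Solr `/select` call, with highlighting, facets, and ten results per page; the field and collection names below are illustrative, not CADE's actual schema:

```python
# Sketch of Solr /select query parameters for a search like the one in
# Figure 1 ("Cartel"). Parameter names (q, rows, start, hl, facet, fq) are
# standard Solr; field/collection names are invented for illustration.

def build_solr_params(query, page=1, rows=10, collection=None):
    params = {
        "q": query,
        "rows": rows,                       # ten results per page
        "start": (page - 1) * rows,         # pagination offset
        "hl": "true",                       # highlight the searched word
        "hl.fl": "content",
        "facet": "true",                    # facet counts for lateral filters
        "facet.field": ["subject", "document_type"],
    }
    if collection:
        params["fq"] = f"collection:{collection}"  # restrict to a collection
    return params

params = build_solr_params("Cartel", page=2, collection="jurisprudence")
# page 2 starts after the first ten results (start = 10)
```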
In the indexing stage, we define the essential terms, create the term vectors, and apply TF-IDF and term relevance, considering the synonym treatment performed previously, as shown in Algorithm 1. Then, the machine learning step consists of indexing and classifying the data. We also use the dictionary technologies enriched in the previous stages, allowing a model with faceted and word search and stronger support for knowledge formation. During the classification phase, the end-user consumes a structure based on facets and pivots, following the selected preferences and their query routines. In the machine learning step, we first performed model training using the Learning to Rank (LTR) technique [42] and the Support Vector Machine (SVM) algorithm [43]. Next, we performed cross-validation, a process that estimates the training error and validates the dataset selected for training.
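The TF-IDF weighting applied in the indexing stage can be sketched as follows, using one common formulation (the system's exact formula may differ, and the documents are toy examples):

```python
import math

# TF-IDF sketch: term frequency in a document times the inverse document
# frequency across the collection (illustrative data and formulation).

docs = {
    "d1": "cartel cartel fuel",
    "d2": "merger fuel",
    "d3": "cartel merger review",
}

def tf_idf(term, doc_id, docs):
    words = docs[doc_id].split()
    tf = words.count(term) / len(words)                       # term frequency
    df = sum(1 for text in docs.values() if term in text.split())
    idf = math.log(len(docs) / df)                            # inverse doc freq
    return tf * idf

# "cartel" occurs twice in d1, so it outweighs "fuel" there
assert tf_idf("cartel", "d1", docs) > tf_idf("fuel", "d1", docs)
```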
In general, the jurisprudence system allows the user to enter a set of keywords and retrieve documents related to that set, also considering synonyms relevant to the search.
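The synonym-aware retrieval described above can be sketched with a controlled-vocabulary synonym map consulted before matching; the synonym pairs and documents below are invented for illustration:

```python
# Synonym expansion sketch: user keywords are expanded with a
# controlled-vocabulary synonym map before retrieval (illustrative data).

synonyms = {
    "cartel": {"collusion"},
    "merger": {"concentration act"},
}

def expand(keywords):
    """Return the keywords plus their controlled-vocabulary synonyms."""
    expanded = set(keywords)
    for kw in keywords:
        expanded |= synonyms.get(kw, set())
    return expanded

def search(docs, keywords):
    """Return ids of documents containing any expanded term."""
    terms = expand(keywords)
    return sorted(doc_id for doc_id, text in docs.items()
                  if any(t in text for t in terms))

docs = {
    "d1": "evidence of collusion in the fuel market",
    "d2": "concentration act approved with restrictions",
}
# searching for "cartel" also retrieves d1 via the synonym "collusion"
```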
The system interface provides a search on classified data in which the returned results are accompanied laterally by more detailed filters, either by subject categorization or related tags, composing a search filter. When consolidating the search, it should be possible to sort the results by relevance, most accessed, or most referenced documents. The architecture of the proposed solution was developed using the client-server model. Figure 2 presents the architecture of the developed jurisprudence system.
At each cycle of classification re-evaluation from other unsupervised models performed by the proposed architecture, the model itself performs feedback using new training bases. For example, suppose we train the model on three documents and then evaluate two new documents. At the end of this process, the training base is fed back with the five documents, updating it and calibrating the model for further training. This process adjusts the classifications (or any data from the unsupervised treatment), and the model proposed in the architecture (Figure 2) can handle a larger supervised (trained) database.
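This feedback cycle, three training documents extended by two newly evaluated ones, can be sketched as follows (documents, labels, and the stand-in classifier are invented for illustration):

```python
# Feedback cycle sketch: newly evaluated documents are appended to the
# training base for the next training round (illustrative data).

training_base = [
    ("decision about cartel", "cartel"),
    ("merger filing notice", "merger"),
    ("cartel fine imposed", "cartel"),
]

def evaluate_and_feed_back(training_base, new_docs, classify):
    """Classify new documents and append them to the training base."""
    for doc in new_docs:
        training_base.append((doc, classify(doc)))
    return training_base

# Trivial stand-in for the classification/unsupervised treatment step
classify = lambda doc: "cartel" if "cartel" in doc else "merger"

new_docs = ["cartel leniency agreement", "merger remedies accepted"]
training_base = evaluate_and_feed_back(training_base, new_docs, classify)
# the base grew from three to five documents for the next training round
```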
The proposed architecture (Figure 2) has public and private access to the client interface code domain, hosted in a demilitarized zone (DMZ) but with security guarantees (HTTPS, attack treatment, among others) enforced by an edge firewall. In this context, to ensure high availability of access to the client interface code, an Apache Reverse Proxy (#1) acts after the edge firewall, filtering the results and, mainly, performing load balancing between the two servers that provide high availability (fault tolerance and round-robin load balancing). It is important to emphasize that the data domain and the properly authenticated Solr search APIs run on CADE's militarized network (MZ). Thus, the client interface code accesses the data domain through an adequately secured call, and a second-layer Apache Reverse Proxy (#2) maintains high availability. Only the Apache Reverse Proxy (#2) has specific access directives, using header elements of each call to the Solr APIs to ensure the authenticity of the requesting user, in this case, the Proxy itself.
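The round-robin balancing performed by the reverse proxy amounts to forwarding each incoming request to the next server in rotation; a minimal sketch (hostnames are placeholders, not CADE's real infrastructure):

```python
import itertools

# Round-robin load balancing sketch: requests alternate between the two
# servers in the pool (hostnames are placeholders).

servers = ["app-server-1", "app-server-2"]
rotation = itertools.cycle(servers)

def route(request_id):
    """Assign a request to the next server in the rotation."""
    return request_id, next(rotation)

assignments = [route(i) for i in range(4)]
# if one server fails, it is removed from the pool and the remaining
# server keeps answering (fault tolerance)
```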
The high availability of the Solr environment is guaranteed by a balanced model of adequately configured servers using the Zookeeper model, according to the rules defined by Apache for Apache Solr instances. Thus, 2 Apache Solr servers and 3 Zookeeper servers (2n + 1 of the "n" Apache Solr servers) were positioned both to serve the requests coming from the NodeJS interface layer (through Apache Reverse Proxy #2) and the internal data loading/downloading processes that make indexed textual data available in support of the model (Figure 2). Furthermore, channel bandwidth, memory, and disk, specific to the model, were measured and provisioned as far as possible, according to the recommendations for each element/layer. On the client side, we use the AngularJS framework [15,44], Bootstrap 4 [45], HTML5 [46,47], and CSS 3 [46] for building the front-end. The application consumes a REST API [48], built with Node.js [49], which makes the requests and does all the processing of the information stored in the databases used. Thus, the front-end of the jurisprudence system is responsible for rendering on screen all the functionalities available to the users of the application (Figure 2).
Commonly, each interaction performed by the user in the application results in a request to the controller [15,50], between the front-end and back-end modules. This request can be anything from a page change (where new information must be loaded) to a new query to the jurisprudence system database. It is important to note that this request exchange interaction between the different modules (back-end to front-end and vice-versa) uses the HTTP protocol [51] through asynchronous requests (Figure 2).
On the back-end of the jurisprudence system, we use Node.js technology [49,52] as the execution environment, in which two modules were implemented: (1) the Solr API [40] together with the MySQL Client Driver [53] to communicate with the SEI database using the MySQL database management system [17,54,55]; (2) the back-end application, a REST API [48], which interacts with the Solr API on "/api/select" calls. The Solr API is responsible for accessing the Lucene data persistence kernel. The communication between the Angular client and Solr goes through the API, which serves as a proxy that controls its access. In the back-end API, we use a module to communicate with the Solr environment (Figure 2).
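The access-control role this proxy plays can be sketched language-agnostically: only whitelisted parameters from a "/api/select" call are forwarded to Solr, and the authenticating header is added by the proxy itself (parameter names and the token are illustrative; the real back-end is implemented in Node.js):

```python
# Proxy sketch: whitelist client parameters before forwarding to Solr and
# attach the proxy's own credential (names and token are illustrative).

ALLOWED_PARAMS = {"q", "rows", "start", "fq", "hl"}

def forward_to_solr(client_params, proxy_token):
    """Return the sanitized parameters and headers for the Solr call."""
    safe = {k: v for k, v in client_params.items() if k in ALLOWED_PARAMS}
    headers = {"Authorization": f"Bearer {proxy_token}"}  # added by the proxy
    return safe, headers

params, headers = forward_to_solr(
    {"q": "cartel", "rows": 10, "qt": "/update"},  # "qt" is not whitelisted
    proxy_token="internal-proxy",
)
# the non-whitelisted "qt" parameter is dropped before reaching Solr
```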

Method
In this paper, we performed a literature review to investigate the characteristics of existing Jurisprudence Search systems, in order to identify challenges and functionalities that we could incorporate in the development of such a system for a Brazilian Federal Public Administration agency. In addition, we conducted a survey to identify the Jurisprudence Search systems used by Brazilian agencies and their features and functionalities.

Research Questions
In order to achieve the main goal of this research, we defined the following research questions: RQ.1. Which agencies of the Brazilian Public Administration use a Jurisprudence Search System, and what are the characteristics of these systems?
RQ.2. What functionalities do the Jurisprudence Search Systems offer?
To answer the research questions, we conducted a literature review and a survey of Brazilian agencies. The survey was composed of 52 questions and addressed to all agencies of the Brazilian Public Administration. In total, there are 195 agencies, divided among the executive, judiciary, courts of accounts, public ministry, and legislative branches, as shown in Figure 3. We contacted the 195 agencies through institutional e-mails. The questionnaire was applied between June 2021 and July 2021, received responses from 107 agencies (55% of the total), and the average response time was 12 min. Table 1 contains all the questions that the agencies answered in the survey.

Q1 What is the name of your agency?
Q2 What is the power classification of your agency?
Q3 What is the sphere classification of your agency?
Q4 Does your agency have a textual database processing system?
Q5 Was the system developed in-house or contracted out?
Q6 What database is used in the textual database processing system?
Q7 What is the "other" database option?
Q8 What programming language is used in the textual database processing system?
Q9 Describe the "other" programming language option.
Q10 If the system is public, what is the link to the textual database processing system?
Q11 Does the system have a user's manual?
Q12 If the manual is public, what is the link to the user manual?
Q13 What indexing engine is used by the textbase processing system?
Q14 Describe the "other" option of the indexing engine question.
Q15 What are the document formats of the textual bases processing system?
Q16 Describe the "other" option of the document format question.
Q17 In the textual base, do you have digitized documents (obtained from paper scanning)?
Q18 What is the ratio of digitized documents to native digital documents?
Q19 Is the system compliant with LGPD requirements?
Q20 Has a usability analysis of the textual basis processing system been performed?
Q21 Does the textual base's processing system have filters by categories (date of issue, type of process, units, areas of interest, subjects, among others)?
Q22 Does the textual base's processing system use logical operators (and, or, not, and others)?
Q23 Does the textbase processing system offer the option of exporting the results (PDF, CSV, etc.)?
Q24 Does the textual database processing system index the contents of other agencies?
Q25 Does the textual processing system index various document types (PDF, Word, Excel, other)?
Q26 Does the textual processing system use any method to define the relevance of documents?
Q27 If it does, please describe the method used to define the relevance of the documents.
Q28 Does the textbase processing system use a Controlled Vocabulary?
Q29 If the vocabulary is public, please put the link in the field below.
Q30 Does the textbase processing system use an ontology?
Q31 If it does, please describe the ontology used.
Q32 Does the textbase processing system use any multimedia data extraction process (for example, deduplication of audio and video files)?
Q33 If a multimedia data extraction process exists, describe it.
Q34 Does your agency use statistical methods in the textual base processing system?
Q44 Is any natural language processing (NLP) technique used in the textual base processing system?
Q45 If there is, please describe the NLP techniques used.
Q46 Is there any study/publication on the use of artificial intelligence in the agency's textual base processing system?
Q47 Is there any study/publication on the use of artificial intelligence in the agency's textual bases processing system?
Q48 Share the link to the study/publication or describe them.
Q49 What functionalities do you think a textual bases processing system should have?

Among the 107 agencies that participated in the survey, 55 belong to the Judiciary, 28 to the Executive, 14 to the Courts of Accounts, 8 to the Public Ministry, and 2 to the Legislative. Concerning the sphere classification, 66 agencies belong to the Federal Public Administration, 38 to the state level, and 3 to the municipal level, as presented in Figure 4.

Figure 5a shows that 42 agencies reported that they do not use a textual base processing system, while 56 stated that they do. In addition, 44 agencies stated that the textual base processing system was developed internally, and Figure 5b shows that 11 agencies contracted the system out.

Regarding the database used in the textual database processing system, 14 agencies reported using Apache Lucene to index and retrieve textual data, 12 use Oracle, 9 use MS-SQL Server, 5 use PostgreSQL, 3 use BRS/Search, and 2 use MySQL/MariaDB. Moreover, only one agency each uses ElasticSearch, IBM DB2, Oracle/BRS Search, or Solr, and one agency uses the System Database, as presented in Figure 6a.

Twenty-four agencies used the Java programming language to develop the textual base processing system, seven agencies used Angular and Python, six used PHP, and three developed the system in C#. Finally, one agency each reported using Apache Lucene Java, ASP, ASP + .Net, C#, SQL Server Full-Text Search, Delphi, Java, Google Search Appliance, JavaScript, and Freemarker, as presented in Figure 6b. Thirty-six agencies stated that the developed system does not have a user's manual, and twenty-one stated that it does.
Regarding the indexing engine used by the textual database processing system, 19 agencies reported that they use Solr, 14 use Elasticsearch, 10 use a custom relational database, 6 use BRS/Search, and 2 use Apache Lucene. Google Search Appliance, Hibernate Search, Apache Lucene, SharePoint, and SQL Server Full-Text Search were each reported by only one agency, as shown in Figure 7a. Regarding the formats of the documents in the textual base processing system, 33 agencies informed that they use the PDF format, 23 use HTML, 20 store text in the database, 17 use PDF/A, 10 use Word/Excel, two use CSV, two use ODT and RTF, and one agency claimed to use XML, as shown in Figure 7b. Figure 8a shows that twenty-seven agencies informed that the textual base contains digitized documents obtained by scanning paper documents, and 29 informed that it does not. Figure 8b shows the proportion of digitized documents in relation to born-digital documents: 35 agencies stated that they have from 0% to 20%, nine agencies between 21% and 40%, eight agencies between 41% and 60%, four agencies between 61% and 80%, and only one agency reported having between 81% and 100% of its documents digitized. Regarding whether the Jurisprudence Search system developed by the agencies that answered the survey complies with the principles of the General Law on Personal Data Protection (LGPD), only 15% of them said yes, 52% were neutral, and 33% disagreed that the system complies, as presented in Figure 9 (Q19). This result is worrying, since all systems developed by Brazilian agencies must comply with the LGPD. In this sense, the system developed in this research meets this requirement, i.e., the Jurisprudence Search system developed for CADE is compliant with the LGPD.
Furthermore, 39% of the agencies stated that they perform a usability analysis of the textual database processing system, 45% were neutral, and 16% stated that there is no usability analysis, as presented in Figure 9 (Q20). These results reinforce the importance of LGPD compliance and usability best practices in the development of such systems.

RQ.2 What Functionalities Are Offered by the Jurisprudence Search Systems?
Concerning the functionalities of the textual base processing system, 47% of the agencies participating in the survey informed that the system has filters by categories, such as date of issuance, type of process, units, areas of interest, and subjects, among others. On the other hand, 41% of the agencies were neutral, and 12% stated that the developed system does not have this functionality, as presented in Figure 10 (Q21). Figure 10 (Q22) shows that 69% of the agencies strongly agree or agree that the textual base processing system uses logical operators (and, or, not, among others), 15% were neutral, and 15% strongly disagree or disagree. Further, 40% of the agencies strongly agree or agree that the system offers the possibility to export the search result to PDF, CSV, and other formats, 35% were neutral, and 25% strongly disagree or disagree (Figure 10 (Q23)). Regarding whether the textual base processing system indexes content from other agencies, only 38% strongly agree or agree, 50% were neutral, and 12% strongly disagree or disagree (Figure 10 (Q24)). In addition, 38% of the agencies stated that the system indexes various document types, such as digital documents, PDF, Word, and Excel. However, 33% of the agencies were neutral, and 29% strongly disagree or disagree (Figure 10 (Q25)). Furthermore, 24% of the agencies stated that the textual base processing system uses some method to define the relevance of documents, 53% were neutral, and 24% stated that the agency does not use any method (Figure 10 (Q26)). Among the methods used to define document relevance, agencies stated: "Lucene's standard relevance calculation is used, based on term count, term frequency, inverted document frequency, and field size." "Relevance by publication date." "The ranking features offered by SQL Server Full Text Search are used for sorting the results."
"The ElasticSearch database has a method of gauging document relevance from the queried term." Figure 10 (Q28) shows that 27% of the agencies stated that their textual base processing system uses a controlled vocabulary, 53% were neutral, and 20% stated that the system used by the agency does not have a controlled vocabulary. Moreover, only 12% of the agencies that participated in the survey reported using some ontology in their textual base processing systems, 25% were neutral, and 62% reported that they do not use any ontology (Figure 10 (Q30)). This finding reveals that the use of ontologies is one of the main differentials of the Jurisprudence Search system developed in the context of this research compared to the systems developed by other agencies. Only one agency participating in the survey stated that its textual base processing system uses multimedia data extraction, such as deduplication of audio and video files, which also differentiates our system (Figure 10 (Q32)).
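One respondent described Lucene's default relevance calculation as based on term count, term frequency, inverted document frequency, and field size. As an illustration only (not the exact Lucene formula, whose boosts, norms, and smoothing differ), a minimal TF-IDF-style scorer can be sketched as follows:

```python
import math

def tfidf_scores(query_terms, docs):
    """Score documents with a simplified TF-IDF: term frequency times
    inverse document frequency, damped by a field-length norm (shorter
    fields weigh more). Illustrative only; Lucene's actual similarity
    adds boosts, smoothing, and coordination factors."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    scores = []
    for tokens in tokenized:
        score = 0.0
        for term in query_terms:
            tf = tokens.count(term)                # term frequency
            if tf == 0:
                continue
            df = sum(1 for t in tokenized if term in t)
            idf = math.log(n / df) + 1.0           # inverse document frequency
            norm = 1.0 / math.sqrt(len(tokens))    # field-length norm
            score += tf * idf * norm
        scores.append(score)
    return scores

docs = ["merger review decision", "cartel decision decision", "press release"]
print(tfidf_scores(["decision"], docs))  # doc with two occurrences ranks first
```

Ranking the documents by these scores reproduces the intuition in the respondent's description: repeated query terms and rarer terms push a document up the result list.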
Only 18% of the agencies reported that there is some study or publication on the use of Artificial Intelligence in the agency's textual base processing system, 36% were neutral, and 45% reported that there is no such study, as presented in Figure 10 (Q46).
Forty-nine agencies reported that they do not use statistical methods in the textual base processing system. Eight agencies stated that they use supervised classification, document classification, similarity, and document clustering.
Regarding the text mining techniques used in the textual base processing system, 12 agencies reported document classification, and ten use document similarity. Six agencies reported using document clustering, and six use document summarization. Two agencies reported using named entity recognition, two use topic modeling, one agency reported using chatbots, and one agency uses term frequency, as presented in Figure 11. Concerning the supervised Machine Learning techniques used for text treatment, 70% of the agencies informed that they use classification, 20% use regression, and 10% use sequential patterns, as presented in Figure 12a. Forty agencies reported using clustering as an unsupervised Machine Learning technique, twenty use sequential patterns, ten use deviation detection, ten use principal component analysis, and ten reported using vectorization (Figure 12b). Thirty-seven agencies reported that the system they use has no technique or template for extracting specific parts of documents, such as identification, comments, and conclusion. Only 16 agencies informed that the system they use has some extraction technique or model. Among those mentioned were textual search by terms, search by menu and structured abstract, models based on Machine Learning, and models based on grammars and features.
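The document-similarity technique reported by ten agencies is commonly implemented as the cosine between term vectors. The following is a minimal sketch using raw term counts; production systems typically combine this with TF-IDF weighting and proper tokenization:

```python
from collections import Counter
import math

def cosine_similarity(a, b):
    """Cosine similarity between two texts over raw term-count vectors:
    dot product divided by the product of vector norms. A minimal
    illustration of the document-similarity technique; whitespace
    tokenization and unweighted counts are simplifications."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

print(cosine_similarity("cartel fine decision", "cartel decision"))
```

Identical texts score 1.0, texts sharing no terms score 0.0, and partially overlapping decisions fall in between, which is what lets a search system surface "similar jurisprudence".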
Forty-four agencies reported that they do not use Natural Language Processing (NLP) techniques in the textual base processing system, and nine agencies said they do. Among the techniques mentioned are: (a) preprocessing and vectorization of the content of the case records; (b) data collection, raw text extraction, sentence splitting, tokenization, normalization (stemming, lemmatization), stop-word removal, and part-of-speech tagging; (c) part-of-speech tagging, machine learning (classification, clustering, named entity recognition), chunking with regular expressions, N-gram parsing, and feature-based grammars; and (d) tokenization, stop words, stemming, thesaurus, and vectors.
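The preprocessing steps the agencies listed — tokenization, stop-word removal, stemming, and n-gram extraction — can be sketched as below. The stop-word list and suffix-stripping rules are illustrative stand-ins only; a real system would use a full stop list for the target language (Portuguese, for CADE) and a proper stemmer such as Porter's or RSLP:

```python
import re

STOP_WORDS = {"the", "of", "a", "in", "and"}  # illustrative stop list

def preprocess(text):
    """Minimal NLP pipeline sketch: tokenize, drop stop words, and
    strip a few common suffixes as a crude stand-in for stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())                 # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]          # stop words
    return [re.sub(r"(ing|ions?|s)$", "", t) for t in tokens]    # "stemming"

def ngrams(tokens, n=2):
    """Contiguous n-grams over a token list, as used when extracting
    n-grams for indexing."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(preprocess("The merging of the decisions"))
print(ngrams(["economic", "defense", "council"]))
```

Each stage feeds the next: the stemmed tokens become index terms, and the n-grams over them become compound index terms that preserve some word order.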
We also investigated the perception of Brazilian public administration agencies regarding which functionalities a textual base processing system should have. Some of the answers were: "Systems should offer similar content identification." "Systems should also search the full content, not just the descriptors." "The systems must allow searching by relevance through advanced filters, such as document type, unit, subject, signatory, and dates. In addition, they should perform synonym handling and stemming." "Systems should perform keyword searches, Boolean operators for jurisprudence search, semantic similarity search, cluster analysis, and abstract generation." Regarding suggestions for improvements to the agency's existing textual base processing system, some responses were: "Insertion of other databases, identification of a 'paradigm decision' in the results of the jurisprudence search, identification of citations to decision contents with binding effects in the decisions resulting from the search. Improve response time and use machine learning techniques to improve result ranking." "Legislative reference and search using fuzzy logic and Artificial Intelligence. In addition, improvements in document indexing, user interface, and user experience design need to be incorporated." Concerning how the use of Artificial Intelligence, Machine Learning, and Text Mining techniques can improve the agency's finalistic activities, some answers were: "Grouping similar processes and offering a document template to treat each group, offering greater efficiency in audit actions in the selection of objects of greater relevance, risk, and materiality. In addition, assisting in the decision making of the subject areas when preparing the annual inspection plan." "Through the optimization and automation of manual and repetitive work, allowing greater agility in the analysis of processes. 
Moreover, in the classification and recognition of textual patterns, it is possible to search for procedural pieces and opinions that can help and speed up the construction of new opinions. Thus, the use of these techniques is promising for enabling the sharing and dissemination of knowledge." "Automation of routine tasks, allowing the team to focus on more strategic activities; greater assertiveness and speed in performing activities; quick analysis of large volumes of data, providing better subsidies for decision making; analysis of historical and current facts to make predictions about future events, enabling, for example, better planning in inspection activities and behavioral analysis of the regulated entities, aiming to evolve the Agency's regulation and improve the return to the population."

Discussion
Systems for processing textual bases are present in most Public Administration agencies, but there is no consensus on a solution architecture model for implementing them, nor an indication of best practices and adopted standards. However, the quantitative analysis presented in the survey shows that the technologies and techniques used by these systems are in line with the model proposed for CADE, such as the predominance of Apache Lucene as a text search library, since this technology allows high-performance searches over large volumes of information.
In the Jurisprudence Search system, the proposed solution architecture uses Apache Solr, which exposes Apache Lucene's resources, to index data from the SEI database. The Java language appears as predominant in the survey results and is used in CADE's system because it is the native language of Apache Lucene.
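Conceptually, Lucene (and hence Solr) answers queries against an inverted index that maps each term to the documents containing it, which is also what makes the logical operators discussed above cheap to evaluate: AND is an intersection of posting sets, OR a union. The sketch below illustrates the data structure only; it is not the SolrJ/Lucene API used in CADE's actual system:

```python
from collections import defaultdict

class InvertedIndex:
    """Conceptual sketch of the inverted index underlying Lucene/Solr:
    each term maps to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> {doc ids}
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search_and(self, *terms):
        """Boolean AND: intersect the posting sets of all terms."""
        sets = [self.postings.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    def search_or(self, *terms):
        """Boolean OR: union of the posting sets of all terms."""
        return set().union(*(self.postings.get(t.lower(), set()) for t in terms))

idx = InvertedIndex()
idx.add(1, "merger review decision")
idx.add(2, "cartel decision")
idx.add(3, "press release")
print(idx.search_and("merger", "decision"))
```

A real deployment adds per-term frequencies and positions to the postings (for ranking and phrase queries), which is what Solr layers on top of this basic structure.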
Concerning the search resources of the surveyed systems, most of them have functionalities similar to those implemented in our search system. For example, many respondents cited the use of filters, logical operators, PDF file treatment, and export. The solution proposed for CADE, besides implementing these functionalities and resources, has the differential of basket, search history, and highlighting features.
In the development of CADE's Jurisprudence Search system, we used Artificial Intelligence techniques in conjunction with statistical techniques, natural language processing, discourse analysis, text mining, and machine learning to form a supplementary knowledge base. Unfortunately, most of the agencies participating in the survey did not use any Artificial Intelligence techniques, although some mentioned the use of Machine Learning for data classification and clustering. Thus, we can infer that CADE's Jurisprudence Search system differs from the systems used by other Brazilian public administration agencies by employing Artificial Intelligence and Ontology in the proposed solution.
It is noteworthy that Machine Learning techniques can be categorized into supervised and unsupervised and applied alone or in combination, depending on the needs and the defined database. Therefore, a preliminary analysis is necessary to identify the appropriate techniques for each scenario. The technologies used in this research were Artificial Intelligence, Machine Learning, and text mining. The ML methods and techniques used in this work support information retrieval tasks such as extraction (using facets), classification (clustering, summarization, named entities), indexing (by extracting n-grams), and natural language processing.
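On the unsupervised side, document clustering is often performed with k-means over vectorized documents. A minimal, deterministic sketch on 2-D points (standing in for document vectors; the initial centroids are supplied by the caller, an assumption made here so the run is reproducible) is:

```python
def kmeans(points, centroids, iterations=10):
    """Minimal k-means sketch: alternate between assigning each point
    to its nearest centroid (squared Euclidean distance) and moving
    each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        # assignment step
        clusters = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # update step (keep a centroid in place if its cluster is empty)
        centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
cents, clusters = kmeans(points, [(0, 0), (10, 10)])
print(cents)
```

In a jurisprudence setting, the points would be TF-IDF vectors of decisions, and each resulting cluster would group decisions on related themes, as the agencies' reported use of clustering suggests.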
The open questions of the survey provided important information about the perceptions of the experts of each organization that uses search systems, such as: (i) improve usability, accessibility, and user experience of the search systems; (ii) improve text indexing; (iii) index other types of content, such as audio and video of plenary sessions; (iv) search by relevance, keywords, metadata, and advanced filters, with treatment of synonyms; (v) export documents and search results; (vi) recommend other documents; (vii) identify documents with similar content; and (viii) incorporate other textual bases. These insights are essential for the continuous improvement of Jurisprudence Search systems.

Limitations and Threats to Validity
As in any research that investigates users' perceptions of a given scenario, there are threats to validity. Regarding the fidelity of the participants' answers, we cannot guarantee that all of them answered according to the actual scenario of the Brazilian agencies, nor that the information represents all the technologies and techniques applied in the development of the Jurisprudence Search systems they use. To mitigate this threat, we did not make the identifying information of the survey participants public. In addition, the results of the quantitative data analysis do not impact the evaluation of the agencies by the controlling bodies.
Regarding the system developed in the case study of this research, the Jurisprudence Search system currently has some limitations: (a) the solution cannot yet apply statistical and stochastic processes using Artificial Intelligence techniques to interpret the indexed terms in the database, which poses a challenge in moving from simple natural language processing to the intended final understanding that would support the marking of summaries and the review of the main sentences with their terminologies; (b) concerning security aspects, the imposition of multiple levels of access produces undesirable latencies as the indexed bases of legal documents and other collections in Apache Solr grow; (c) the use of free visual components with a high impact on the execution of screen templates creates situations of anticipated and synchronous treatment in screen construction, which increases the total loading time of the results data; (d) the analytical view of the data in clusters and correlations, one of the fundamental supports for the internal activities of analysis and interpretation of legal pieces on a given theme or subject, is still being implemented; and (e) the ontological treatment of data, for example, the treatment that allows observing the interconnection model among the several named entities and their physical and legal relations within their activity sectors and shareholdings, which is the object of CADE's observation, has not yet been implemented in the available solution. To mitigate these limitations, we are developing new functionalities to meet the needs of the Jurisprudence Search system.

Conclusions
This paper presented a solution for CADE's Jurisprudence Search system to perform textual database processing. First, we performed the collection, retrieval, and indexing of the terms to build the database. Afterward, we applied Artificial Intelligence techniques, statistical methods, summarization, indexing, named entity recognition, natural language processing, and ontology. These techniques give the Jurisprudence Search system a differential compared to the systems used by other Brazilian agencies. The main contributions of this system are more accurate search results and the treatment of structured and unstructured data from different sources and formats, aiming to build a knowledge base that supports legal decision-making processes and analyses.
We also conducted a survey to identify which Brazilian public administration agencies have textual database processing systems and which technological resources, artificial intelligence, and morphological construction techniques they use. Our findings revealed that Apache Solr is the main indexing engine used by these systems, and that Apache Lucene and the Java language were the most used in their development. However, no agency participating in the survey stated that it uses ontology to organize and structure its information. In addition, more than 85% of the Jurisprudence Search systems used by the agencies are not LGPD compliant.
As future work, we will implement further improvements to the Jurisprudence Search system to build a knowledge base, using templates to facilitate the analysis of jurisprudence opinions. In addition, we will monitor users' perceptions of the system regarding decision support.