An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing

Abstract: Criminal investigations collect and analyze the facts related to a crime, from which the investigators can deduce evidence to be used in court. Criminal investigation is a multidisciplinary and applied science, which includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents is a helping hand to the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. There exists a wide set of dedicated tools, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained in the classification of the annotated named-entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
Dataset: https://github.com/goncalofcarnaz/Annotated-Corpus-of-Criminal-Related-Portuguese-Documents


Background and Summary
Criminal activity is present daily, in a multiplicity of illegal actions in several domains, such as drug trafficking, computer crime, and theft, just to mention a few examples. Upon the occurrence of a crime, criminal police investigators are in charge and start a set of actions to enable the construction of the so-called chain of custody [1] to identify the alleged culprits and to present them in court.
These actions are multidisciplinary and, depending on the crime, they may involve distinct tasks, such as digital and biological forensic analysis, interviews with witnesses, and interrogations of suspects and other individuals who may be implicated. All of these actions produce faithful textual reports throughout the investigation, where all the facts are extensively described, together with an exhaustive identification of the individuals, places, and other entities that could be relevant for the course of the investigation.
Data 2021, 6, 71
During the investigation, these reports are carefully analyzed, searching for relations between names, places, license plates, and other entities. This is a manual and time-consuming task for the investigators, who usually concentrate the information gathered on a wall dashboard, where many pieces of paper with names and entities are posted and, in a certain way, visually interconnected. Some tools are being used to help the investigators' work, such as customized Microsoft™ Excel spreadsheets or the widely used criminal investigation tool IBM™ i2 Analyst's Notebook (see https://www.ibm.com/products/i2-analysts-notebook (accessed on 1 June 2021)).
Several comprehensive research works have also produced dedicated tools and frameworks to automatically extract entities and their relationships from a set of documents. Examples include Jigsaw [2], the Police Intelligence Analysis Framework [3], and the Combined Websites and Textual Document Framework (CWTDF) [4].
There are also crime-related ontologies to interpret terms and relations in this context, and they are further used for knowledge representation in some existing frameworks. Some examples are the Project Multi-Modal Situation Assessment and Analytics Platform (MOSAIC) [5]; CAPER [6], which simultaneously uses the European LEAs Interoperability Ontology and the Multi-Lingual Crime Ontology; and the ePOOLICE [7] project. However, these tools usually have two main flaws. Firstly, they are not multi-lingual, and the existing annotated corpora for the criminal domain are mostly available for the English language. Secondly, these tools possess limited visualization features, namely for graphically representing the recognized named-entity relationships.
Portuguese has around 230 million native speakers. It is the ninth most spoken Indo-European language (see https://www.visualcapitalist.com/100-most-spoken-languages/ (accessed on 1 June 2021)) and the sixth most spoken by number of native speakers (see https://www.babbel.com/en/magazine/the-10-most-spoken-languages-in-the-world (accessed on 1 June 2021)). To the best of the authors' knowledge, an annotated set of crime-related documents written in the Portuguese language, composed of a training and testing corpus, which can be widely used to evaluate information retrieval system performance has not yet been developed. Bearing that in mind, the construction of an annotated corpus for crime-related documents is of crucial importance.
In the criminal domain, if the corpus content represents the specifications of linguistic phenomena, and if it can be extrapolated to the larger population from which it is taken, then it is possible to say that it "represents that language variety". In [8], the authors proposed to extract attributes that can be used to define the different types of texts and contribute to creating a balanced corpus. The criminal domain has its own vocabulary, narrative, and writing style. Consequently, criminal domain experts advise the inclusion of criminal news and official websites, arguing that they follow the same narrative form and similar requirements of the criminal investigation reports.
This paper presents an annotated corpus for the Portuguese language, which can be applied to information retrieval from crime-related documents. The corpus was evaluated in [9], where a framework was deployed to apply Natural Language Processing (NLP) and Machine Learning (ML) methods, with the aim of extracting and classifying named-entities and relations from Portuguese criminal reports and documents. A 5W1H (Who, What, Why, Where, When, and How) information extraction method was also applied, and the relations extracted were stored and represented in a graph database.
The corpus was evaluated by a prototype composed of the following components and technologies: the Apache Tika toolkit for detection and extraction of metadata and text from files; Newspaper3k (https://newspaper.readthedocs.io/ (accessed on 9 June 2021)) for article scraping and curation; NLPNET (https://www.github.com/erickrf/nlpnet/ (accessed on 9 June 2021)), a Python library for NLP tasks based on neural networks; the Apache OpenNLP toolkit (http://opennlp.apache.org/ (accessed on 9 June 2021)); and the NLPPort (https://www.github.com/rikarudo/NLPPORT/ (accessed on 9 June 2021)) toolkit.
For Named-Entities Recognition (NER) evaluation, the documents were manually annotated and processed by the framework. The NER achieved an F1-score of 0.73, while 5W1H (Who, What, Whom, When, Where, How) information extraction performance attained an F1-score of 0.65.
The proposed crime-related documents dataset for the Portuguese language has the following benefits for researchers and practitioners: (1) a clean and organized set of Portuguese crime-related documents in XML format; (2) a corpus with annotated named-entities extracted from the available documents; (3) an initial approach of annotated documents to answer the 5W1H questions set; and (4) an annotated corpus for the narcotics type of crime.
The remainder of this paper is organized as follows. Section 2 describes the dataset collection and processing, namely the criminal news articles, PGdLisboa News, and Criminal Investigation Reports. It also details the anonymization applied to these documents and the dataset that was built to extract the semantic information from sentences, by using the 5W1H approach. Section 3 details the methods applied to process and use the data. Finally, Section 4 describes the technical validation of the dataset.

Data Description
The dataset described in this paper corresponds to an annotated corpus derived from Portuguese crime-related investigation reports and criminal news and is available at the following GitHub repository: https://github.com/goncalofcarnaz/Annotated-Corpus-of-Criminal-Related-Portuguese-Documents (accessed on 25 June 2021). It was tested and evaluated in the SEMantic Crime framework, developed by the authors and recently published in [9,10].
The dataset is composed of a set of XML files, each one corresponding to an annotated document. The crime-related documents are of three distinct types (criminal news articles, PGdLisboa news, and criminal investigation reports) and were originally retrieved from websites and conditioned-access police reports. Table 1 enumerates the number of crime-related documents and syntactic components that were used to build the corpus. The components retrieved from the documents (described in Table 1) produced a set of named-entities of different types, which are indicated in Table 2. The dataset was built by crawler software (described below), which processes the available documents that have the following general features:
• are written in free-text form, whether in unstructured or semi-structured format; and
• can be available online or offline.
The XML files were created from files with different formats, namely Microsoft™ Word, PDF, and HTML. The following applications were used to convert each file to the XML formats:
• Apache Tika toolkit to process Microsoft™ Word and PDF files; and
• Newspaper3k toolkit, for online article scraping and curation, to process HTML files.
A cleaning method was developed, which enforces the following set of rules:
• remove spaces, line breaks, duplicate white-spaces, and tabs;
• ensure that commas are followed by a space;
• ensure that each sentence contains a single end-mark;
• remove all characters that are not in the ASCII character set; and
• split attached words, for example, "RuaPrincipal" should be replaced by "Rua Principal".
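The rules above can be illustrated with a minimal Python sketch. It is not the authors' implementation (the repository provides a Java class); the function name `clean_text` and the regular expressions are assumptions chosen for illustration, and the lowercase-to-uppercase split is only a heuristic for attached words.

```python
import re


def clean_text(text: str) -> str:
    """Apply the listed cleaning rules to one block of text (illustrative only)."""
    # Drop every character outside the ASCII set (accented letters are removed, not transliterated).
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Split attached words at a lowercase-to-uppercase boundary, e.g. "RuaPrincipal".
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # Make every comma be followed by exactly one space (and preceded by none).
    text = re.sub(r"\s*,\s*", ", ", text)
    # Keep a single end-mark where several are stacked, e.g. "!!" or "..".
    text = re.sub(r"([.!?])[.!?]+", r"\1", text)
    # Collapse line breaks, tabs, and repeated spaces into one space.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("RuaPrincipal ,\n\tLisboa!!"))
```

Applied literally, the ASCII rule also strips Portuguese diacritics, and the comma rule would touch decimal commas, so a production version would likely need exceptions for both.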
A method to process abbreviations and acronyms was proposed and validated by Carnaz et al. [9]. In general terms, a database of acronyms and abbreviations was set up and is being fed with new, confirmed entries. A pattern-based rule set is used to search for terms that are candidates to be considered abbreviations or acronyms. If these terms already exist on the list, they are expanded. Otherwise, they are annotated and added to the list.
Figure 1 depicts the overall process to collect and process the data. The documents are processed and converted to an XML format. After that, each document undergoes an "Extract, Transform, Load" process, to be subsequently annotated. Several tasks were applied to extract data from police reports and open sources:
• data were extracted from websites and files in Microsoft™ Word, PDF, or HTML formats;
• words or symbols that may cause "noise" or are not relevant were removed by the cleaning tasks;
• transformations were applied to expand acronyms and abbreviations; and
• an XML schema and the corresponding XML files were created (see Sections 3.1 and 3.2) for each crime-related document type.
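The abbreviation and acronym handling described above (expand known terms, set aside unknown candidates) could look roughly like the following Python sketch. The candidate pattern, the seed dictionary, and the names `KNOWN`, `CANDIDATE`, and `expand_acronyms` are all hypothetical; the actual rule set is the one described in [9].

```python
import re

# Hypothetical seed entries; the real database is not reproduced here.
KNOWN = {
    "PJ": "Polícia Judiciária",
    "PSP": "Polícia de Segurança Pública",
}

# Assumed rule: a standalone run of 2 to 6 capital letters is a candidate.
CANDIDATE = re.compile(r"\b[A-Z]{2,6}\b")


def expand_acronyms(text, known, pending):
    """Expand known acronyms; collect unknown candidates in `pending` for later confirmation."""
    def handle(match):
        term = match.group(0)
        if term in known:
            return known[term]   # already on the list: expand it
        pending.add(term)        # not on the list: keep it and flag it
        return term
    return CANDIDATE.sub(handle, text)


pending = set()
print(expand_acronyms("A PJ contactou a XYZ.", KNOWN, pending))
print(pending)
```

Keeping unknown candidates in a separate set mirrors the idea that new entries are only added to the database once confirmed.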
A Java class was developed to convert the documents to the XML formats detailed in Section 3. The class and the corresponding methods are available in the GitHub repository.

Methods
This section details the methods applied to collect, process, and use the data. More specifically, it describes the processing of PGdLisboa and criminal news, as well as criminal investigation reports.

Online Criminal News and PGdLisboa Articles
Online newspapers are a privileged medium to spread crime-related news, where actors and facts are identified and described. They are an open source of knowledge available in a wide set of languages and an interesting way to enrich the dataset. Despite the restrictions imposed by the criminal domain, namely the issues related to data and investigation confidentiality, these documents are worth collecting and including in the dataset, due to the following main reasons:
• the narrative is similar to the one observed in police investigation reports;
• the use of entities to describe the crime, such as individuals' names, is also part of criminal news; and
• the use of terms that obfuscate the entities, such as personal names being replaced by "suspect", is also identified in these documents.
Listing 1 details the XML schema that was used for online criminal news and for the PGdLisboa articles. The content of the online criminal news and PGdLisboa online articles follows a well-known and easily recognizable template. The "element name" XML tag was used to annotate the most relevant data, namely document name, title, author(s), publication date, and the text itself. The XML files related to the documents extracted from the online news and PGdLisboa articles are available in the GitHub repository (folder /Data Set/Data Collection/).
Table 3 describes the different types of the "element name" XML tag that were processed in the criminal news and PGdLisboa articles and depicted in Listing 1.
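As an illustration of how a news article with these fields might be serialized, the Python sketch below builds one XML document with the standard library. The element names `documentname`, `authors`, and `publicationdate` appear in the paper; the root element `document` and the names `title` and `text` are assumptions, and the exact schema is the one given in Listing 1 and the repository.

```python
import xml.etree.ElementTree as ET


def build_news_document(name, title, authors, date, text):
    """Serialize one article to XML; 'document', 'title' and 'text' are assumed tag names."""
    root = ET.Element("document")
    ET.SubElement(root, "documentname").text = name
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "authors").text = "; ".join(authors)
    ET.SubElement(root, "publicationdate").text = date
    ET.SubElement(root, "text").text = text
    return ET.tostring(root, encoding="unicode")


xml_text = build_news_document(
    "doc1", "Some title", ["A. Author", "B. Author"], "2021-06-01", "Body text."
)
print(xml_text)
```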

Criminal Investigation Reports
The criminal investigation reports detail the information collected during an investigation, each one possessing one or more documents. These documents are usually closed and with restricted access (classified), which brings additional challenges to the documents' analysis. Listing 2 depicts the XML schema applied to the criminal reports, where it is possible to identify the "element name" XML tags used to extract the most relevant content. The layout is similar to the one used by the criminal and PGdLisboa news processing (Section 3.1).
Listing 2: Criminal investigation reports - XML Schema.
The layout structure was extracted by manually analyzing the reports and the set of tags that were used for text labeling. The document name, author(s), publication date, process identifier number (internal police number to identify each report), title, and document body are defined as annotation sections in the documents. The documents were anonymized to omit persons, phone numbers, and other confidential data (Section 3.3). Images were also excluded from the documents' preprocessing and analysis. Table 4 describes the distinct "element name" tags that were identified in the XML schema.

Anonymization
In order to address privacy and data protection concerns, the documents related to the criminal investigation reports were manually anonymized to remove all Personally Identifiable Information (PII), such as name, address, phone number, license plate, and other personal information. The following tags were defined to identify entities that have to be anonymized in the criminal investigation reports: PERSON, LOCATION, NUMERIC, ORGANIZATION, TIME/DATE, LICENSEPLATES, and PHONENUMBER. For each predefined tag, a sequential number was appended at each occurrence in the text. Below, we present a sentence that illustrates the output provided by the anonymization task in a criminal investigation report:
In Portuguese: "Na sequência das detenções efectuadas, foram o PERSON01 e a PERSON02 presentes à Justiça".
In English: "Following the arrests made, PERSON01 and PERSON02 were brought to justice".
This way, the official documents were de-identified, by changing names, places, and other PII. Notwithstanding, the criminal investigation reports have already become res judicata and are publicly available upon a proper request to the authorities for full access. It is worth noting that, despite the anonymization, the documents kept their original context.
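The anonymization itself was done manually, but the tagging convention (a tag followed by a two-digit sequential number) can be sketched in Python. The function `anonymize`, the entity dictionary, and the placeholder names "Fulano" and "Beltrana" are made up for illustration; the sketch also assumes that repeated mentions of the same string reuse the same placeholder, which the paper does not state explicitly.

```python
import re


def anonymize(text, entities):
    """Replace each listed surface string with TAG + two-digit sequence number.

    `entities` maps a surface string to its tag, e.g. {"Fulano": "PERSON"}.
    Matching is plain substring matching, so it is only a sketch.
    """
    if not entities:
        return text
    # Longest strings first, so a full name wins over a shorter overlapping one.
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    counters, assigned = {}, {}

    def replace(match):
        surface = match.group(0)
        if surface not in assigned:
            tag = entities[surface]
            counters[tag] = counters.get(tag, 0) + 1
            assigned[surface] = f"{tag}{counters[tag]:02d}"
        return assigned[surface]

    return pattern.sub(replace, text)


sentence = "Na sequência das detenções efectuadas, foram o Fulano e a Beltrana presentes à Justiça"
print(anonymize(sentence, {"Fulano": "PERSON", "Beltrana": "PERSON"}))
```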

Named-Entities Annotation
NLP tools and frameworks, such as those that use Named-Entities Recognition (NER) processing, take advantage of Named-Entities (NE) that have been identified and annotated, such as persons, locations, and license plates.
The crime-related documents were manually annotated by applying the XML template illustrated in Listing 3. The "element name" XML tags were used to extract the most relevant content of the documents. Some examples are documentname, authors, and publicationdate. The documents are available in the GitHub repository, in the folder /Data Set/NER/Criminal-Related Documents NE Annotated. Each sentence has an identifier and a list of names and entities that are eligible to be annotated. Listing 4 depicts the sentences analyzed by the XML processing. In this example, sentence 1 (identified by the tag <sent1>) has two annotated entities: a person name and a number.

Narcotics Corpus
A specific corpus was built to accommodate the terms intrinsically related to narcotics in the Portuguese language. The following assumptions were made when extracting terms related to this type of crime:
• narcotics are mentioned by their official designation as well as by the one used on the street, through slang; and
• drug trafficking is one of the most reported and typified crimes investigated by the criminal police [12].
To the best of the authors' knowledge, there are no annotated texts related to the narcotics crime domain in the Portuguese language. To overcome this limitation, a manual annotation was made, by labeling the correct entities using a narcotics list, with current official and street names. The corpus is available in the GitHub repository (folder Data Set/NER/Narcotics). It was built by extracting texts from daily newspapers and blogs that mention narcotics terms. The sentence below illustrates how the documents have been annotated using the Apache OpenNLP (https://opennlp.apache.org/ (accessed on 25 June 2021)) tool notation.
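As a sketch of that notation, the Python function below wraps known terms in the `<START:type> ... <END>` markers used by the Apache OpenNLP name finder training format, with whitespace-separated tokens. The example sentence, the term list, and the entity type name `narcotics` are made-up illustrations, not taken from the corpus.

```python
def to_opennlp(tokens, terms, entity_type="narcotics"):
    """Mark known terms in a tokenized sentence using OpenNLP name finder notation.

    `terms` is a set of tuples of lowercase tokens, so multi-word names are supported.
    """
    max_len = max((len(term) for term in terms), default=0)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest possible match starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            window = tuple(tok.lower() for tok in tokens[i:i + n])
            if window in terms:
                span = " ".join(tokens[i:i + n])
                out.append(f"<START:{entity_type}> {span} <END>")
                i += n
                break
        else:
            # No term starts here: copy the token unchanged.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


tokens = "Foram apreendidas doses de cocaína e haxixe .".split()
print(to_opennlp(tokens, {("cocaína",), ("haxixe",)}))
```

A list-driven pre-annotation like this would still need manual review, since street names for narcotics are often ambiguous with everyday words.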

5W1H Annotation
This section introduces a dataset to help researchers who intend to use the 5W1H approach (Who, What, Whom, When, Where, How) to extract semantic information from sentences [13]. This approach was introduced by Griffin [14] and is widely used in journalism. In criminal investigation, the same methodology is applied by the investigators, as they seek to answer the 5W1H questions to analyze the facts and further identify the criminals [15]. The 5W1H methodology provides facts about a criminal document by answering these six questions. An annotation scheme was defined to extract the 5W1H information from the documents. This annotation scheme can be used by supervised learning algorithms. Several research works apply the 5W1H approach with an annotated corpus in the English language [13,16,17], reinforcing the need to have a similar corpus for the Portuguese language. The annotated documents are available in the GitHub repository, in the folder Data Set/5W1H.

Technical Validation
The corpus presented in this paper was evaluated with the machine learning models implemented on the Apache OpenNLP platform and a prototype developed in Java [9]. The perceptron algorithm was used to evaluate the named-entities recognition, the 5W1H extraction model, and the narcotics terms extraction. The results obtained with NE recognition are summarized in Table 5. The experiments were conducted over the crime-related documents dataset. Processing the criminal news, PGdLisboa articles, and criminal investigation reports yielded an average precision of 0.808, recall of 0.722, and F1-score of 0.733. These results illustrate the correctness of the classifier in identifying the named-entities that are annotated in the dataset [9].
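For readers who want to reproduce such scores on their own predictions, the Python sketch below computes precision, recall, and F1 over sets of annotated entities. Representing an entity as a `(start, end, type)` tuple with exact-match scoring is an assumption made for illustration; the paper does not specify the matching criterion.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scores over two collections of (start, end, type) entity tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = {(0, 2, "PERSON"), (5, 6, "LOCATION")}
predicted = {(0, 2, "PERSON"), (7, 8, "NUMERIC")}
print(precision_recall_f1(gold, predicted))
```

Note that an averaged F1 need not equal the F1 of the averaged precision and recall: 2 · 0.808 · 0.722 / (0.808 + 0.722) ≈ 0.763, whereas the reported mean F1 is 0.733, which is consistent with the scores being averaged per category (an assumption, as the averaging scheme is not detailed here).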
The 5W1H Information Extraction Method was evaluated using a set of 20 crime-related documents, annotated by external contributors. Table 6 summarizes the performance evaluation obtained with the proposed set, namely precision, recall, and F1-score. The corpus for Narcotics was also evaluated with the ML perceptron algorithm, and the results are summarized in Table 7. The corpus was evaluated by applying a 10-fold stratified cross-validation method: the dataset was divided into 10 equal parts and, for each run, nine parts were used to train the model and the remaining one to test it. The average results obtained for precision, recall, and F1-score are 0.784, 0.768, and 0.771, respectively. The experiments were made with the Apache OpenNLP platform, and the Narcotics corpus and scripts were uploaded into a GitHub repository (folder Data Set/NER/Narcotics/Narcotics Classifier) [9].

Table 7. Narcotics dataset evaluation with 10-fold cross-validation.
Precision    Recall    F1-Score
0.784        0.768     0.771

To the best of the authors' knowledge, the exploratory dataset that was delivered and made available is the first comprehensive approach to have a dataset in the Portuguese language related to the criminal domain. It should benefit ML and NLP practitioners in benchmarking models and frameworks for Portuguese language processing.
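The 10-fold stratified cross-validation used for the Narcotics corpus can be sketched in plain Python as follows. The function names and the binary label (for instance, whether a sentence contains a narcotics term) are assumptions for illustration; the sketch is deterministic, whereas a real run would normally shuffle the items first.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Deal item indices into k folds so each fold keeps roughly the same label mix."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_label.values():
        for index in indices:
            folds[position % k].append(index)  # round-robin across folds
            position += 1
    return folds


def cross_validation_splits(labels, k=10):
    """Yield (train_indices, test_indices): k-1 folds train the model, one fold tests it."""
    folds = stratified_folds(labels, k)
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


labels = [1] * 20 + [0] * 80  # hypothetical: 20 sentences with a narcotics term, 80 without
for train, test in cross_validation_splits(labels):
    print(len(train), len(test), sum(labels[i] for i in test))
```

Precision, recall, and F1 would then be computed on each test fold and averaged over the 10 runs.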
For future work, the dataset will be continuously updated with more anonymized criminal reports and online news articles. Curation tasks will be continuously applied to enrich the quality of the dataset and enhance the performance of the learning methods. Data curation encompasses several challenging sub-problems to keep a dataset available and of high quality for use by data science researchers. These tasks are intrinsically related to: (1) the increasing volume of the dataset, which needs to have a track changes log; (2)