Multi-Layout Invoice Document Dataset (MIDD): A Dataset for Named Entity Recognition

The day-to-day operations of an organization produce a massive volume of unstructured data in the form of invoices, legal contracts, mortgage processing forms, and many more. Organizations can utilize the insights concealed in such unstructured documents for their operational benefit. However, analyzing and extracting insights from numerous, complex unstructured documents is a tedious task. Hence, research in this area encourages the development of novel frameworks and tools that can automate key information extraction from unstructured documents. The availability of standard, high-quality, annotated unstructured document datasets remains a serious obstacle to this goal. This work expedites researchers' efforts by providing a high-quality, highly diverse, multi-layout, annotated invoice document dataset for key information extraction. Researchers can use the proposed dataset for layout-independent invoice document processing and to develop artificial intelligence (AI)-based tools that identify and extract named entities in invoice documents. Our dataset includes 630 invoice document PDFs with four different layouts collected from diverse suppliers. To the best of our knowledge, our invoice dataset is the only openly available dataset comprising high-quality, highly diverse, multi-layout, annotated invoice documents.


Summary
Forbes statistics [1] state that the amount of data produced through daily transactions is very high, and 80% of this daily generated data is unstructured. Unstructured data include Portable Document Format (PDF) files, emails, official letters, and many more.
Unstructured data are a valuable asset to an organization, as a great deal of information is hidden in them. If an organization extracts these key insights and uses them in its decision-making process, it can significantly increase its operational efficiency [2]. However, manually processing and extracting key insights from such numerous and complex unstructured documents is naturally time-consuming and error-prone. Hence, developing an artificial intelligence (AI)-enabled tool for automatic key information extraction from unstructured data is a promising and growing research focus [3]. However, research on automatic extraction of key insights from unstructured documents faces certain key challenges [4,5]. One of the most fundamental and critical challenges is obtaining a high-quality, standard, annotated unstructured document dataset. The dataset is the most important input for machine learning model training: model robustness and accuracy depend on what the model learns from the training data. Therefore, it is always necessary to include variation in the training data so that the model learns to recognize unseen data. Standard or publicly available datasets in this research area face the following key challenges:
• Publicly available datasets consist of poor-quality, blurred, skewed, and low-resolution document images, leading to poor text extraction [6].
• Publicly available datasets are obsolete, consisting of old or outdated document formats [7].

• Publicly available datasets are domain-specific and task-specific. For example, the datasets proposed in [8,9] are used for the healthcare domain, and those proposed in [10,11] are used for legal contract analysis. In addition, they are used for specific tasks such as metadata extraction from scientific articles [12] or patient detail extraction from a clinical dataset [13].

• Publicly available datasets are unlabelled; manual labeling or annotation of a dataset is time-consuming and tedious [4,5].
Due to these challenges, a few research studies proposed custom datasets. However, custom datasets are private and face confidentiality issues [4,14]. They also tend to include documents with similar layouts or formats, which restricts the generalizability of key extraction tasks. A few recent studies highlight the significance of template-free processing of unstructured documents [15,16]. Therefore, to encourage and advance automatic key information extraction research, we developed a dataset. Figure 1 shows an overview of the key information extraction task from an invoice document. Our multi-layout invoice document dataset (MIDD) contains 630 invoices with four different layouts from different suppliers. The dataset consists of high-quality document images, which leads to high accuracy in text extraction; its varied and complex document layouts also help generalize AI-enabled models.
Table 1 summarizes our MIDD dataset specifications. The main objectives of this work are as follows:
• To provide annotated invoice documents with varied layouts in IOB format, so that researchers working in this domain can identify and extract named entities (named entity recognition) from invoice documents. Obtaining a high-quality, sufficiently annotated corpus for automated information extraction from unstructured documents is the biggest challenge researchers face.

• To overcome the limitations of the rule-based and template-based named entity extraction approaches traditionally used in information extraction from unstructured documents. Template-free processing is the key to processing and managing the huge pile of unstructured documents in the current digitized era.

• To provide varied invoice layouts so that researchers can develop a generalized AI-based model trained on various unstructured invoice layouts. The resulting structured output can later be integrated into an organization's information management applications and used in the decision-making process.

Related Datasets
The research studies [6,16–18] proposed key field extraction from a scanned receipt dataset named the ICDAR (International Conference on Document Analysis and Recognition) SROIE-2019 dataset. It contains 1000 scanned receipt images with similar layouts, including 876 annotated receipts with labels such as company name, address, receipt date, and total amount. The research study [19] used the RVL-CDIP dataset, which includes scanned document images of different categories, invoices among them, with 25,000 images per category. However, that dataset is obsolete and consists of poor-quality scanned documents.
A few research studies [20,21] built a custom invoice dataset for key field extraction tasks. However, these datasets are not publicly available to researchers due to privacy and confidentiality issues in invoice documents.
A few research studies [8–10,12] proposed information extraction tasks on various domain-specific and task-specific datasets, such as the i2b2 2010 (Informatics for Integrating Biology and the Bedside) clinical notes dataset, the MIMIC-III (Medical Information Mart for Intensive Care) dataset, a custom-built legal contract document dataset, and the GROTOAP2 dataset.

Data Description
To the best of our knowledge, our dataset is the first publicly available multi-layout invoice document dataset. The proposed MIDD dataset includes invoices of different layouts collected from different supplier organizations. Developing generalizability and model robustness are the main aims of collecting highly diverse and complex invoice layouts. In addition, researchers may draw training and testing samples from the dataset as per their requirements. Table 2 presents the details of the varied layouts and the total number of invoice document PDFs collected for each layout. Figure 2 shows the detailed process of our multi-layout invoice document dataset creation.

Data Acquisition
Invoices from different supplier organizations were acquired to obtain variation in invoice layouts. All supplier invoices were scanned in PDF format using an HP LaserJet M1005 MFP printer and scanner (HP: Pune, India). Each supplier has its own unique invoice layout or format; these varied and multiple layouts will later help researchers with template-free processing of various unstructured documents. As shown in Table 2, four different invoice layouts from different supplier organizations were collected: 196 scanned invoice PDFs for layout 1, 29 for layout 2, 14 for layout 3, and 391 for layout 4, giving 630 scanned invoice PDFs in total.
Figure 2 illustrates that, as scanned invoice PDFs are collected, each PDF is first converted into an image, because an optical character recognition (OCR) engine takes an image as input for text detection and extraction. The text files obtained after OCR are then input to the data annotation tool. The Google Vision OCR (version 2.3.2) engine is used for text extraction, and the UBIAI annotation tool is used for manual annotation. Google Vision OCR output was validated manually by cross-verifying the extracted text against the original text contents: roughly 30% of the invoices from each layout were checked against the original scanned PDFs to evaluate OCR accuracy, and Google Vision OCR was observed to give 90% text extraction accuracy for each supplier invoice layout. The MIDD dataset was also evaluated with AI approaches such as BiLSTM and BiLSTM-CRF for key field extraction tasks [22]. Figure 3 shows sample invoices of all four layouts from our dataset, with different key fields highlighted in different colors to illustrate the meaning of "different layouts of invoices."
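The per-layout validation described above (manually checking roughly 30% of the invoices from each layout) can be reproduced with a simple stratified sample. The sketch below assumes hypothetical file IDs of the form `layout_1_001`; the layout counts are those reported for MIDD.

```python
import math
import random

# Invoice counts per layout, as reported for the MIDD dataset (total 630).
LAYOUT_COUNTS = {"layout_1": 196, "layout_2": 29, "layout_3": 14, "layout_4": 391}

def sample_for_validation(counts, fraction=0.30, seed=42):
    """Pick roughly `fraction` of invoice IDs from each layout for manual OCR checks."""
    rng = random.Random(seed)
    sample = {}
    for layout, n in counts.items():
        ids = [f"{layout}_{i:03d}" for i in range(1, n + 1)]  # hypothetical file IDs
        k = math.ceil(n * fraction)          # round up so small layouts are covered
        sample[layout] = sorted(rng.sample(ids, k))
    return sample

validation_set = sample_for_validation(LAYOUT_COUNTS)
print({k: len(v) for k, v in validation_set.items()})
```

Sampling per layout (rather than over the whole pool) ensures that even the smallest layout (14 invoices) contributes enough documents to the OCR accuracy check.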
For example, the invoice number key field, shown with a green rectangular bounding box, is positioned at a different location in each of the four invoice layouts collected from different suppliers. Likewise, all other invoice key fields change position as the supplier organization changes. Thus, our dataset is useful for the real-world situation in which each organization has its own unique format or layout of unstructured documents. Table 3 summarizes further statistical information on the proposed dataset. The invoices collected for MIDD belong to a construction firm that has different suppliers for different material purchases.

Data (Named Entities) Annotation
The converted text files are manually annotated. UBIAI is a convenient, simple-to-use text labeling/annotation tool for most natural language processing (NLP) tasks, such as named entity recognition (NER). After selecting the "free version" of the UBIAI package under the "Package" option, all of the chosen labels for invoice entities (for example, INV_L for the invoice label and INV_DT for the actual invoice date value) are supplied manually through its interface. After the labels are provided, the converted text files are loaded manually through either the "drag and drop" or the "browse" facility of the UBIAI interface. Each loaded text file can then be annotated by choosing a label and highlighting the corresponding text within the complete text file, as shown in Figure 4. UBIAI also provides an annotation export facility in multiple formats, such as spaCy, IOB, and JSON. As shown in Figure 2, the text files obtained after OCR are provided to UBIAI along with the labels to be annotated. Table 2 shows the names of the labels used to annotate the text files. Finally, the annotated text file of each invoice is exported in IOB format. IOB labels resemble part-of-speech (POS) labels, but they signify whether a token lies at the beginning of, inside, or outside an entity. In NER, every word in the text file, also called a "token," is labeled with an IOB tag, and adjacent tokens are then joined together depending on their tags. The IOB files are later transformed into .csv files. Figure 5 shows a sample IOB file of an invoice: the first column holds the "token," and the second column holds the IOB tag of the respective token. Figure 6 shows the data pre-processing steps carried out to obtain the pre-processed data.
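The joining of adjacent tokens by their IOB tags can be sketched as a short routine over the two-column rows described above. The example rows below are hypothetical, reusing the paper's INV_DT label; the tag scheme (B-/I-/O) is the standard IOB convention.

```python
def iob_to_entities(rows):
    """Group adjacent (token, tag) rows into (label, text) entities.

    `rows` mirrors the two-column IOB .csv: one token per row, tagged
    B-<LABEL> (beginning), I-<LABEL> (inside), or O (outside any entity).
    """
    entities, current_label, current_tokens = [], None, []
    for token, tag in rows:
        if tag.startswith("B-"):                      # a new entity begins
            if current_label:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)              # entity continues
        else:                                         # "O" ends any open entity
            if current_label:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label:                                 # flush a trailing entity
        entities.append((current_label, " ".join(current_tokens)))
    return entities

# Hypothetical rows for an invoice date field:
rows = [("Invoice", "O"), ("Date", "O"), (":", "O"),
        ("12", "B-INV_DT"), ("Jul", "I-INV_DT"), ("2021", "I-INV_DT")]
print(iob_to_entities(rows))  # [('INV_DT', '12 Jul 2021')]
```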

Data Pre-Processing
Figure 6. Data pre-processing steps.
Stop-word removal: stop-words such as the English articles (the, an, a) and conjunctions (or, and) are removed from the data. Stop-words consume memory and processing time and contribute nothing to the meaning of a sentence, so they are removed.
Lowercasing: all tokens are converted to lowercase to obtain consistent output.
Duplicate and blank-row removal: all the duplicate rows and blank rows are manually removed from the .csv files using the "Remove rows" option in Microsoft Excel.
Non-alphanumeric character removal: a few non-alphanumeric characters, such as "(", are removed from the data files. Others, such as ":", are kept because they form part of fields such as Invoice Date.
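The pre-processing steps above can be sketched as a single pass over the (token, tag) rows. This is a minimal illustration, not the authors' pipeline: the stop-word list is the small set named in the text, and the duplicate-row removal (done in Excel in the paper) is approximated with a seen-set.

```python
import string

STOP_WORDS = {"the", "an", "a", "or", "and"}  # articles and conjunctions, per the text
KEPT_PUNCT = {":"}                            # kept: part of fields like Invoice Date

def preprocess(rows):
    """Lowercase tokens, drop stop-words, blank rows, duplicate rows, and
    unwanted punctuation-only tokens from (token, tag) rows."""
    seen, cleaned = set(), []
    for token, tag in rows:
        token = token.lower().strip()
        if not token or token in STOP_WORDS:          # blank row or stop-word
            continue
        if all(ch in string.punctuation for ch in token) and token not in KEPT_PUNCT:
            continue                                  # e.g. "(" is dropped, ":" kept
        if (token, tag) in seen:                      # duplicate-row removal
            continue
        seen.add((token, tag))
        cleaned.append((token, tag))
    return cleaned

rows = [("The", "O"), ("Invoice", "O"), ("Date", "O"), (":", "O"),
        ("(", "O"), ("", "O"), ("Invoice", "O")]
print(preprocess(rows))  # [('invoice', 'O'), ('date', 'O'), (':', 'O')]
```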

Practical Applications/Use-Cases of MIDD
The currently available literature mainly focuses on information extraction from receipts with a similar format or layout. In a practical scenario, however, organizations receive invoices from various suppliers, each with its own unique structure or layout. Publicly available datasets lack an invoice document dataset with varied invoice layouts.


• The proposed MIDD dataset has many practical implications for extracting named entities as structured output from the huge pile of unstructured invoice documents. In addition, end-to-end automation of the invoice information extraction workflow helps the accounting department of every organization to process invoices quickly and to verify accounts payable and receivable.
• Automated key field extraction from financial documents such as invoices improves business performance through faster customer onboarding and verification processes. It can significantly reduce the cost of manual data entry and verification for the thousands of invoices received daily.

Conclusions
The proposed work developed a multi-layout invoice document dataset consisting of 630 invoices with four different layouts from different supplier organizations. It contributes to this research area by providing a highly diverse, high-quality, annotated dataset that is very useful for natural language processing tasks such as named entity recognition (NER). The dataset size and contents are sufficient to build a generalized AI-based model.