1. Introduction
Extracting information from semi-structured documents is essential in many business processes, as organizations often need to automatically retrieve data to feed their Enterprise Resource Planning (ERP) systems. This task is typically performed manually, which makes it time-consuming and prone to errors [
1]. This problem has been a topic of interest in machine learning and data science for many years [
2]. The complexity of invoice formats and the variability in data presentation make this a challenging task.
Electricity invoices, in particular, are business documents that contain a wide variety of information. These documents are particularly intricate due to the complexity of electricity markets. These markets usually serve millions of customers, involve multiple stakeholders, such as marketers and distributors, and are subject to specific national regulations. As a result, electricity bills contain diverse and abundant data. Therefore, the main difficulty arises from documents with varying contents and layouts.
Several commercial applications, such as DocParser [
3], ITESOFT [
4], or ABBYY [
5], may facilitate the task of information extraction. They allow invoice information to be retrieved automatically, although they typically require the users to configure template bills manually. This information is incorporated into databases, which later permit the identification and processing of similar invoices. Nevertheless, this approach does not usually scale well to new types of documents. Furthermore, current machine learning techniques are often designed to recover just a few generic fields, such as invoice numbers or total amounts, so even though commercial solutions exist, this problem is not yet solved.
There are several types of methods for extracting information from invoices. The most traditional ones rely on user-defined templates [
1] and predefined rules [
4]. However, these techniques typically require templates to be defined by hand and do not adapt well to new types of documents. On the other hand, some methods are based on custom-designed features and machine-learning techniques [
6,
7]. They usually adapt well to a given set of templates and do not need a large dataset to train on.
Another type of method relies on classification through Named Entity Recognition (NER) [
8], usually combined with neural networks [
9,
10]. Recent advancements in deep learning have led to the use of more complex models, such as Trie [
11] and LayoutLM [
12], which integrate visual and textual information, addressing the challenge of handling both text and images in a unified framework. This can improve accuracy and adaptability because of the visual structure of invoices. However, a drawback of these methods is that they need to be trained on large datasets and it is often complex to adapt the samples to the format required by these neural networks. Most methods are designed for processing simple invoices and extracting a few fields, such as the invoice number or the total amount. Adapting them for more complex documents and incorporating additional labels poses a considerable challenge.
Our work is based on a database of electricity invoices, called IDSEM [
13], containing 75,000 invoices. This dataset was generated randomly using information from the Spanish electricity market, based on regulations and statistical data. It uses eighty-six distinct labels, significantly more than similar datasets.
The main goal of our paper is to explore and compare different algorithms for extracting information from electricity invoices, which are challenging to process due to their complex structure and varied formats. We propose a method for extracting information from invoices using a bag-of-words approach and standard machine-learning techniques. The bag-of-words approach relies on Term Frequency and Inverse Document Frequency (TF-IDF) features [
14]. By analyzing the frequency of words, we can determine their importance within the context of the invoice. Then, we characterize each word based on its format: whether it is numeric or alphanumeric, has a predefined format (e.g., money or postal codes), is a measurement unit (e.g., kW or €/kW), includes capital letters, or contains punctuation marks.
This study evaluates the performance of various classification techniques, including Naive Bayes, Logistic Regression, Support Vector Machine (SVM), and Random Forests. Initially, invoices are converted from PDF to raw text. Then, we create lists of eleven-word sentences using a sliding window, with consecutive sentences differing in the first and last words. These sentences are passed through a pipeline composed of TF-IDF, our custom-designed features, and a classification technique. The assigned label corresponds to the central word of each sentence; therefore, we obtain a classification for every word in the document.
The experimental results provide a comparative analysis of the performance of our model using different machine-learning techniques. They show that Support Vector Machines achieve the highest precision (91.86%), closely followed by Random Forests (91.58%). The study also examines the performance of methods on documents similar to the training set and on completely new documents. The results indicate that Random Forests show better adaptability to new documents.
Most labels are classified with high precision, with over 41% of labels having precision rates exceeding 99% and another 31.3% with precision rates above 90%. The experiments also investigate the benefits of using the bag-of-words approach and our custom features. The successive integration of our custom features proves crucial for enhancing the performance of methods. Combining both types of features provides the best results for the eleven-word sentences.
The main contributions of this work can be summarized as follows: (1) we propose a new model based on TF-IDF and custom-designed features for extracting a large number of key values from invoices; (2) we transform a complex dataset into a text-based format that is suitable for classifying the labels of these documents; (3) our proposed custom-designed features provide a high classification rate for the contents of electricity invoices; (4) we analyze the performance of different machine-learning techniques based on this model; (5) we carry out a thorough analysis of the benefits of each type of feature and a detailed evaluation of the methods regarding each label and template. We also discuss the advantages and drawbacks of our approach and provide some guidance to improve it.
This model offers several advantages over existing methods. Primarily, it is designed to handle documents with a large number of fields, which is not typical in the literature where most methods focus on simpler receipts and invoices with few labels, typically fewer than ten. Another advantage is the specificity of electricity invoices. Unlike general information extraction methods, our approach is specifically tailored to handle the content of electricity invoices. These documents contain a wide variety of data from different sources, making them more complex than other invoices. The adaptability of the model to new document types is another advantage, as it uses a general pipeline based on textual information, allowing for easy adaptation without significant changes to the process.
Section 2 describes similar methods and summarizes the state-of-the-art.
Section 3 describes the IDSEM dataset and the conversion of PDF documents into the training data.
Section 4 explains the details of the TF-IDF and our custom features.
Section 5 describes the classification methods and
Section 6 explains the metrics employed and the experimental design. The results in
Section 7 analyze the performance of the methods using standard metrics, such as precision, recall, and F1-scores, paying special attention to the influence of the features and the precision regarding new documents. We discuss the results of the method and some future works in
Section 8.
2. Related Work
Information extraction is an active research field with applications in various domains such as historical document analysis, web content mining, or business document processing. Over the past two decades, researchers have proposed many methods to tackle this problem, with more interest in machine learning techniques nowadays.
Multiple datasets and benchmarks have played an important role in advancing this field. For instance, the SROIE dataset [
2], one of the earlier receipts datasets, contains 1000 scanned receipt images and is used for three main tasks: text localization, OCR, and key information extraction. Another receipts dataset is CORD [
15], which contains thousands of Indonesian receipts, including images and text annotations for OCR, and multi-level semantic labels for parsing. These datasets contributed to the development of many methods but they contain simple receipts with a few annotated fields.
DocILE [
16] is another dataset with over 6700 annotated business documents, manually labeled and synthetically generated. It provides annotations for fifty-five classes, covering various document layouts. More recently, the benchmark proposed in [
17] provided two new datasets, HUST-CELL and Baidu-FEST, that aim to evaluate the end-to-end performance of complex entity linking and labeling, containing various types of documents, such as receipts, certificates, licenses of several industries, and other types of business documents. The IDSEM dataset [
13] that we use in this work contains 75,000 synthetically generated invoices related to electricity consumption in Spanish households. It has a large number and variety of labels. Each invoice includes data about customers, contracts, electricity consumption, and billing that are challenging to classify.
The first methods for extracting information from invoices involved user-defined templates and the use of rules. This was the case of SmartFIX [
1], which was among the earliest systems in production for processing business documents, ranging from fixed-format forms to unstructured letters. The work proposed in [
18] presented a method for automatically indexing scanned documents, reducing the need for manual configuration. The authors leveraged document templates stored in a full-text search index to identify index positions that were successfully extracted in the past. The authors of [
4,
19] proposed an incremental learning approach based on structural templates. The document was recognized by an OCR system and then classified according to its layout. The fields were extracted using structural templates learned from the layout. Information that the users found to be missing was then used to improve the templates in an incremental step. These methods require the users to create new templates manually and check the final results.
Information extraction can also be tackled through NER [
20], a typical step in any NLP pipeline. In this case, words in the text are tagged with specific labels. Some typical examples [
8] relied on Long Short-term Memory (LSTM) neural networks [
9,
21]. The method proposed in [
22] compared two deep learning approaches, the first based on NER using a context of words around each label, and the second based on a set of features. This second method provided similar results with less training data. In the preliminary work presented in [
10], the authors obtained high accuracy for a reduced set of labels using the IDSEM dataset. They also used an NER strategy. The implementation was realized through a neural network based on the Transformer [
23] architecture. These strategies typically work correctly for a few labels and the vocabulary needs to be similar for all documents.
Another group of methods relies on custom-designed features and shallow machine-learning techniques. For example, Intellix [
6] was based on TF-IDF features and K-Nearest-Neighbors (kNN). In our work, we use a similar approach, with a larger variety of features, and analyze the behavior of more classification techniques. Another example was the method presented in [
7], which converted documents to JPEG images and extracted word tokens through an OCR system with their positions. This method relied on a gradient-boosting decision tree model for the classification step. In the experiments, we use Random Forest, another ensemble method based on decision trees. We choose these techniques as they give us the flexibility to choose the most convenient features and adapt well to the problem that we are facing.
More recently, several methods based on deep-learning techniques have been proposed. Various approaches [
24,
25] combined convolutional neural networks (CNN) with a spatial representation of the document using a sparse 2D grid of characters. This allows for understanding the semantic meaning and the spatial distribution of texts. Other methods are based on LSTM networks, like CloudScan [
26], or a combination of CNNs and LSTMs, like in [
11,
27,
28]. The latter presented a multi-modal end-to-end information extraction framework based on visual, textual, and layout features. However, this requires the dataset to be annotated with bounding boxes around the key values, and the IDSEM dataset does not provide this information.
The Transformer [
23] has also been employed in several works, like in XLNet [
29] that integrated ideas from the Transformer-XL [
30] architecture, which enabled learning for longer-length contexts. DocParser [
3] was an end-to-end solution based on a Swin Transformer [
31] comprising a visual encoder followed by a textual encoder. BERT [
32] was also adapted to information extraction in BERTgrid [
33]. The most recent methods [
12,
34,
35] relied on multi-modal pre-training models integrating text and images, or methods that establish geometric relations during pre-training, like in [
36].
These methods are complex and rely on visual annotations to train the models. However, the IDSEM dataset contains textual annotations and is difficult to adapt for training with these networks. In our model, invoices are converted to text format and the features are designed at the sentence level, which makes it a simple and efficient solution.
3. Electricity Invoice Dataset
In this section, we explore the IDSEM dataset, providing some statistics about the labels. Then, we explain the data transformation from PDF invoices into the training sentences.
3.1. IDSEM Dataset
The IDSEM dataset [
13] contains 75,000 electricity invoices in PDF format, with 30,000 in the training set and 45,000 in the test set, for which the ground truth values are not publicly available. Each invoice is defined with eighty-six labels related to customer data, the electricity company, the breakdown of the bill, the electricity consumption, etc. This dataset is available at Figshare [
37].
The labels are defined with a code starting with a letter (A–N) and a number for each label in the group.
Table 1 summarizes the groups of labels of the dataset and their meaning is explained in [
13].
The dataset includes a training directory with six sub-directories for different template documents. Each directory has 5000 invoices in PDF format and their corresponding JSON files with the labels and their true values. The test directory contains nine sub-directories with nine different templates, three of them completely different from the training set. These allow us to study the behavior of the methods on unseen data.
Figure 1 depicts two pages of an invoice from the dataset. Blue boxes indicate the groups of labels and their positions in the document.
Table 2 shows the frequency of occurrence of several labels in the documents of the training set. For example, the label
C1, corresponding to the marketer’s name, appears 564,700 times. Therefore, this information appears multiple times in each invoice. On the other hand, label
M3, corresponding to the equipment rental price, appears 5000 times. In this case, the label appears only in one template of the training set. Additional statistics for all the labels can be found in
Appendix A.
3.2. Data Conversion
Similar to other approaches [
26], our method converts the input data into text using a PDF-to-text converter. The end-of-line character is replaced by a special code,
#eol, to ensure the invoice is treated as a continuous list of words. We generate eleven-word sentences using a sliding window that shifts by one word between consecutive sentences. The label is associated with the central word of the sentence. Our method is complete as it predicts a label for every word in the document.
Using a sliding window approach enables the classification of individual words within a document. While this method can be time-consuming, its efficiency can be significantly improved through parallel processing: by processing the sentences in parallel, the overall execution time can be reduced to approximately that of a single sentence. We chose eleven words per sentence because this length provided the best precision, as shown in the experiments.
We utilize a special label, OO, standing for others, to classify sentences whose central word does not match any label in the dataset. This allows the model to deal with out-of-class words efficiently. The OO label is introduced during the conversion of PDF documents to our internal representation of sentences. This common strategy integrates this type of token into the classification process without modifying the main pipeline, keeping it simple and efficient, and allows the model to generalize to new unclassified words. We generate a training file for each template directory, containing fifty invoices, resulting in six training files with 300 documents.
These files consist of tuples with two fields: one with an eleven-word sentence and another with the label corresponding to the central word. Therefore, the number of sentences coincides with the total number of words. For this process, we use the annotated files of the training set, which include the code of the labels around each key value. We parse the file contents, remove the annotations from the text, and generate the tuples. A fragment of a training file in our format is shown in
Table 3.
In this example, we observe that various sentences refer to a single label value, such as the total amount (J5), the invoice number (F1), or various dates (F4s and F5s), and several sentences refer to the same label, such as the due date (G3). Label OO appears with high frequency since the central word of many sentences does not correspond to any label.
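A minimal sketch of this sentence-generation step is shown below, assuming the per-word labels have already been recovered from the annotated files. The `#pad` token used at the document boundaries and the toy labels in the example are assumptions for illustration; the paper does not specify how the first and last words of a document are handled.

```python
from typing import List, Tuple

WINDOW = 11         # words per sentence
HALF = WINDOW // 2  # context words on each side of the central word


def build_sentences(words: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Pair each eleven-word sentence with the label of its central word.

    `words` is the invoice as a continuous list of tokens (with #eol markers)
    and `labels` holds the ground-truth label of every word ('OO' for
    out-of-class words). Boundary padding with '#pad' is an assumption.
    """
    padded = ["#pad"] * HALF + words + ["#pad"] * HALF
    samples = []
    for i, label in enumerate(labels):
        sentence = " ".join(padded[i:i + WINDOW])
        samples.append((sentence, label))
    return samples


# Toy example: a fragment of an invoice and illustrative word-level labels.
words = "Importe total : 48,35 € #eol Fecha de cargo : 12/05/2021".split()
labels = ["OO", "OO", "OO", "J5", "OO", "OO", "OO", "OO", "OO", "OO", "G3"]
for sentence, label in build_sentences(words, labels)[:3]:
    print(label, "->", sentence)
```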
4. Definition of Word-Level Features
We employ a bag-of-words approach based on TF-IDF [
14] features. These are often used in NLP and information retrieval to assess the importance of a term within a document relative to a corpus of documents. Term Frequency (TF) measures how often a term $t$ appears in a specific document $d$ and is calculated as the ratio of the number of times the term appears in the document to the total number of words in that document:
$$\mathrm{TF}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}},$$
where $f_{t,d}$ is the number of occurrences of term $t$ in document $d$. Document Frequency (DF), on the other hand, assesses how common a term is across the entire corpus and is calculated as the number of documents that contain the term $t$, i.e., $\mathrm{DF}(t) = |\{d \in D : t \in d\}|$. Inverse Document Frequency (IDF) measures the relevance of a term and is computed from the inverse ratio of the total number of documents to the document frequency of the term:
$$\mathrm{IDF}(t) = \log \frac{N}{\mathrm{DF}(t)},$$
where $N$ is the total number of documents. TF-IDF is then measured as $\mathrm{TFIDF}(t,d) = \mathrm{TF}(t,d) \cdot \mathrm{IDF}(t)$. Words with higher TF-IDF scores are considered more significant within a document.
Due to the limited vocabulary of electricity invoices, the resulting TF-IDF vocabulary contains relatively few tokens. We found that using single words and bi-grams improved classification precision in the experiments. Furthermore, the vocabulary obtained by the TF-IDF estimator from the training data contains many irrelevant words. Therefore, we define a dictionary of stop-words to improve the performance of this technique. This dictionary includes articles, prepositions, pronouns, and other frequent but irrelevant words in electricity invoices.
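As an illustration, this TF-IDF step could be configured with scikit-learn's TfidfVectorizer as follows; the stop-word list shown is only a small, illustrative subset of the dictionary mentioned above, and the sample sentences are fabricated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Small, illustrative subset of the Spanish stop-word dictionary
# (articles, prepositions, pronouns, and frequent uninformative words).
STOP_WORDS = ["el", "la", "los", "las", "un", "una", "de", "del", "en",
              "por", "para", "con", "su", "que", "y", "o", "a"]

# Single words and bi-grams over the eleven-word sentences.
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words=STOP_WORDS)

train_sentences = [
    "#eol Importe total de la factura : 48,35 € #eol",
    "Periodo de facturación : del 01/04/2021 al 30/04/2021 #eol",
]
X_tfidf = tfidf.fit_transform(train_sentences)
print(X_tfidf.shape)  # (number of sentences, vocabulary size)
```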
These features are insufficient to achieve high scores due to the limited number of words per sentence and the significant overlap between consecutive sentences. To address this issue, we designed a set of custom features that enable us to generate more distinctive data. One of these features is based on the word type, i.e., whether it contains alphabetic, numeric, or alphanumeric characters, or is a decimal number. Another feature is related to the format of the word, checking if its structure resembles money, email addresses or web URLs, Spanish identity card numbers, postal codes, or date formats. We have also included features that provide information about different units, such as money (Euro symbol and other currency-related words), kilowatt (kW), kilowatt-hour (kWh), day and month periods, and the percentage symbol. Additionally, we have incorporated a feature to identify capital letters, which are often relevant in various contexts. Finally, we have defined another feature to identify punctuation symbols, such as colons, semicolons, or periods, as they typically convey relevant information, especially in tabular data.
These features are tailored to characterize the kind of words that we usually find in electricity invoices and are associated with the information we want to extract. These are specially designed for the Spanish locale, although they can be easily generalized for other regions.
Table 4 summarizes these features.
Therefore, each word in the sentence produces a set of five custom features, which are concatenated with the above TF-IDF features. In the experiments, we evaluate the benefits of computing custom features only for the central word of the sentence or for up to all eleven words in the sentence.
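The following sketch illustrates the five word-level features and how they could be aggregated over a sentence. The regular expressions, the unit list, and the helper names are illustrative and do not cover all the Spanish formats handled in the actual implementation.

```python
import re
import string

MONEY_RE = re.compile(r"^\d{1,3}(\.\d{3})*,\d{2}\s*€?$")   # e.g., 1.234,56 €
DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{4}$")                # e.g., 12/05/2021
POSTAL_RE = re.compile(r"^\d{5}$")                           # Spanish postal code
DNI_RE = re.compile(r"^\d{8}[A-Z]$")                         # Spanish identity card
UNITS = {"kw", "kwh", "€", "€/kw", "€/kwh", "%", "día", "días", "mes", "meses"}


def word_type(w: str) -> str:
    """Alphabetic, numeric, decimal, or alphanumeric word."""
    if w.isalpha():
        return "alpha"
    if w.isdigit():
        return "numeric"
    if re.fullmatch(r"\d+[.,]\d+", w):
        return "decimal"
    return "alphanumeric"


def word_format(w: str) -> str:
    """Predefined formats: money, dates, postal codes, identity cards."""
    for name, rx in (("money", MONEY_RE), ("date", DATE_RE),
                     ("postal", POSTAL_RE), ("dni", DNI_RE)):
        if rx.match(w):
            return name
    return "none"


def custom_features(w: str) -> list:
    """Five features per word: type, format, unit, capital letter, punctuation."""
    return [word_type(w),
            word_format(w),
            "unit" if w.lower() in UNITS else "none",
            w[:1].isupper(),
            any(c in string.punctuation for c in w)]


def sentence_feature_dict(sentence: str) -> dict:
    """Flatten the per-word features of a sentence into one feature dictionary,
    ready for a one-hot encoder such as DictVectorizer."""
    feats = {}
    for i, w in enumerate(sentence.split()):
        t, f, u, cap, punct = custom_features(w)
        feats.update({f"w{i}_type": t, f"w{i}_format": f, f"w{i}_unit": u,
                      f"w{i}_cap": cap, f"w{i}_punct": punct})
    return feats


print(custom_features("48,35"))   # ['decimal', 'money', 'none', False, True]
print(custom_features("kWh"))     # ['alpha', 'none', 'unit', False, False]
```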
5. Classification Methods
In this work, we select the following machine-learning models for classification: Logistic Regression, Naive Bayes, Decision Trees, Random Forest, and SVM. These techniques receive a vector of features for each sentence and produce a label associated with the central word. Therefore, we face a multi-class classification problem, where each input sequence produces a single label.
These are standard methods in machine learning and are typically used in many works. They cover a wide range of techniques with different characteristics. These methods can solve similar problems efficiently and are easy to interpret and implement.
Logistic Regression can be used for multi-class classification using a one-vs-all approach. It uses a sigmoid function that outputs a value between 0 and 1 to model the probability of an instance belonging to a positive class. For multiple classes, separate models are trained for each class. The instance is assigned to the class with the highest predicted probability. We also utilize a low regularization parameter to prevent overfitting.
Naive Bayes is a probabilistic algorithm that assumes feature independence. It calculates the probability that an instance belongs to a specific class. The steps involve computing prior probabilities, class conditional probabilities, and assigning the instance to the class with the highest posterior probability. We rely on the Multinomial Naive Bayes method typically used in text classification and a low regularization parameter.
Decision Trees are decision support tools used for both regression and classification tasks. They create a tree-like structure where each node represents a decision based on the input features. In multi-class classification, decision trees split nodes to predict class labels. The process involves calculating probabilities, branching, and assigning instances to the class with the highest posterior probability. The Gini criterion is used to split the data at the nodes.
Random Forest is an ensemble learning technique that combines multiple decision trees to create robust and accurate models. Predictions from individual trees are aggregated, which reduces overfitting and improves accuracy. They are robust against noisy data and outliers due to the averaging effect of multiple trees. Training individual trees can be parallelized for computational efficiency, making it faster than other methods. In the experiments, we configure Random Forest with six different decision trees, the Gini criterion to measure the quality of a split, and no limits for the depth of the trees.
Support Vector Machines (SVMs) excel in high-dimensional feature spaces and find optimal hyperplanes to separate different classes. In the experiments, we use two versions of SVM: a Linear SVM (SVM Linear), which finds the hyperplane that best separates the classes in the feature space, following a one-vs-all approach, and a Radial Basis Function SVM (SVM RBF), which handles nonlinear data by mapping it to a higher-dimensional space using kernel functions. The latter follows a one-vs-one approach so that each classifier differentiates between two classes. The decision boundary becomes a nonlinear surface in this case. In our experiments, we also use a small regularization parameter.
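A sketch of how these classifiers could be instantiated with scikit-learn is given below; the hyperparameter values are illustrative, since the exact settings used in the experiments are described only qualitatively.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    # One-vs-rest logistic regression; regularization strength is illustrative.
    "Logistic Regression": LogisticRegression(max_iter=1000),
    # Multinomial Naive Bayes, commonly used for text classification.
    "Naive Bayes": MultinomialNB(alpha=0.1),
    # Gini criterion, no depth limit, one sample per leaf.
    "Decision Tree": DecisionTreeClassifier(criterion="gini", min_samples_leaf=1),
    # Six trees, Gini criterion, unlimited depth.
    "Random Forest": RandomForestClassifier(n_estimators=6, criterion="gini",
                                            max_depth=None),
    # Linear SVM with a one-vs-rest strategy.
    "SVM Linear": LinearSVC(),
    # RBF-kernel SVM; multi-class handling is one-vs-one.
    "SVM RBF": SVC(kernel="rbf"),
}

# Fitting any of them follows the usual scikit-learn interface:
# classifiers["SVM RBF"].fit(X_train, y_train)
```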
The time complexity of these algorithms varies depending on the dataset size and the number of features. Logistic Regression and Naive Bayes are less computationally expensive than SVMs and Random Forests, especially on large datasets. Logistic Regression and Naive Bayes are more memory-efficient and computationally less demanding, making them suitable for large datasets with many features. Random Forests and SVMs, particularly with non-linear kernels, are both memory and computationally intensive, especially as the size of the dataset grows. Random Forests require memory to store multiple trees, while SVMs require substantial memory for storing support vectors and the kernel matrix.
6. Metrics and Experimental Design
We use standard metrics to evaluate the performance of the methods. In particular, we rely on the precision, recall, and F1-score for each label and classifier. Since our dataset is unbalanced, with certain labels over-represented (see Table A1), we calculate each metric for every label and then compute the average of the ratios over all the labels.
The metrics are based on the number of true positive classifications, $TP$, true negatives, $TN$, false positives, $FP$, and false negatives, $FN$. The precision represents the rate of positive predictions that are correctly classified and is given by the following expression:
$$\mathit{precision} = \frac{TP}{TP + FP}.$$
The recall is the rate of positive cases that the classifier correctly predicts and is calculated as:
$$\mathit{recall} = \frac{TP}{TP + FN},$$
and the F1-score is the harmonic mean of the precision and recall, given by the following:
$$F_1 = 2 \cdot \frac{\mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}}.$$
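These per-label metrics and their macro average can be computed directly with scikit-learn, as in the following sketch with toy predictions.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy ground-truth and predicted labels for six sentences.
y_true = ["OO", "J5", "OO", "G3", "OO", "F1"]
y_pred = ["OO", "J5", "OO", "OO", "OO", "F1"]

# Macro averaging: compute each metric per label, then average the ratios,
# so over-represented labels such as OO do not dominate the score.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```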
In the first part of the experiments, we establish a global ranking of the methods using these metrics with the test set. Then, we evaluate the performance for each template directory separately, which allows us to understand the behavior of each technique to data similar to the training set and data that are completely new, such as in the last template.
For the best-performing method, we rank the labels according to their precision range. We show the labels that are classified with high precision and those that are more difficult to classify. This allows us to identify the labels that need to be considered for improving the method. We rely on confusion matrices to assess the errors produced between different labels. This complements the previous information and permits us to identify the labels usually classified as others. This can guide us in defining new features to differentiate between those labels.
In the last experiment, we study the influence of features on the accuracy of our method. We explore the benefit of the custom features with and without TF-IDF. We gradually increase the size of the context window by adding one word on the left and right of the central word. Therefore, we consider an odd number of words, ranging from one to eleven, i.e., from five to fifty-five features.
Implementation Details
PDF files were converted into raw text using the
pdftotext program. We removed runs of dots, which typically result from horizontal lines during the conversion from PDF to text files. For convenience, we created a text file for each training and test directory, as explained in
Section 3.2, each containing fifty different invoices.
These files were loaded using the Pandas library, separating the sentences and labels into two dataframes. The training data were split into two sets: 80% for training and 20% for validation. The validation set was used to evaluate the training process and tune the hyperparameters of the methods. Similarly, the test directory was loaded into another pair of dataframes. We carried out a final training step, where the whole training data—including the validation set—were used to train the models. This usually provides slightly better results. The test set was only used at the end to evaluate the models and report the results.
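A minimal sketch of this loading and splitting step follows; the file names, the tab separator, and the column names are assumptions about the internal format of the training files.

```python
import glob
import pandas as pd
from sklearn.model_selection import train_test_split

# One file per template directory, each holding (sentence, label) tuples;
# the naming pattern and separator are illustrative.
frames = [pd.read_csv(path, sep="\t", names=["sentence", "label"])
          for path in sorted(glob.glob("training/template_*.tsv"))]
data = pd.concat(frames, ignore_index=True)

# 80% for training and 20% for validation.
train_df, val_df = train_test_split(data, test_size=0.2, random_state=42)

X_train, y_train = train_df["sentence"], train_df["label"]
X_val, y_val = val_df["sentence"], val_df["label"]
```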
The hyperparameters of the models were tuned by hand. We tested default configurations and checked for the most relevant hyperparameters. In the case of SVM, we tested different kernels and various regularization parameters. Decision Trees and Random Forests were calculated without depth limit and with one sample at each leaf node. For the rest of the models, we tested various regularization parameters.
We created a pipeline that transformed the input data into a new dataframe, with the TF-IDF and custom features, and then called the corresponding machine-learning technique for fitting the classifier parameters to the training data.
In inference mode, the steps of the pipeline are as follows:
Convert the invoice document from PDF to text format;
Create a file with eleven-word sentences from the text file as explained in
Section 3.2;
Compute TF-IDF features for each sentence;
Calculate custom features for the words of each sentence;
Concatenate both types of features;
Feed one of the machine-learning techniques with a dataframe containing these features.
The output is a new dataframe with the detected labels. In a post-processing step, a JSON file can be created using the detected labels and the corresponding central words of every sentence. Labels with multiple words can be easily extracted by concatenating consecutive values in the output data.
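These inference steps can be condensed into a single sketch. It reuses the `build_sentences` and `sentence_feature_dict` helpers sketched in Sections 3.2 and 4, assumes the pdftotext tool is installed, and assumes that the TF-IDF vectorizer, the custom-feature encoder (e.g., a DictVectorizer fitted on the feature dictionaries), and the classifier have already been fitted on the training data.

```python
import subprocess
import pandas as pd
import scipy.sparse as sp


def extract_labels(pdf_path, tfidf, encoder, clf) -> pd.DataFrame:
    """Predict a label for every word of a new invoice (sketch)."""
    # 1. PDF -> raw text, replacing end-of-line characters with the #eol token.
    text = subprocess.run(["pdftotext", pdf_path, "-"],
                          capture_output=True, text=True).stdout
    words = text.replace("\n", " #eol ").split()

    # 2. Eleven-word sentences centred on every word (true labels are unknown
    #    at inference time, so dummy labels are passed to the helper).
    sentences = [s for s, _ in build_sentences(words, ["OO"] * len(words))]

    # 3-5. TF-IDF features and custom features, concatenated column-wise.
    X = sp.hstack([tfidf.transform(sentences),
                   encoder.transform([sentence_feature_dict(s) for s in sentences])])

    # 6. Classify the central word of every sentence.
    return pd.DataFrame({"word": words, "label": clf.predict(X)})
```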
7. Results
In this section, we assess the performance of the methods by comparing their precision, recall, and F1-score on the test set.
Table 5 compares global metrics for each method by first calculating the scores for each label and then computing the average, giving the same weight to each label.
The result of Naive Bayes is poor compared with the other methods, with a recall significantly higher than its precision. Thus, this method produces many false positives, although it detects a larger proportion of true positives. This method is fast, but its scores are low.
The performance of SVM Linear and Logistic Regression is much better, with a precision and recall higher than 80%. In the case of Logistic Regression, the recall is lower than the precision, so this method underestimates some labels. The Decision Tree yields a similar performance, with a balanced precision and recall.
Random Forest and SVM RBF provide similar results, with SVM RBF yielding the best performance of all methods. The main difference resides in a higher recall for SVM; thus, this method tends to detect a larger rate of positive cases.
The performance of each method for every template is displayed in
Table 6. These results demonstrate the capacity of each one to adapt to new data. The precision is high for the first six templates but smaller for the last template, because the training data do not include samples from that template.
Logistic Regression is typically better than Decision Trees and SVM Linear for templates used during training. However, it is worse for invoices with new layouts and vocabulary. On the other hand, although SVM RBF provides the best results for documents already seen by the classifier, Random Forest yields better results for new types of documents. Therefore, the generalization capability of Decision Trees and Random Forests is better than that of the rest of the techniques. Nevertheless, we must note that the number of different templates in this dataset is not high and there is low variability in the training set.
7.1. Precision for Each Label
A closer examination of the results of the best model, SVM RBF, reveals that it achieves high precision for most labels, as shown in
Table 7. The method provides a precision higher than 99% for 41% of the labels and higher than 90% for another 31%. This means that most labels are classified with high precision. Only 6% of labels have a precision smaller than 70%.
There is only one label with a precision smaller than 50%,
K2d, which is related to the access toll rate. Looking at
Table A1, we observe that this label appears 5000 times in the training set, i.e., in only one template. Label
K4d is also classified with low precision and, similarly, appears in one template.
Analyzing the precision of other labels, we realize that it is related to the number of occurrences in the training set. Therefore, we may conclude that the method classifies the labels correctly if they appear in many documents of various templates.
Figure 2 depicts the confusion matrices for the SVM RBF method. Since the number of labels is too large, we have divided the matrix into six parts with eighteen labels each. We observe that the predictions consistently match the ground truths for most labels. The null values in the diagonal correspond to labels that do not appear in the test set.
One of the most notable confusions is between labels K2d and K4d, which are related to marketer prices. These amounts are similar, so this confusion seems reasonable. Additionally, these labels are present in only one template of the training set. Another significant confusion occurs between F4u and F5u, which represent the start and end of the billing period. These labels are similar and only appear in two templates. It is also interesting to note that the OO label, which occurs with high frequency, is hardly ever confused with another label. In this sense, the method seems to learn patterns associated with out-of-class words correctly, which validates our strategy of integrating these types of words in the training data. This is not the case for other techniques, such as Naive Bayes or SVM Linear, where this label presents the largest mispredictions. In these cases, it would be necessary to implement strategies to reduce the impact of imbalanced classes.
7.2. Analysis of Feature Influence
This section examines how the inclusion of different features impacts the precision of the methods. Our previous results were calculated using eleven words per sentence, yielding fifty-five custom features. In the following, we vary the number of words for which custom features are computed from zero to eleven.
In our first experiment, we compare the precision of the methods without using TF-IDF.
Table 8 shows the results when we increase the number of custom features. If we only train with the central word, i.e., five custom features, the precision of all methods is too low, with Decision Tree and Random Forest providing slightly better results. The precision increases significantly with fifteen and twenty-five features, i.e., three and five words, respectively, although this is not sufficient to obtain good performance. With seven to eleven words, Random Forest, SVM RBF, and Decision Tree yield good results, with Random Forest consistently providing the best results.
Table 9 shows the precision of methods using TF-IDF with and without custom features. We observe that Logistic Regression provides the best results for TF-IDF, although the precision is low. The performance of other methods is poor, especially for SVM RBF. Random Forest provides the second-best result. Using five custom features significantly improves the precision in general.
Logistic Regression remains the top-performing method in this scenario. However, the gap with other methods, particularly Random Forest and SVM RBF, has narrowed considerably. These latter methods show a significant improvement in precision. Using more custom features allows for improving the results of the methods, except for Naive Bayes. SVM RBF provides the best results in these cases, followed by Random Forest, Logistic Regression, and Decision Tree.
Comparing the results in
Table 8 and
Table 9, we may conclude that TF-IDF is important when the number of custom features is small and is necessary to attain high precision. SVM RBF, Random Forest, and Decision Tree provide good results if TF-IDF is not used. Decision Tree yields similar results in both cases. It is worth mentioning that Random Forest is slightly better than SVM RBF when not using TF-IDF, whereas SVM RBF is better when we use it.
TF-IDF is much more important for Naive Bayes and Logistic Regression; the latter obtains the greatest gain in precision from these features, ranking third, ahead of the Decision Tree.
This information is also represented in the graphics of
Figure 3 and
Figure 4, respectively. SVM RBF and Random Forest are the best methods for a given set of features in both configurations, followed by Decision Tree. Logistic Regression and Naive Bayes perform much better with TF-IDF. In this setting, Logistic Regression outperforms Decision Tree.
8. Discussion
In this work, we proposed a methodology to extract relevant information from electricity invoices based on text data. We converted the annotated bills, originally in PDF format, into text files. These files were then pre-processed to generate tuples of sentences and their associated labels. Each sentence contained eleven words, with the label associated with the central word. Our method employed a bag-of-words strategy and a set of custom features specially designed for electricity data. We assessed the performance of various machine learning algorithms.
Our results demonstrated that TF-IDF features, used in the bag-of-words strategy, were key for obtaining an acceptable precision. They also showed that our custom features were necessary to improve the results. We demonstrated that the number of words used to calculate features was also important, with seven or more words generally being sufficient to attain high precision.
We studied the performance of various standard machine-learning techniques and found that Naive Bayes did not provide good results, but its precision significantly increased with TF-IDF, although still insufficient. Logistic Regression was the method that benefited the most from TF-IDF.
The techniques that provided the best results were SVM RBF and Random Forest. The latter yielded the best precision with custom features alone. SVM RBF, on the other hand, provided the best results when both TF-IDF and custom features were employed. Logistic Regression stood out from the others using TF-IDF and five custom features. However, TF-IDF alone was not enough to tackle this problem for any method. We may conclude that TF-IDF was important as a baseline for all the methods but the custom features were necessary to achieve higher precision.
Looking at the classification rate of SVM RBF, we observed that most labels were classified with a precision higher than 90% and only two labels obtained a precision lower than 60%. The precision obtained for the OO label validates our strategy of integrating out-of-class words into the training process. We believe that this approach is flexible enough to generalize to new words not belonging to any predefined class. Since our model discriminates based on features extracted from the word and its context, the likelihood of out-of-class words being assigned to one of the predefined classes should be low.
We also assessed the performance of the methods for each template. The precision for templates used during training was significantly higher. In this case, SVM RBF provided the best results for previously seen templates, whereas Random Forest was better for unseen invoices. The difference in precision between seen and unseen data was larger than 30%, which indicates some overfitting to the training data.
This is reasonable because the number of templates in this dataset is small. The variety of data is not high since each template shares a common vocabulary and layout.
Therefore, one strategy to improve the performance of these methods is to increase the number and variety of templates in the dataset. Increasing the number of custom features at the word level is unlikely to improve the results since the precision of most methods does not significantly improve after seven words. It would be more interesting to explore new features related to the position of words inside the documents and their relative position to other words and labels.
This work has several limitations. For example, due to the small diversity of documents in the dataset, the generalization capacity of our method is limited and may produce unsatisfactory results for new invoices. We also note that the dataset contains region-specific invoices and, therefore, the performance of our method will be poor for invoices from other regions.
Regarding the training of the methods, we did not conduct an extensive search for the best hyperparameters. We tuned the hyperparameters by hand, leveraging our domain expertise and understanding of the models. Consequently, each technique might yield slightly better results with other configurations, at the expense of a considerable increase in training and validation time. However, we do not expect a significant change in the results of our study.
In this study, we assessed the performance of our method using several machine-learning approaches, including Logistic Regression and Naive Bayes, which can be seen as baseline models. These models were chosen due to their widespread use and effectiveness in similar tasks, providing a robust benchmark and a comprehensive comparison that highlights the advantages of our approach. It would have been interesting to compare with other state-of-the-art models, but adapting such methods to our dataset requires a significant effort. Future research can build on our results to compare with new or existing methods, further validating our approach.
Our method can enhance the automation of information extraction from electricity invoices, leading to more efficient business processes and reduced manual labor. This can benefit utility companies, marketers, and distributors by improving data accuracy and accelerating customer service response times. This scalable solution can easily be adapted to diverse invoice formats and contents.
9. Conclusions
This work analyzed the performance of various machine-learning techniques for extracting many fields from electricity invoices. These are complex documents that include many different contents and layouts. We relied on a bag-of-words strategy and custom-designed features tailored for these invoices.
We demonstrated that combining TF-IDF and our custom features was necessary to obtain high precision in several methods. The techniques that provided the best results were SVM RBF and Random Forest, although the results of the Decision Tree were also competitive. Logistic Regression was better than Decision Tree when we employed TF-IDF features. We also showed that the precision for already-seen template documents was higher than for new templates. This is reasonable as the dataset contains a reduced number of templates.
In future works, we will increase the number of templates in the dataset. This is important for augmenting the variability of documents with richer vocabulary and layout. We will explore new features based on the position of words and the relative spatial distribution of labels. We are also interested in using deep learning techniques, especially those based on Large Language Models and multimodal architectures.