Article

Receipt Information Extraction with Joint Multi-Modal Transformer and Rule-Based Model

Department of Communications and Computer Engineering, University of Malta, MSD 2080 Msida, Malta
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Current address: Department of Computer Science, University of Oxford, Oxford OX1 3QD, UK.
Mach. Learn. Knowl. Extr. 2025, 7(4), 167; https://doi.org/10.3390/make7040167
Submission received: 9 August 2025 / Revised: 2 December 2025 / Accepted: 9 December 2025 / Published: 16 December 2025

Abstract

A receipt information extraction task requires both textual and spatial analyses. Early receipt analysis systems primarily relied on template matching to extract data from spatially structured documents. However, these methods lack generalizability across various document layouts and require defining the specific spatial characteristics of unseen document sources. The advent of convolutional and recurrent neural networks has led to models that generalize better over unseen document layouts, and more recently, multi-modal transformer-based models, which consider a combination of text, visual, and layout inputs, have led to an even more significant boost in document-understanding capabilities. This work focuses on the joint use of a neural multi-modal transformer and a rule-based model and studies whether this combination achieves higher performance levels than the transformer on its own. A comprehensively annotated dataset, comprising real-world and synthetic receipts, was specifically developed for this study. The open-source optical character recognition (OCR) model DocTR was used to textually scan receipts and, together with the image, provided input to the classifier model. The open-source pre-trained LayoutLMv3 transformer-based model was augmented with a classifier model head, which was trained for classifying textual data into 12 predefined labels, such as date, price, and shop name. The methods implemented in the rule-based model were manually designed and consisted of four types: pattern-matching rules based on regular expressions and logic, database search-based methods for named entities, spatial pattern discovery guided by statistical metrics, and error-correcting mechanisms based on confidence scores and local distance metrics. Following hyperparameter tuning of the classifier head and the integration of the rule-based model, the system achieved an overall F1 score of 0.98 in classifying textual data, including line items, from receipts.

1. Introduction

Automating document processing in accountancy (for tax, expense reporting, and auditing) requires extracting semantic data from receipts and invoices. While some tasks need most of the information present on the receipt, i.e., including line items, others only require a subset, namely the supplier’s name, date, tax, and total amount.
Historically, receipt information extraction systems employed template-matching methodologies. This technique relied on the inherent spatial structure of the documents, proving effective only for sources with highly consistent and predictable layouts. Consequently, these methods demonstrated limited generalizability, necessitating the explicit definition of layout parameters for each unseen document source. A substantial advancement occurred with the integration of convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which enabled the development of models capable of robustly generalizing across diverse document topologies. More recently, the field has transitioned to multi-modal transformer architectures. These contemporary models integrate a combination of textual, visual, and layout features as input, yielding a significant improvement in overall document-understanding capabilities and establishing the current state of the art in receipt analysis.
In this work, we are interested in developing a model that extracts all useful information, including line items from thermally printed receipts. The overall architecture and pipeline of the model is illustrated in Figure 1. The information extraction model consists of the joint use of a pre-trained open-source neural multi-modal transformer (LayoutLMv3 [1]) and a rule-based model developed as part of this work. The rule-based model is used to improve upon the output from the previous stage (LayoutLM). Potentially, it can also be used as a standalone receipt information extraction system. Following hyperparameter tuning and training of the classifier head, the LayoutLM model on its own achieved an overall F1 score of 0.953 on the real-world-receipt test set and an F1 score of 0.98 with the addition of the rule-based model. In addition, following a quantitative and qualitative study of a number of options, the DocTR Optical Character Recognition (OCR) model was selected to scan and obtain the bounding boxes and word tokens, and an ablation study of the various combinations of classification models and data was carried out.
The following is a summary of our contributions: (a) A comprehensively annotated dataset comprising real-world and synthetic receipts was specifically developed for this study. This was necessary since the benchmark datasets SROIE (Scanned Receipts Optical Character Recognition and Information Extraction [2,3]) and CORD (Consolidated Receipt Dataset [4]) are either not fully labelled or have some items redacted. (b) The open-source LayoutLMv3 transformer-based multi-modal model was augmented with a classifier model head, which was tuned for classifying textual data into 12 predefined labels pertaining to receipt information (e.g., date, price, name of shop). (c) A manually defined rule-based model was developed and integrated with the multi-modal transformer (LayoutLMv3) model to further improve upon and ‘error correct’ the tokens classified by the LayoutLM model.
This section introduces the receipt information extraction task and outlines the scope and contributions of this work. The rest of this paper is organized as follows. Related work is reviewed in Section 2. Section 3 and Section 4 describe the development of the dataset and the selection of the OCR model. Section 5 and Section 6 describe the multi-modal transformer and rule-based models. The results are tabulated and discussed in Section 7, and Section 8 concludes this paper.

2. Related Work

In this section, we review related work that tackled receipt information extraction tasks. Receipt information extraction tasks require both textual and spatial analyses of documents represented as images. One way to extract information is to manually hard-code rules, an approach taken by the authors of [5], who used a combination of techniques, such as regular expression matching and spatial search algorithms, in order to extract semantic information from receipts, and the authors of [6] made use of regular expressions in order to establish a baseline for receipt semantic extraction.
The methods described in [5] are designed to extract the transaction date and price or total. The system utilizes natural language processing and statistical techniques. More specifically, keywords associated with categories (e.g., “price,” “total,” “amount”) are identified. A spatial search (nearest-neighbor search) is then conducted for text bounding boxes nearest to the keyword to find the associated value. For price parsing, the system selects the keyword–price pair that is the furthest down the page, as this usually represents the total price for a given receipt image. The system correctly identifies the price 72% of the time when tested on a private test dataset composed of 50 unique receipt images. The results for “transaction date” extraction are not given.
The methods in [6] define functions based on regular expressions and logic to extract eight categories (date, address, price, tax rate, currency, product name, product price, and product amount) of information from thermally printed receipts. In addition, vendor name is extracted by simply taking the text at the top of the receipt. If it is not the Swedish word for receipt or the text’s length is less than two, then the next one is taken instead. The system is tested on a test set of 90 receipts (the large majority of them being Swedish and not publicly available due to privacy regulations) and achieves an overall F1 score of 0.515 (micro-average) and 0.710 (macro-average). Some categories achieve high scores (date = 0.923 and currency = 0.926), whilst the rules for extracting line item data achieve an overall F1 score of 0.127.
On the other hand, early commercial solutions relied on template methods, where the text and spatial rules are coded for specific documents and, therefore, do not generalize across documents with different layouts and styles. Therefore, learning from data via machine learning models is considered to mitigate the generalization problem. Palm et al. [7] made use of recurrent neural networks, specifically long short-term memory (LSTM) networks to develop CloudScan, an invoice information extraction system. This system makes use of positional data alongside textual data tokenized as N-grams. The system was trained and tested on 326,471 samples obtained directly from customer feedback and extracts key information only (i.e., total, tax, date, name of shop). The LSTM model achieved an F1 score of 89.1% (88.7% bag-of-words (BoW) baseline model) on seen forms and 84.0% (78.8% BoW) on unseen forms.
Most datasets used for the receipt information extraction task are not publicly available, mainly due to the difficulty in obtaining consent from various data owners. Nevertheless, the Scanned Receipts Optical Character Recognition and Information Extraction (SROIE) dataset [2,3] was published for benchmarking purposes. The dataset contains approximately 1000 scanned and labelled receipts split into approximately 600 training/validation and 400 testing examples. However, the labelled entities for this dataset are limited to company name, date, address, and total. The Optical Character Recognition (OCR) results are provided with each image, along with the bounding boxes. A second benchmarking dataset is the Consolidated Receipt Dataset (CORD) [4], which contains over 11,000 Indonesian receipts collected from shops and restaurants with entity bounding boxes and labels consisting of five superclasses and forty-two subclass labels. This dataset was created to evaluate post-OCR parsing, but it has also been used to evaluate the extraction of semantic information from documents; it also includes some line items. However, only 1000 of the receipts are publicly available, and sensitive information, such as shop names, addresses, and telephone numbers, are redacted. Nonetheless, these two benchmarking datasets stimulated research in receipt information extraction, which followed advances in transformer networks.
The attention mechanisms within large language transformer models, such as the Bidirectional Encoder Representations from Transformers (BERT) model [8], resulted in significant strides in accuracy and semantic extraction from texts. In addition, the pre-training step improved the performance of models characterized by a large number of parameters. When masking is used to implement the pre-training step, the model is trained to predict the next word and next sentence or whether two sentences can appear next to each other in a text document. Although applying BERT directly to the receipt information extraction task results in poor outcomes [6], various BERT-inspired models have managed to improve the state of the art. One such model is LayoutLM, as proposed by [9]. The authors observed that existing BERT-based models did not properly take advantage of a document’s rich visual information. This led to the implementation of a 2D positional embedding for each token and an image embedding for the token, along with the standard input representation of text embeddings and position embeddings. The layout features are used alongside the textual features during the pre-training task, while all three features (layout/textual/visual) are used in the downstream task. The pre-training objectives follow a principle similar to that of the masked language model (MLM) used to train BERT. In particular, LayoutLM used a Masked Visual-Language Model (MVLM), where a percentage of text tokens is randomly masked out, and the model is tasked with predicting these tokens based on other text tokens and visual and layout information present in the input. The LayoutLM large model (343 M parameters) achieved an F1 score of 95.2% on the SROIE dataset and 94.6% in the case of the smaller model (113 M parameters).
LayoutLMv2 [10] introduced an Image–Text-Matching (ITM) objective, in which the model is tasked with predicting whether pairs of cross-modal text and visual and layout information match each other, thus encouraging better semantic relationships to be learned. The F1 scores for the model are 96.25% for the small model (100 M parameters) and 97.81% for the large model (400 M parameters). Another notable pre-training objective is the Word-Patch Alignment (WPA) objective, introduced in LayoutLMv3 [1], where the model is tasked with predicting whether the corresponding visual patch of a particular text token has been masked. LayoutLMv3 achieved F1 scores of 96.56% (133 M parameters) and 97.46% (368 M parameters) on the CORD dataset. In comparison, LayoutLMv2 achieved 94.95% (200 M parameters) and 96.01% (426 M parameters) on the same CORD dataset.
Along with general improvements in the architectures of the encoder networks of these document-understanding models, pre-training objectives were key to increasing accuracy across the board. The LAMBERT model [11], characterised by 125 M parameters and pre-trained on a dataset of 75 M examples, achieved 96.93% on SROIE (98.17% on the leaderboard) and 94.41% on the CORD dataset. In comparison, the LayoutLM models are pre-trained on a smaller dataset of 11 M examples. On the other hand, GraphDoc [12] is a multi-modal graph attention-based model, where a graph structure is injected into the attention mechanism to form a graph attention layer such that each input node only attends to its neighbourhood. Pre-training is carried out on a masked sentence modelling task using a very small dataset (320 k examples), and the model achieved F1 scores of 98.45% on SROIE and 94.41% on CORD.
More recently, more datasets have been made publicly available [13,14], additional model iterations have been proposed [15,16,17], and general purpose multi-modal large language models have been benchmarked on receipt information extraction tasks [13,18].
AMuRD [13] is a multilingual (English and Arabic) receipt dataset (47,720 examples in total) that has been human-annotated with key information and line items, and in addition, it includes labels pertaining to categories of product description. RealKIE [14] consists of five novel datasets for key information extraction in enterprise documents and includes 370 labelled Federal Communication Commission (FCC) invoices that contain cost information from television advertisements.
DocGraphLM, proposed in [15], combines LayoutLMv3 with graph semantics. This is achieved with a joint encoder architecture and a link prediction approach to reconstruct document graphs. The model achieved an F1 score of 96.93% on the CORD dataset. DocExtractNet [16] integrates image enhancement, a precision-hinting strategy, and a cross-modal fusion module to improve the performance of the pre-trained LayoutLMv3 model in information extraction tasks. The resulting model achieved an F1 score of 97.38% on the CORD dataset, up from 96.56% for the baseline LayoutLMv3. In [17], the authors combine multi-modal alignment and sequence modelling by integrating CLIP (Contrastive Language-Image Pre-training) and a Bidirectional Gated Recurrent Unit (BiGRU) to achieve a lightweight and computationally efficient model. The framework achieves 93.1% on the CORD dataset with a model size of 18 M parameters and is four times faster (when edge-deployed) than LayoutLMv3 (model size of 410 MB and F1 = 95.6% on CORD).
General-purpose multi-modal large language models have been studied in [13], where the authors fine-tuned a 7 B parameter LLaMA V1 on the information extraction task and achieved an F1 score of 97.06 on the AMuRD dataset. In [18], eight multi-modal large language models from the GPT-5 (https://platform.openai.com/docs/models, accessed on 21 October 2025), Gemini 2.5 (https://deepmind.google/technologies/gemini/, accessed on 21 October 2025), and Gemma 3 (https://ai.google.dev/gemma, accessed on 21 October 2025) families are benchmarked on the information extraction task using zero-shot prompting. Gemini 2.5 Pro achieved the highest accuracy: 87.46% on the SROIE dataset (scanned receipts) and 96.5% on the Donut [19] synthetic dataset (clean invoices).
In summary, general-purpose LLMs underperform when compared to specialized document analysis models, and so far, most recent research is on further specializing transformer-based document analysis models for receipt information extraction tasks and, to a lesser extent, developing models that are less demanding on computing resources.

3. Dataset Collection, Curation, and Maintenance

We are interested in extracting useful information from thermally printed receipts, including line items and their attributes, in addition to key information, i.e., name of shop, date, total, and taxes. Since the SROIE and CORD datasets are limited in this respect (annotations and type of receipt), we compiled our own dataset made up of real-world receipts (approximately 465 receipts labelled and annotated with a combination of OCR output and manual work) and a synthetic dataset (approximately 120 samples). The synthetic dataset was machine-generated using purposely defined templates, and it was automatically annotated with appropriate labels.

3.1. Dataset Overview

The first step in data gathering involved the selection of 12 classes that capture the key information typically found in a thermally printed receipt: name of shop, address, telephone, VAT info, date and time, line items, unit price, quantity, amount, total price, VAT included, and miscellaneous (Misc). These categories are the ones used to annotate the text extracted from the receipts via OCR. In addition, we also augmented the real-world dataset with synthetically generated receipts as an attempt to mitigate overfitting during the training of the neural network transformer model.

3.1.1. Real-World Dataset

A total of 465 real-world thermally printed receipts were collected and scanned or photographed. These were then manually annotated with bounding boxes drawn around the full text belonging to each of the 12 classes: for example, a single bounding box encompassing the address as a whole. However, our token classification task requires word-level labels, texts, and bounding boxes. To obtain these, each manually labelled document was passed through an OCR system, which returns a bounding box and the recognized text for each individual word. By automatically checking the intersections between the OCR word boxes and the manually annotated field boxes (intersection over union, IoU), we obtained the word-level classification of all detected textual data in the document. This process is outlined in Figure 2.
In addition, if a bounding box returned by the OCR does not intersect with any of the manual annotations, then the associated word is assigned the Misc class (Other in Figure 2), which is a catch-all class for all extra text found in the document.
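The following is a minimal sketch of this matching step; the box format and function names are illustrative assumptions rather than the exact annotation code.
```python
# Minimal sketch of the word-level label assignment described above.
# Boxes are (x0, y0, x1, y1); names and structure are illustrative.

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    x0 = max(box_a[0], box_b[0]); y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2]); y1 = min(box_a[3], box_b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def label_ocr_words(ocr_words, field_annotations):
    """Assign each OCR word the label of the manual field box it overlaps most.

    ocr_words: list of (word, word_box); field_annotations: list of (label, field_box).
    Words that intersect no manual annotation fall back to the Misc catch-all class.
    """
    labelled = []
    for word, word_box in ocr_words:
        best_label, best_score = "Misc", 0.0
        for label, field_box in field_annotations:
            score = iou(word_box, field_box)
            if score > best_score:
                best_label, best_score = label, score
        labelled.append((word, word_box, best_label))
    return labelled
```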
The process outlined for creating the real-world dataset is error-prone for two reasons: (a) OCR technology is still not fully robust: it may return garbled text (especially on low-quality images), and some text may not be detected at all; and (b) the alignment between the OCR bounding boxes and the manual bounding boxes may not be accurate, resulting in textual data being mislabelled. Therefore, a bespoke tool for manually editing the word-level labels was developed. An example of a receipt with automatically generated and curated annotations is shown in Figure 3, and some further examples from the dataset are shown in Figure 4.
The dataset is organised into a training set (316 samples) and a test set (147 samples). The training set comprises 230 unique shop names, which we consider a good indicator of the variety of spatial forms, font types, and layouts in the dataset. The test set contains 119 unique shop names, 89 of which (61%) do not appear in the training set, i.e., they constitute unseen forms, fonts, and layouts.

3.1.2. Synthetic Dataset Generation

In conjunction with the real-world dataset, a smaller synthetic dataset with “perfect” image quality and annotations was generated. Templates were defined to generate synthetic receipts. Along with general page information, such as margin width, length, and category, the template defines the sequence of attributes (line item, date, time, etc.), their position with respect to the page, their alignment (left, right, or centre), and whether they belong to the same line as their predecessor. The template also contains specific information about the line-item column headers (quantity, description, price, etc.). Each attribute is generated sequentially, as specified in the template. Some attributes always follow a fixed format (for example, date and time), whilst others include an element of randomization, such as the column names (e.g., the quantity header may be one of qty, qty., or Qty). A number of randomized miscellaneous data were also added to achieve a label distribution similar to that of real-world data. Synthetic data were added to the training data of the respective models in the ablation study in Section 7.
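For illustration, the sketch below mimics the template-driven generation described above; the template fields, label names, and attribute generators shown are simplified assumptions rather than the actual templates used in this work.
```python
import random
from datetime import datetime, timedelta

# Illustrative template: ordered attributes, alignment hints, and randomized line-item headers.
TEMPLATE = {
    "attributes": ["name_of_shop", "address", "date_time", "line_items", "total"],
    "alignment": {"name_of_shop": "centre", "total": "right"},
    "qty_headers": ["Qty", "qty", "qty."],   # element of randomization
}

def generate_receipt(template, n_items=3):
    """Generate one synthetic receipt as a list of (text line, label) pairs."""
    lines = [("Sample Shop Ltd", "Name of shop"),
             ("52, Triq il-Kbira, Mosta", "Address")]
    stamp = datetime(2024, 1, 1) + timedelta(days=random.randint(0, 365))
    lines.append((stamp.strftime("%d-%m-%Y %H:%M"), "Date & time"))
    header = random.choice(template["qty_headers"])
    lines.append((f"{header}  Description        Price", "Misc"))
    total = 0.0
    for i in range(n_items):
        qty, price = random.randint(1, 3), round(random.uniform(0.5, 20.0), 2)
        total += qty * price
        lines.append((f"{qty}  Item {i + 1}  {price:.2f}", "Line items"))
    lines.append((f"TOTAL  {total:.2f}", "Total price"))
    return lines
```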

4. Optical Character Recognition Models

Optical Character Recognition (OCR) is a fundamental technology in computer vision and document analysis that aims to automatically detect and recognize textual information from images or videos. Traditionally, OCR systems were designed for scanned documents with clean, printed, and horizontally aligned text [20]. Classical engines such as Tesseract and its modern Python interface, PyTesseract [20], became widely adopted for document digitization and layout understanding, providing robust rule-based and LSTM-based recognition for machine-printed text.
However, with the proliferation of scene text in natural environments (such as street signs, advertisements, and product packaging), modern OCR has evolved to handle a wide variety of complex scenarios characterized by uneven illumination, background clutter, distortions, and diverse font styles [21]. Deep learning frameworks, such as EasyOCR [22] and DocTR [23], have emerged as general-purpose OCR toolkits, leveraging convolutional neural networks and transformer-based architectures to detect and recognize text across multiple languages and irregular layouts.
In recent years, scene text detection and recognition have attracted significant research attention due to their importance in applications such as autonomous driving, assistive reading systems, and visual question-answering [24,25]. A major challenge in this domain is the arbitrary shapes of text regions, which often appear in curved, rotated, or irregular layouts rather than the traditional horizontal orientation. Conventional bounding-box-based detectors struggle to localize such complex shapes precisely, motivating the development of new representations and architectures tailored to irregular text [26].
To address these challenges, numerous deep learning-based methods have been proposed for arbitrary-shaped text detection and end-to-end recognition. Early approaches extended axis-aligned detectors by adopting rotated or quadrilateral bounding boxes [27,28], but these methods often failed to capture curved text boundaries. More recent innovations introduced segmentation-based and contour-based frameworks, which model text regions at the pixel or instance level. For instance, Text Growing on Leaf proposed a flexible region-growing mechanism to progressively expand text regions, effectively adapting to arbitrary geometries [29]. Zoom Text Detector employed a coarse-to-fine strategy that dynamically refines text boundaries with high precision [30]. Meanwhile, Concentric Mask-Based Arbitrary-Shaped Text Detection introduced a novel mask representation to describe complex text contours efficiently [31].
End-to-end frameworks, such as text-spotting transformers [32], SwinTextSpotter [33], and related transformer-based architectures, have further unified text detection and recognition within a single trainable pipeline. These methods leverage self-attention mechanisms and multi-scale feature fusion to jointly localize and decode text, offering robustness against scale variation and irregular shapes. The integration of transformer backbones and hierarchical vision architectures, such as the Swin Transformer [34], has significantly advanced the performance of arbitrary-shaped text-spotting benchmarks.
Despite these advances, several open challenges remain. While lightweight OCR libraries, such as PyTesseract, DocTR, and EasyOCR, offer accessibility and broad language coverage, they generally underperform compared to state-of-the-art research models on complex scene text benchmarks. Persistent issues include robustness under extreme distortions, the efficient processing of high-resolution images, and generalization to unseen text styles and languages [35]. Ongoing research continues to explore more adaptive representations, efficient transformer-based models, and multi-modal learning paradigms that integrate textual, visual, and contextual cues for holistic text understanding in the wild [35,36].
Given that thermal receipt OCR, which is considered in this paper, is a structured, semi-regular text recognition problem, it can potentially be well tackled by DocTR, which can handle mild distortions and noise. On the other hand, arbitrary-shaped text models would be more suitable for complex scenarios involving curved text, e.g., logos, urban street scenery, etc., and would have higher associated computational costs.

4.1. EasyOCR, PyTesseract, and DocTR Models

We considered three distinct open-source OCR engines: EasyOCR (https://github.com/JaidedAI/EasyOCR, accessed on 22 June 2023), PyTesseract (https://github.com/h/pytesseract, accessed on 22 June 2023), and DocTR (https://www.mindee.com/platform/doctr, accessed on 22 June 2023). Commercial OCR services were excluded from consideration. In this section, we start with a short description of these open-source candidates and then discuss how we finally selected DocTR for our experiments.
EasyOCR, developed on the PyTorch deep learning framework, leverages the power of convolutional neural networks (CNNs) for spatial feature extraction and recurrent neural networks (RNNs) for modelling text sequences. EasyOCR supports the use of graphics processing units (GPUs) out of the box, and this feature makes it well-suited for OCR tasks, where computation speed is required. In addition, it supports more than 80 languages, although accuracy varies across languages.
PyTesseract is a Python binding to Google’s Tesseract OCR engine [20]. Tesseract employs a composite approach, which combines traditional rule-based methods with modern long short-term memory (LSTM) networks for character recognition. The architecture is modular, thus enabling various configurations and extensions, although customization is often seen as a feature more relevant to those looking to adapt PyTesseract to specific needs, such as improving the accuracy of specific fonts. One notable drawback is that it is not equipped with GPU capabilities outside of the box. However, it does not require high computational costs when compared to deep learning and transformer-based methods.
DocTR [23] is a transformer-based OCR that exhibits significant performance improvements compared to other state-of-the-art OCR technologies. DocTR employs geometric unwarping and illumination transformers, which relax the need for image pre-processing; this is key to the real-world effectiveness of receipt labelling systems (which will most likely deal with many low-quality images).

4.2. Comparison of OCR Models

There are limited studies comparing OCR models in terms of an accuracy metric in the academic literature. The studies in [37,38] compare OCR models on medical records that take the form of well-organized, medium-to-high-quality forms. In [38], various image pre-processing techniques are applied to scanned medical documents (images), and the OCR engines are compared, with and without pre-processing, on the basis of average accuracy computed from the Levenshtein distance. Tesseract ranks first in accuracy, closely followed by DocTR and EasyOCR, in that order. A more useful study for our case is reported in [39], where DocTR and EasyOCR are compared on the CORD and SROIE receipt datasets. The character error rate (CER) for DocTR is three to five times smaller than that of EasyOCR. In addition, DocTR’s CER is less than double that of the Azure (https://azure.microsoft.com/en-us/products/ai-services/ai-vision, accessed on 27 June 2023), Textract (https://aws.amazon.com/textract/, accessed on 27 June 2023), and Google OCR (https://cloud.google.com/use-cases/ocr, accessed on 27 June 2023) commercial systems. Unfortunately, PyTesseract is not included in that study. We, therefore, carried out a study on a sample from our receipt dataset to compare EasyOCR, PyTesseract, and DocTR directly. As evident in Table 1, we found that DocTR’s performance is superior to that of both PyTesseract and EasyOCR. Therefore, on the basis of accuracy, we chose DocTR as the OCR engine in our receipt labelling pipeline. Moreover, DocTR uses pre-trained detection and recognition models; by default, we use db_resnet50 (https://huggingface.co/smartmind/doctr-db_resnet50, accessed on 27 June 2023) and crnn_vgg16_bn (https://huggingface.co/Felix92/doctr-dummy-torch-crnn-vgg16-bn, accessed on 27 June 2023).
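For reference, a minimal sketch of the OCR stage with DocTR and these default detection/recognition models is shown below; the image path and the flattening of the exported result into word/box lists reflect our pipeline's needs rather than a prescribed DocTR usage pattern.
```python
# Minimal sketch of the OCR stage with DocTR and the default detection/recognition models.
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

predictor = ocr_predictor(det_arch="db_resnet50",
                          reco_arch="crnn_vgg16_bn",
                          pretrained=True)

doc = DocumentFile.from_images("receipt.jpg")   # path is illustrative
result = predictor(doc)

# Flatten the hierarchical export (pages -> blocks -> lines -> words) into word/box lists.
words, boxes = [], []
for page in result.export()["pages"]:
    for block in page["blocks"]:
        for line in block["lines"]:
            for word in line["words"]:
                (x0, y0), (x1, y1) = word["geometry"]   # relative coordinates in [0, 1]
                words.append(word["value"])
                boxes.append([x0, y0, x1, y1])
```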

5. Token Classification Using LayoutLMv3

In this section, we briefly outline the architecture of LayoutLMv3 and the setup (using the transformers (https://huggingface.co/docs/transformers/en/index) library, accessed on 12 July 2023) used to train the model head for the receipt-labelling token classification task. We also outline our hyperparameter tuning strategy.

5.1. LayoutLMv3 Overview

As outlined in our discussion on dataset requirements, LayoutLM requires two key pieces of information: a document image and the individual words in the document, along with spatial information (the bounding box of each word in the image). The transformer input is a concatenation of two embeddings, namely, text embeddings and image embeddings. Text embeddings are a combination of embeddings for the words and for their positions. The word embeddings were initialized from a pre-trained RoBERTa model [40]. Position embeddings incorporate one-dimensional information (i.e., the index of the word in the text sequence) and two-dimensional information through the bounding box coordinates. Image embeddings are obtained by rescaling the image (namely, to H × W pixels) and representing it as a three-dimensional array of size C × H × W, where C is the number of channels in the original image. This representation is then split into “patches” of size P × P, which are linearly projected into D dimensions and flattened into a sequence of vectors.
It is worth noting that, unlike prior versions of LayoutLM and other existing multi-modal models, LayoutLMv3 does not use CNNs for image embeddings. Instead, its authors opted for linear embeddings, noting that such embeddings still achieve desirable results while being more performant than CNNs. The embeddings are then fed into a multi-modal transformer, whose last layer outputs contextual information relating the image and the text; this can be used for several downstream tasks. In addition, this transformer was pre-trained on three objectives: masked language modelling, masked image modelling, and word-patch alignment.
Since we used the pre-trained LayoutLMv3 available from the transformers library, we required each bounding box to be scaled to a 1000 × 1000 coordinate space for input into the LayoutLMv3Tokenizer encoder provided by the library. The images were automatically resized to H × W = 224 × 224 pixels by the LayoutLMv3ImageProcessor encoder provided by the same library. In our case, the number of channels was C = 3, and the patch size was P = 16. We used the “base” version of LayoutLMv3, which has a hidden size of D = 768 (as opposed to D = 1024 for the “large” variant). Furthermore, since the pre-trained LayoutLMv3 requires token sequences of length 512, we used a combination of padding and striding to support receipts with more than 512 tokens. We then made use of the offset mappings returned by the LayoutLMv3Tokenizer encoder to associate the inferred labels with the original word-level tokens.
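As an illustration, the following is a minimal sketch of the encoding and inference step using the transformers library; the checkpoint name, example tokens, and box coordinates are illustrative assumptions, and the padding/striding logic for receipts longer than 512 tokens is omitted for brevity.
```python
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

# apply_ocr=False because words and boxes come from our own OCR stage (DocTR).
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base",
                                                apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base",
                                                         num_labels=12)

image = Image.open("receipt.jpg").convert("RGB")
words = ["TOTAL", "12.50"]                            # illustrative OCR tokens
boxes = [[80, 700, 180, 730], [600, 700, 700, 730]]   # scaled to the 0-1000 range

encoding = processor(image, words, boxes=boxes,
                     truncation=True, padding="max_length",
                     max_length=512, return_tensors="pt")
outputs = model(**encoding)
predicted_ids = outputs.logits.argmax(-1).squeeze().tolist()   # per sub-token label ids
```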

5.2. Hyperparameter Tuning

In all our experiments, the learning rate and (per device) training batch size chosen were established through a hyperparameter search using the optuna (https://optuna.org, accessed on 12 July 2023) suite of tools. The objectives set were as follows: (a) minimization of the evaluation loss, which in this case is the cross-entropy loss, and  (b) maximization of the micro-F1 score. In all cases considered, the (per device) training batch size had little to no effect on the overall performance, whereas the learning rate had the largest effect.
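A minimal sketch of such a search with optuna is shown below; train_and_evaluate is a placeholder standing in for training the classifier head with the sampled hyperparameters and reporting validation metrics, and the search ranges and trial count are illustrative assumptions.
```python
import optuna

def train_and_evaluate(learning_rate, batch_size):
    """Placeholder: train the classifier head with the given hyperparameters and
    return (evaluation cross-entropy loss, micro-F1) on the validation set."""
    raise NotImplementedError

def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-4, log=True)
    batch_size = trial.suggest_categorical("per_device_train_batch_size", [2, 4, 8])
    eval_loss, micro_f1 = train_and_evaluate(learning_rate, batch_size)
    return eval_loss, micro_f1

# Multi-objective study: minimize evaluation loss, maximize micro-F1.
study = optuna.create_study(directions=["minimize", "maximize"])
study.optimize(objective, n_trials=20)
print(study.best_trials)   # Pareto-optimal trials
```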

6. Rule-Based Model

A rule-based model was implemented to operate either as a standalone inference model or to further improve and error-correct the multi-modal transformer’s classifications. When the model is used in conjunction with LayoutLM, the rule-based model has the capability of considering the transformer’s outputs before applying any rules. The rule-based model is made up of a series of regex-based pattern-matching rules, tabular-data feature extraction rules, rules based on database searches, and comparison-based error-correcting rules. The following sub-sections describe these features.

6.1. Pattern-Matching Rules

Pattern-matching rules form the core of the rule-based model’s ability to identify and classify useful information in printed receipts. These rules rely on regular expressions and logic to match specific text patterns and classify them into twelve predefined categories. The rules are flexible and adaptable, capable of handling various text formats and common OCR errors (such as the interpretation of “.” as “,” or “O” as “0” and vice-versa). The pattern-matching rules can also run standalone and do not consider pre-classified tokens. We describe the implemented pattern-matching rules below; a simplified regex sketch follows the list:
  • TOTAL: This identifies sequences of words in a line of a document that contains the total amount due, often found at the end or in the second half of receipts. The rule looks for keywords like TOTAL, AMOUNT DUE, or their variations, followed by a numeric value, which is typically the total amount. If such a pattern is detected, and no other words like SUBTOTAL, VAT, or NET are present in the line (e.g., so as not to match VAT TOTAL or NET TOTAL), the rule labels the line as Total.
  • Example matches are TOTAL: $123.45; AMOUNT DUE: €678.90; and TOTAL $234.56.
  • DATE_TIME: This identifies dates and times in various formats within a document. It recognises month names in both the English and Maltese languages, including abbreviations (as well as common typos), and it supports a wide range of date and time formats, including the following: DD-MM-YYYY or D-M-YY, Month DD, YYYY, and HH:MM:SS AM.
  • When a match is found, the sequence of words in a line corresponding to the match is labelled as Date & time.
  • Example matches are 23-08-2024, 23 August 2024, and 12:45 PM.
  • VAT_INCLUDED: This identifies lines that mention VAT or tax amounts. It searches for patterns that indicate VAT rates or amounts, such as VAT TOTAL, TAX @, or RATE, followed by a percentage and an amount. When a match is found, the sequence of words in a line corresponding to the match is labelled as VAT included.
  • Example matches are VAT @ 18%: $12.34 and TAX TOTAL: €5.67.
  • VAT_INFO: This identifies VAT numbers and related information in a document. It recognises patterns that include VAT registration numbers or other tax identification numbers, which are often preceded by keywords like VAT NO, VAT REGISTRATION, or EXO. Common OCR errors are accounted for as well, e.g., EX0 instead of EXO, MI instead of MT in the VAT number, etc. When a match is found, the sequence of words in a line corresponding to the match is labelled as VAT info.
  • Example matches are VAT NO: MT1234-5678, VAT REGISTRATION: 1234-5678, and EXO NO: 9876.
  • TELEPHONE: This identifies telephone numbers within a document. It searches for keywords like TEL, PHONE, or MOBILE, followed by a numeric pattern that corresponds to a phone number. The rule also handles the specification of Malta’s country code in either one of two formats: +356 or 00356. Furthermore, the rule is capable of recognising multiple phone numbers listed in sequence. When a match is found, the sequence of words in a line corresponding to the match is labelled as Telephone.
  • Example matches are TEL: +356 8123 4567, PHONE NO: 00356-8123-4567, and MOBILE: 3999 9999/+356 8111 1111.
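As an illustration, the sketch below shows simplified versions of the TOTAL and DATE_TIME rules; the actual expressions in our implementation are more elaborate (e.g., handling OCR confusions such as “O”/“0”), and the patterns shown here are simplified assumptions.
```python
import re

# Simplified TOTAL rule: a total keyword followed by an amount, rejecting lines
# that also contain SUBTOTAL, VAT, or NET (e.g., "VAT TOTAL", "NET TOTAL").
TOTAL_RE = re.compile(r"\b(TOTAL|AMOUNT\s+DUE)\b\s*:?\s*[€$]?\s*(\d+[.,]\d{2})",
                      re.IGNORECASE)
EXCLUDE_RE = re.compile(r"\b(SUBTOTAL|VAT|NET)\b", re.IGNORECASE)

# Simplified DATE_TIME rule: DD-MM-YYYY / D-M-YY dates and HH:MM(:SS) times.
DATE_RE = re.compile(r"\b\d{1,2}[-/.]\d{1,2}[-/.]\d{2,4}\b")
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(:\d{2})?\s*(AM|PM)?\b", re.IGNORECASE)

def label_line(line):
    """Return the rule-based label for a line of receipt text, or None."""
    if TOTAL_RE.search(line) and not EXCLUDE_RE.search(line):
        return "Total price"
    if DATE_RE.search(line) or TIME_RE.search(line):
        return "Date & time"
    return None

print(label_line("TOTAL: $123.45"))       # -> Total price
print(label_line("VAT TOTAL: €5.67"))     # -> None (excluded)
print(label_line("23-08-2024 12:45 PM"))  # -> Date & time
```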

6.2. Rules for the Classification of Named Entities

Names of shops and addresses consist mainly of proper names corresponding to entities. Therefore, a database approach was considered for labelling these categories. The general idea is to build a database (one for each class) that is then queried with the text to be classified. If the query returns a match, the text is labelled as either ‘Name of shop’ or ‘Address’. The two databases were initiated from data available in the real-world-receipt training set. In addition, we experimented with location data obtained from publicly available post office data, on-the-fly queries to Google APIs, and web scraping, but these would incur either an additional fiscal or search cost. A human-in-the-loop process can also be used to add entries to the databases over time. One limitation of the database approach is the rate at which the databases’ size and search time grow.
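A minimal sketch of this lookup, assuming a simple normalized in-memory set built from the training-set annotations, is given below; the entries and normalization shown are illustrative.
```python
# Illustrative named-entity lookup built from training-set annotations.
def normalize(text):
    """Lowercase and collapse punctuation/whitespace for robust matching."""
    return " ".join(text.lower().replace(",", " ").replace(".", " ").split())

shop_db = {normalize(s) for s in ["Sample Shop Ltd", "Corner Bakery"]}   # from training data
address_db = {normalize(a) for a in ["52, Triq il-Kbira, Mosta"]}        # from training data

def lookup(text):
    """Return 'Name of shop' or 'Address' on a database hit, otherwise None."""
    key = normalize(text)
    if key in shop_db:
        return "Name of shop"
    if key in address_db:
        return "Address"
    return None
```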

6.3. Line-Item Extraction and Classification

Spatial and pattern-matching techniques were employed to extract line-item information from a receipt. This process involves identifying regions containing tabular data based on text alignment and spacing patterns, which typically represent the region of a receipt that includes line-item details.
The system recognizes tabular data by analyzing the horizontal spacing between words and grouping them into columns. This allows the identification of line-item sections, as illustrated in Figure 5, where each column typically represents specific information, such as item descriptions, quantities, prices, and total amounts.
Sometimes, there may be other non-line-item data presented in a tabular manner, such as VAT information. Therefore, a simple rule-based analysis is carried out to exclude lines containing certain keywords so that the identified region is restricted to the line items. Furthermore, on occasion, the line items span over two lines, with the line-item description on the first line and the price, quantity, and amount printed on the second. Therefore, the line-item segment of the receipt alternates between tabular and non-tabular data. This scenario is also handled to ensure the proper identification of the line-item regions.
Once the columns are identified, they are classified according to their content. Text-only columns are classified as item descriptions, whereas columns containing numerical data are classified as quantities, unit prices, or amounts. This classification follows predefined rules that evaluate the structure and format of the data within each column.
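The following is a simplified sketch of the column grouping and classification just described; the gap threshold, box format, and numeric test are illustrative assumptions.
```python
# Simplified column grouping for a candidate line-item region: words on one printed
# line are sorted by x-coordinate and split into columns wherever the horizontal gap
# exceeds a threshold. Threshold (in pixels) and (text, box) format are illustrative.
def split_into_columns(line_words, gap_threshold=40):
    """line_words: non-empty list of (text, (x0, y0, x1, y1)) for one printed line."""
    line_words = sorted(line_words, key=lambda w: w[1][0])
    columns, current = [], [line_words[0]]
    for word in line_words[1:]:
        prev_x1 = current[-1][1][2]
        if word[1][0] - prev_x1 > gap_threshold:
            columns.append(current)
            current = [word]
        else:
            current.append(word)
    columns.append(current)
    return columns

def classify_columns(columns):
    """Mark each column as numeric or text; further rules then map numeric columns
    to quantity, unit price, or amount, and text columns to item descriptions."""
    labels = []
    for col in columns:
        text = " ".join(w[0] for w in col)
        tokens = text.replace("€", "").replace("$", "").split()
        is_numeric = tokens and all(t.replace(".", "").replace(",", "").isdigit()
                                    for t in tokens)
        labels.append("numeric" if is_numeric else "text")
    return labels
```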

6.4. Contextual Sequence Relabelling Rules

These rules work by evaluating the context of each word token in relation to its neighbouring word tokens on the same line, and they are applicable to classes whose content may consist of several word tokens. For example, a line of text pertaining to the address of a shop might be “52, Guzeppi Caruana str., Santa Lucia”, in which case we expect all six word tokens to be labelled as ‘Address’. However, some word tokens may have been mislabelled; therefore, these rules adjust the labels to maintain consistency across the sequence. For example, suppose that one of the words was misclassified by the classifier model as being part of the name of the shop (which is possible given that these two types of fields are typically close to each other), and we see the following sequence of label–confidence pairs: [(‘Address’, 0.95), (‘Name of shop’, 0.41), (‘Address’, 0.87)]. We know that the name of the shop typically precedes the address, and we also see that we have a ‘Name of shop’ label with a low confidence score (i.e., <0.5) padded on either side by ‘Address’ labels having high confidence scores from the classifier. Therefore, it is reasonable to write an error-corrective rule that flips the ‘Name of shop’ label to ‘Address’. A similar process is used to look for consistency among the ‘Name of shop’ and ‘Line items’ classes, since these also exhibit the structure of a sequence of word tokens.
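A simplified sketch of such a relabelling rule is given below; the one-token sandwiching window and the reuse of the neighbours' confidence are illustrative assumptions, while the 0.5 confidence threshold follows the example above.
```python
# Flip a low-confidence label that is sandwiched between two high-confidence
# labels of the same sequence class on one printed line.
SEQUENCE_CLASSES = {"Address", "Name of shop", "Line items"}

def relabel_line(tokens, low=0.5):
    """tokens: list of (word, label, confidence) for one printed line."""
    fixed = list(tokens)
    for i in range(1, len(tokens) - 1):
        word, label, conf = tokens[i]
        _, prev_label, prev_conf = tokens[i - 1]
        _, next_label, next_conf = tokens[i + 1]
        if (conf < low and prev_label == next_label and label != prev_label
                and prev_label in SEQUENCE_CLASSES
                and prev_conf >= low and next_conf >= low):
            fixed[i] = (word, prev_label, min(prev_conf, next_conf))
    return fixed

line = [("52,", "Address", 0.95), ("Guzeppi", "Name of shop", 0.41),
        ("Caruana", "Address", 0.87), ("str.,", "Address", 0.90)]
print(relabel_line(line))   # the middle token is flipped to Address
```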

6.5. Joint Rule-Based LayoutLM Model

The LayoutLMv3 classifier model is executed in conjunction with the rule-based model according to the following pipeline:
  • Extract receipt image and execute OCR to obtain text and bounding boxes.
  • Predict labels for each bounding box and text from the LayoutLMv3 multi-modal classifier.
  • Execute the line-item processor:
    (a) Detect tabular layouts, organize bounding boxes and text into columns, and add labels on the basis of heuristic criteria.
    (b) Iterate over each bounding box and the corresponding LayoutLM label confidence (a sketch of this merging logic is given after the list):
      - If the label confidence is greater than the replace threshold (set to 0.6 in our experiments, slightly above the average scores obtained for rule-based line-item extraction), skip to the next bounding box.
      - Else, if the existing label matches the rule-derived one, update the confidence to 1.0.
      - Otherwise, replace the label with the rule-derived one, and set the label confidence to 0.5.
  • Execute the regex-based rules.
    Since these rules are based on manually written regex expressions, matches are considered as correct, and hence, the result overwrites LayoutLM’s decision.
  • Execute the contextual sequence relabelling rules.
  • Return the labelled bounding boxes and text.
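The following is a schematic sketch of the confidence-based merging in step 3(b); the data structures and function name are illustrative, while the 0.6 threshold and the 1.0/0.5 confidence updates follow the description above.
```python
REPLACE_THRESHOLD = 0.6   # slightly above average rule-based line-item scores

def merge_line_item_labels(tokens, rule_labels):
    """tokens: list of dicts with 'label' and 'confidence' from LayoutLMv3;
    rule_labels: label proposed by the line-item processor for each token (or None)."""
    for token, rule_label in zip(tokens, rule_labels):
        if rule_label is None or token["confidence"] > REPLACE_THRESHOLD:
            continue                                   # trust the transformer
        if token["label"] == rule_label:
            token["confidence"] = 1.0                  # both agree: boost confidence
        else:
            token["label"] = rule_label                # disagree: take the rule label
            token["confidence"] = 0.5
    return tokens
```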

7. Results and Discussion

In total, seven different models were evaluated and compared for the receipt-labelling task. All models were evaluated using the same hold-out test set, which consisted of 147 real-world samples. The models are categorized and outlined below:
  • Rule-Based Model: The RULE model is based on the work presented in Section 6.
  • Layout Language Models: Three LyLM models, based on the work in Section 5, were considered: (a) LyLM-RW: classifier trained on the dataset of real-world thermally printed receipts; (b) LyLM-SA: classifier trained on a dataset of synthetically generated receipts; (c) LyLM-MX: classifier trained on a dataset consisting of a mixture of real-world and synthetically generated receipts.
  • Joint Models: The joint model first runs one of the classifier LyLM models, yielding a labelled receipt, and then, we use the RULE model to further improve classifications. This gives rise to three additional models: LyLM-RW + RULE, LyLM-SA + RULE, and LyLM-MX + RULE.
Our baseline LyLM-MX model achieved an overall F1 score of 0.952. In comparison, the 133 M parameter LayoutLMv3 base model in [1] achieved an F1 score of 0.966 on the CORD dataset (11,000 Indonesian receipts and five superclasses). The difference can be attributed to our much smaller dataset and larger number of classes.
From the F1 scores reported in Table 2, it is apparent that the overall best-performing model is the one whose LayoutLM classifier is trained on a mixture of real-world and synthetic data and that is used jointly with the rule-based model: LyLM-MX + RULE. The rule-based model improves the baseline LyLM-MX by almost 3%. We note that, similarly, the larger 368 M parameter LayoutLMv3 model in [1] achieved an F1 score of 0.975 on the CORD dataset, a 1% increase over the smaller model. LyLM-MX + RULE is, however, much less demanding on computational resources.
It is clear that the addition of synthetic data had only a very small positive effect on the overall performance of the joint LayoutLM-rule-based models (LyLM-RW + RULE and LyLM-MX + RULE). On the other hand, the addition of the rule-based model improves the LyLM-SA model, trained solely on the synthetic dataset, by approximately 10%.
Table 3 lists the per-label F1 scores for all models. It is apparent that training solely on synthetic data is not sufficient for labelling real-world receipts. It follows that the synthetic dataset requires either more variation in formats or added noise. On the other hand, when compared to LyLM-RW, the LyLM-MX model improves on certain classes (e.g., name of shop, address) that also fare relatively well in LyLM-SA. In addition, LyLM-MX, when compared to LyLM-RW, marginally degrades on other classes (e.g., telephone, VAT info) that do not perform well in LyLM-SA. These results indicate that an update to the synthetic dataset could help improve performance on some other classes. Moreover, the addition of the rule-based model to the LyLM models mostly improves the “telephone” and “VAT info” classes, followed by “date and time”. For some classes, like “total”, the same joint model fares marginally worse when compared to the respective LyLM model.
When the rule-based model is used on its own, the average F1 score is 0.569. Overall, this is, by far, the worst model. However, it is far more efficient in terms of memory and CPU time (see the end of this section), and further improvements to this model can be very useful, especially for cases where only a subset of information is required. The development process of the rule-based model demonstrated the difficulty of extracting line items from thermally printed receipts, where information is formatted in various spatial layouts, presumably owing to the limited width of the paper. The per-label results for the rule-based model in Table 3 are a testament to this issue. The LyLM models were surprisingly effective at extracting individual row and column items.
Figure 6 depicts the confusion matrices for all classes for the LyLM-MX + RULE model. The true positive rates (TPRs) for the line-item columns “unit price” (0.95) and “amount” (0.938) are lower than for the other classes, barring “VAT included”, which scores the lowest TPR (0.925). An error analysis revealed that this could be due to the currency symbol being included in the line-item rows along with the numerical “unit price” and “amount” values. Other classes that need improvement in the rule-based model are “total” (F1 = 0.781), “VAT included” (F1 = 0.140), and “name of shop” (F1 = 0.310).
We now carry out a limited comparison (due to the use of different datasets) with the related work discussed in Section 2. In [6], a rule-based model was developed and tested on 90 Swedish receipts. The F1 scores for this model are VENDOR = 0.455 (ours: F1 = 0.310), ADDRESS = 0.427 (ours: F1 = 0.439), DATE = 0.923 (ours: F1 = 0.973), TAX RATE = 0.697 (ours: F1 = 0.140), PRICE = 0.833 (ours: F1 = 0.781), and PRODUCTS = 0.127 (ours: F1 = 0.447, calculated as a weighted average across all four line-item fields). Our model performed significantly better in extracting PRODUCT data and marginally better for ADDRESS and DATE; however, it was marginally worse in extracting PRICE (total), significantly worse for the VENDOR name, and substantially worse for TAX. Interestingly, our tax extraction rule improved the output of the LyLM-RW and LyLM-MX models but failed when the rule-based model was used on its own. With the addition of database-updating strategies, as discussed in Section 6, the scores for “address” and “name of shop” should increase. However, these classes could also benefit from pattern recognition rules (as in [6]) or, perhaps, from a lightweight named-entity-recognition model.
The receipts that make up the training dataset were predominantly (96%) in the English language. We, therefore, carried out a small-scale study on a limited number of receipts, 26 in total, roughly uniformly split across four European languages (Polish, Spanish, Italian, and German). As shown in Table 4, the average scores obtained (LyLM-MX + RULE model) are lower than those for the English language. Most notably, detecting VAT information is difficult across the board, possibly due to language and regional differences. Although these fields are present in most receipts, the model failed to detect text pertaining to the “VAT included” or “telephone” categories in Polish receipts, probably due to the way these receipts are structured. Detecting “total” in the Italian examples seems to be difficult, whilst scores for receipts in German are low on “name of shop”, “address”, and “telephone”. Polish receipts have the smallest macro-average score. Some Italian receipts have a combined “Quantity-Unit Price” field, for example, “2 × 6.50”, which our model fails to resolve, whereas such fields are correctly labelled in Polish receipts, where there is a space between the quantity and the “×” symbol. It is clear that examples in the target languages need to be added to the training dataset.
Finally, we measured the CPU computational time and memory requirements of all methods. The measurements were carried out on an Apple M3-processor-based machine. All methods based on the LayoutLMv3 model took ≈480 ms for inference. As expected, the addition of the rule-based model did not substantially increase computational times. Once loaded, the transformer model occupies ≈480.61 MB. This is close to the theoretical value of 480 MB, assuming 125.9 M parameters stored as float32 numbers. The memory requirement increases by ≈400 MB to a total of ≈900 MB due to activations, the image, and other overheads. The rule-based model takes ≈1.1 ms when the Address and Name of Shop database searches are excluded, increasing to ≈4.3 ms when they are included. The latter is of particular concern since the database’s size can keep growing, and other rule-based or model-based approaches should be considered, as outlined above. The memory occupied by the rule-based model is <0.3 MB. These results are interesting from a deployment point of view. As it stands, and with an improvement in extracting the Total and Name of Shop fields, the rule-based model can be used when only the minimal subset of information is needed, i.e., the supplier’s name, date, and total amount. In this case, an improved rule-based model offers a computational advantage over the neural network transformer model.

8. Conclusions and Limitations

From the results reported in Section 7, we conclude that joint models (neural network plus rule-based) show promise in matching the performance of larger neural network models in receipt information extraction tasks. The addition of the rule-based model improved the neural network by almost 3% to a final F1 score of 0.98. The neural network model (LyLM-RW), on its own, achieved a score of F1 = 0.952, averaged over all 12 classes, with relatively few real-world samples.
Despite the promising results, the study carried out on receipts in other languages showed that there are language and jurisdictional/regulatory limitations. Nonetheless, owing to the relatively small real-world sample size required to achieve respectable accuracy from the LyLM models, we believe that the effort required to support new languages and regions is reasonable, since it depends on collecting data and retraining the model’s classification head. Furthermore, it is envisioned that the model will eventually be used in a product where end-users have the option to manually correct annotated receipts, which can then be stored and used to periodically fine-tune the LyLM model. Therefore, LyLM can progressively learn receipts in new languages and regions.
However, additional manual effort is required to add language and regional support to the rule-based model, since rules operating on some classes, for example, telephone number or tax information, would have to be hand-crafted depending on the various formats used in a language or geographical region. Future work must also include further development on the tabular-data feature extraction pipeline of the rule-based model (which is language-agnostic) and on the “total” and “VAT amount” fields. With improvements to the rule-based model’s rules and to the way it is combined with the neural network classifier model, it may be possible to use a lower-complexity neural network, thus reducing demand on the computing resources.

Author Contributions

Conceptualization, A.M. and G.V.; methodology, L.G., X.M., L.K. and A.M.; software, L.G., X.M. and L.K.; validation, L.G., X.M. and L.K.; investigation, L.G. and X.M.; resources, A.B., L.G., L.K. and X.M.; data curation, A.B., L.K., L.G. and X.M.; writing—original draft preparation, X.M., L.G. and A.B.; writing—review and editing, L.K., A.M. and G.V.; supervision, G.V. and A.M.; project administration, A.M.; funding acquisition, A.M. and G.V. All authors have read and agreed to the published version of this manuscript.

Funding

This research was funded by Xjenza Malta (formerly Malta Council for Science and Technology) FUSION: Technology Development Programme, grant number R&I-2018-026-T.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The dataset presented in this article is not readily available because it includes non-redacted information on receipts.

Acknowledgments

The authors would like to thank Dylan Galea and Said Boumaraf, who helped in the early preparation and analysis of the data.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence;
BERT: Bidirectional Encoder Representations from Transformers;
CNN: Convolutional Neural Network;
CORD: Consolidated Receipt Dataset;
CPU: Central Processing Unit;
GPU: Graphics Processing Unit;
ITM: Image–Text Matching;
LLM: Large Language Model;
LyLM: Layout Language Model;
LSTM: Long Short-Term Memory;
MLM: Masked Language Model;
MVLM: Masked Visual-Language Model;
OCR: Optical Character Recognition;
RNN: Recurrent Neural Network;
SROIE: Scanned Receipts Optical Character Recognition and Information Extraction;
TPR: True Positive Rate;
VAT: Value-Added Tax;
WPA: Word-Patch Alignment.

References

  1. Huang, Y.; Lv, T.; Cui, L.; Lu, Y.; Wei, F. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv 2022, arXiv:2204.08387. [Google Scholar]
  2. Huang, Z.; Chen, K.; He, J.; Bai, X.; Karatzas, D.; Lu, S.; Jawahar, C.V. ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, NSW, Australia, 20–25 September 2019; pp. 1516–1520. [Google Scholar] [CrossRef]
  3. Huang, Z.; Chen, K.; He, J.; Bai, X.; Karatzas, D.; Lu, S.; Jawahar, C.V. ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction. arXiv 2021, arXiv:2103.10213. [Google Scholar] [CrossRef]
  4. Park, S.; Shin, S.; Lee, B.; Lee, J.; Surh, J.; Seo, M.; Lee, H. CORD: A Consolidated Receipt Dataset for Post-OCR Parsing. In Proceedings of the Workshop on Document Intelligence at NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  5. Yue, A. Automated Receipt Image Identification, Cropping, and Parsing. 2018. Available online: https://api.semanticscholar.org/CorpusID:49555566 (accessed on 13 June 2023).
  6. Lazic, M. Using Natural Language Processing to Extract Information from Receipt Text. Master’s Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2020. Available online: https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-279302 (accessed on 13 June 2023).
  7. Palm, R.B.; Winther, O.; Laws, F. CloudScan—A configuration-free invoice analysis system using recurrent neural networks. arXiv 2017, arXiv:1708.07403. [Google Scholar]
  8. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  9. Xu, Y.; Li, M.; Cui, L.; Huang, S.; Wei, F.; Zhou, M. LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 1192–1200. [Google Scholar] [CrossRef]
  10. Xu, Y.; Xu, Y.; Lv, T.; Cui, L.; Wei, F.; Wang, G.; Lu, Y.; Florencio, D.; Zhang, C.; Che, W.; et al. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2579–2591. [Google Scholar] [CrossRef]
  11. Garncarek, Ł.; Powalski, R.; Stanisławek, T.; Topolski, B.; Halama, P.; Turski, M.; Graliński, F. LAMBERT: Layout-Aware Language Modeling for Information Extraction. In Document Analysis and Recognition—ICDAR 2021; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 532–547. [Google Scholar] [CrossRef]
  12. Zhang, Z.; Ma, J.; Du, J.; Wang, L.; Zhang, J. Multimodal Pre-training Based on Graph Attention Network for Document Understanding. arXiv 2022, arXiv:2203.13530. [Google Scholar] [CrossRef]
  13. Abdallah, A.; Abdalla, M.; Elkasaby, M.; Elbendary, Y.; Jatowt, A. AMuRD: Annotated Arabic-English Receipt Dataset for Key Information Extraction and Classification. arXiv 2024, arXiv:2309.09800. [Google Scholar]
  14. Townsend, B.; May, M.; Mackowiak, K.; Wells, C. RealKIE: Five Novel Datasets for Enterprise Key Information Extraction. arXiv 2025, arXiv:2403.20101. [Google Scholar]
  15. Wang, D.; Ma, Z.; Nourbakhsh, A.; Gu, K.; Shah, S. DocGraphLM: Documental Graph Language Model for Information Extraction. arXiv 2024, arXiv:2401.02823. [Google Scholar]
  16. Yan, Z.; Ye, Z.; Ge, J.; Qin, J.; Liu, J.; Cheng, Y.; Gurrin, C. DocExtractNet: A novel framework for enhanced information extraction from business documents. Inf. Process. Manag. 2025, 62, 104046. [Google Scholar] [CrossRef]
  17. Yu, J.M.; Ma, H.J.; Kong, J.L. Receipt Recognition Technology Driven by Multimodal Alignment and Lightweight Sequence Modeling. Electronics 2025, 14, 1717. [Google Scholar] [CrossRef]
  18. Berghaus, D.; Berger, A.; Hillebrand, L.; Cvejoski, K.; Sifa, R. Multi-Modal Vision vs. Text-Based Parsing: Benchmarking LLM Strategies for Invoice Processing. arXiv 2025, arXiv:2509.04469. [Google Scholar]
  19. Kim, G.; Hong, T.; Yim, M.; Nam, J.; Park, J.; Yim, J.; Hwang, W.; Yun, S.; Han, D.; Park, S. OCR-Free Document Understanding Transformer. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXVIII. Springer: Berlin/Heidelberg, Germany, 2022; pp. 498–517. [Google Scholar] [CrossRef]
  20. Smith, R. An Overview of the Tesseract OCR Engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Curitiba, Brazil, 23–26 September 2007; pp. 629–633. [Google Scholar]
  21. Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S.; et al. ICDAR 2015 competition on robust reading. In Proceedings of the ICDAR, Tunis, Tunisia, 23–26 August 2015. [Google Scholar]
  22. JaidedAI: EasyOCR. 2023. Available online: https://github.com/JaidedAI/EasyOCR (accessed on 7 November 2024).
  23. Mindee. docTR: Document Text Recognition. 2021. Available online: https://github.com/mindee/doctr (accessed on 7 November 2024).
  24. Shi, B.; Bai, X.; Belongie, S. An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2298–2304. [Google Scholar] [CrossRef]
  25. Busta, M.; Neumann, L.; Matas, J. Deep TextSpotter: An End-to-End Trainable Scene Text Localization and Recognition Framework. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  26. Yao, C.; Bai, X.; Liu, W.; Ma, Y.; Tu, Z. Detecting texts of arbitrary orientations in natural images. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012. [Google Scholar]
  27. Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An Efficient and Accurate Scene Text Detector. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  28. Liao, M.; Shi, B.; Bai, X. TextBoxes++: A Single-Shot Oriented Scene Text Detector. IEEE Trans. Image Process. 2018, 27, 3676–3690. [Google Scholar] [CrossRef]
  29. Cheng, J.; Yang, C.; Chen, M.; Yuan, Y.; Wang, Q. Text Growing on Leaf. IEEE Trans. Multimed. 2023, 25, 9029–9043. [Google Scholar] [CrossRef]
  30. Yang, C.; Chen, M.; Yuan, Y.; Wang, Q. Zoom Text Detector. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 15745–15757. [Google Scholar] [CrossRef] [PubMed]
  31. Yang, C.; Chen, M.; Xiong, Z.; Yuan, Y.; Wang, Q. CM-Net: Concentric Mask Based Arbitrary-Shaped Text Detection. IEEE Trans. Image Process. 2022, 31, 2864–2877. [Google Scholar] [CrossRef] [PubMed]
  32. Zhang, X.; Su, Y.; Tripathi, S.; Tu, Z. Text Spotting Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
  33. Huang, M.; Liu, Y.; Peng, Z.; Liu, C.; Lin, D.; Zhu, S.; Yuan, N.; Ding, K.; Jin, L. SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
  34. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar] [CrossRef]
  35. Long, S.; He, X.; Yao, C. Scene Text Detection and Recognition: The Deep Learning Era. Int. J. Comput. Vis. 2021, 129, 161–184. [Google Scholar] [CrossRef]
  36. Wang, W.-F.; He, Z.-H.; Wang, K.; Wang, Y.-F.; Zou, L.; Wu, Z.-Z. A survey of text detection and recognition algorithms based on deep learning technology. Neurocomputing 2023, 556, 126702. [Google Scholar] [CrossRef]
  37. Ribeiro, M.R.M.; Júlio, D.; Abelha, V.; Abelha, A.; Machado, J. A Comparative Study of Optical Character Recognition in Health Information System. In Proceedings of the 2019 International Conference in Engineering Applications (ICEA), São Miguel, Portugal, 8–11 July 2019; pp. 1–5. [Google Scholar] [CrossRef]
  38. Batra, P.; Phalnikar, N.; Kurmi, D.; Tembhurne, J.; Sahare, P.; Diwan, T. OCR-MRD: Performance analysis of different optical character recognition engines for medical report digitization. Int. J. Inf. Technol. 2024, 16, 447–455. [Google Scholar] [CrossRef]
  39. Hemmer, A.; Coustaty, M.; Bartolo, N.; Ogier, J.M. Confidence-Aware Document OCR Error Detection. In International Workshop on Document Analysis Systems; Sfikas, G., Retsinas, G., Eds.; Springer: Cham, Switzerland, 2024; pp. 213–228. [Google Scholar]
  40. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
Figure 1. Architecture/pipeline of the receipt information extraction system. The LayoutLMv3 model takes as input the receipt image together with the bounding boxes and corresponding enclosed text obtained from an OCR system. The classifier head labels each text segment with 1 of 12 classes, and the manually defined rule-based model corrects perceived errors in the LayoutLMv3 output on the basis of the relative locations of the text segments on the receipt and the classifier's confidence scores.
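To make the figure concrete, a minimal sketch of how such a pipeline can be assembled from the open-source DocTR and HuggingFace LayoutLMv3 implementations is given below; the fine-tuned checkpoint path, the 12-label head, and the final rule-based correction call are illustrative placeholders rather than our exact code.

# Sketch of the pipeline in Figure 1: OCR -> LayoutLMv3 classifier head ->
# rule-based correction. Checkpoint path, label count, and the correction
# function are illustrative assumptions.
import torch
from PIL import Image
from doctr.io import DocumentFile
from doctr.models import ocr_predictor
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

image_path = "receipt.jpg"
ocr = ocr_predictor(pretrained=True)
result = ocr(DocumentFile.from_images(image_path))

# Collect words and bounding boxes (DocTR returns relative coordinates).
words, boxes = [], []
for block in result.pages[0].blocks:
    for line in block.lines:
        for word in line.words:
            (x0, y0), (x1, y1) = word.geometry
            words.append(word.value)
            # LayoutLMv3 expects boxes normalised to a 0-1000 grid.
            boxes.append([int(x0 * 1000), int(y0 * 1000),
                          int(x1 * 1000), int(y1 * 1000)])

processor = LayoutLMv3Processor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "path/to/fine-tuned-checkpoint", num_labels=12)

image = Image.open(image_path).convert("RGB")
encoding = processor(image, words, boxes=boxes,
                     truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits        # shape (1, seq_len, 12)
scores = logits.softmax(-1)
predictions = scores.argmax(-1).squeeze(0).tolist()

# Predictions are per sub-word token and still need to be mapped back to words
# (e.g. via encoding.word_ids()); the rule-based model then revisits
# low-confidence or implausible labels, for example:
# corrected = rule_based_correction(words, boxes, predictions, scores)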
Figure 2. Creation of annotations by combining manually created labels, bounding boxes, and the OCR output.
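A simplified sketch of this combination step, assuming that the manual labels are drawn as rectangles and matched to OCR word boxes by overlap, is given below; the overlap threshold, the fallback "misc" label, and the data structures are illustrative assumptions.

# Combine manually drawn label rectangles with OCR word boxes into annotations,
# assigning each OCR word the label of the rectangle it overlaps most with.

def overlap_ratio(word_box, label_box):
    """Fraction of the word box covered by the label rectangle."""
    x0 = max(word_box[0], label_box[0]); y0 = max(word_box[1], label_box[1])
    x1 = min(word_box[2], label_box[2]); y1 = min(word_box[3], label_box[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    word_area = (word_box[2] - word_box[0]) * (word_box[3] - word_box[1])
    return inter / word_area if word_area else 0.0

def annotate(ocr_words, label_regions, min_overlap=0.5):
    """ocr_words: [(text, box)]; label_regions: [(label, box)] -> annotations."""
    annotations = []
    for text, box in ocr_words:
        best_label, best_ratio = "misc", 0.0
        for label, region in label_regions:
            ratio = overlap_ratio(box, region)
            if ratio > best_ratio:
                best_label, best_ratio = label, ratio
        if best_ratio < min_overlap:
            best_label = "misc"
        annotations.append({"text": text, "box": list(box), "label": best_label})
    return annotations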
Figure 3. An example of a real-world receipt from the training set and parts of the automatically generated and manually curated annotations in JSON format.
Figure 4. Examples from real-world-receipt datasets. Information identifying persons or entities is redacted in this illustration but not in the dataset.
Figure 5. Example of line-item region identification, i.e., detection of the tabular region (bounding boxes in magenta) on a receipt, where some of the information is redacted. Conversely, the bounding boxes in blue are not detected as part of the tabular region.
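A minimal sketch of one way such tabular line-item regions can be discovered from word bounding boxes, by grouping words into text lines and flagging runs of lines that end in a price-like token at a consistent horizontal position, is given below; the regular expression and the tolerances are illustrative assumptions rather than the exact statistical rules in our system.

# Heuristic sketch for spotting the line-item (tabular) region of a receipt:
# group word boxes into text lines by vertical position, then flag lines whose
# last token is price-like and right-aligned consistently across the group.
import re
from statistics import pstdev

PRICE = re.compile(r"^-?\d+[.,]\d{2}$")

def group_into_lines(words, y_tol=0.01):
    """words: [(text, (x0, y0, x1, y1))] with relative coords -> list of lines."""
    lines = []
    for text, box in sorted(words, key=lambda w: (w[1][1], w[1][0])):
        if lines and abs(box[1] - lines[-1][-1][1][1]) <= y_tol:
            lines[-1].append((text, box))
        else:
            lines.append([(text, box)])
    return lines

def tabular_line_indices(lines, x_tol=0.02):
    """Indices of lines that look like line items (price at a stable x position)."""
    candidates = [i for i, line in enumerate(lines) if PRICE.match(line[-1][0])]
    right_edges = [lines[i][-1][1][2] for i in candidates]
    # Keep the candidates only if their right edges are tightly aligned.
    if len(right_edges) >= 2 and pstdev(right_edges) <= x_tol:
        return candidates
    return []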
Figure 6. Per-label confusion matrix for the LyLM-MX + RULE model.
Table 1. Comparison of EasyOCR, Pytesseract, and DocTR on our receipts dataset: best scores in bold. CER—character error rate; WER—word error rate; Avg. Lev.—average Levenshtein distance; ↓—lower is better; ↑—higher is better.
Model         Avg. Lev. ↓   CER ↓    WER ↓    Char. Acc. ↑   Word Acc. ↑   Macro Prec. ↑   Macro Recall ↑   Macro F1 ↑
EasyOCR       1.1525        0.1369   0.463    0.8631         0.537         0.6338          0.6412           0.6367
Pytesseract   0.4703        0.1072   0.1959   0.8928         0.8041        0.8098          0.8098           0.8098
DocTR         0.0696        0.0147   0.0426   0.9853         0.9574        0.9585          0.9574           0.9574
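The character- and word-level figures in Table 1 follow directly from the Levenshtein edit distance; a minimal sketch of how CER, WER, and the corresponding accuracies can be computed is given below (the exact normalisation used by a particular benchmark script may differ slightly).

# Character/word error rates from edit distance: CER and WER are the Levenshtein
# distance between prediction and reference divided by the reference length,
# over characters and whitespace-split words respectively.

def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(prediction, reference):
    return levenshtein(prediction, reference) / max(len(reference), 1)

def wer(prediction, reference):
    ref_words = reference.split()
    return levenshtein(prediction.split(), ref_words) / max(len(ref_words), 1)

# Character and word accuracy in Table 1 are then 1 - CER and 1 - WER.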
Table 2. Micro-, macro-, and weighted-average F1 scores for all models: best scores in bold.
Model            F1 (Micro)   F1 (Macro)   F1 (Weighted Average)
RULE             0.550        0.573        0.569
LyLM-RW          0.956        0.913        0.952
LyLM-RW + RULE   0.979        0.963        0.979
LyLM-SA          0.749        0.475        0.743
LyLM-SA + RULE   0.807        0.643        0.813
LyLM-MX          0.957        0.906        0.953
LyLM-MX + RULE   0.980        0.956        0.980
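The three averages reported in Table 2 can be reproduced with scikit-learn from the flattened gold and predicted label sequences, as in the sketch below; the two short lists are placeholders for the full test-set sequences.

# Micro-, macro-, and weighted-average F1 over the 12 labels.
from sklearn.metrics import f1_score

y_true = ["Total", "Date and time", "Line items", "Misc"]   # placeholder
y_pred = ["Total", "Date and time", "Misc", "Misc"]         # placeholder

for average in ("micro", "macro", "weighted"):
    print(f"F1 ({average}): {f1_score(y_true, y_pred, average=average):.3f}")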
Table 3. Per-label F1 scores for all models: best scores in bold.
Label (Support)        RULE    LyLM-RW   LyLM-RW + RULE   LyLM-SA   LyLM-SA + RULE   LyLM-MX   LyLM-MX + RULE
Name of shop (350)     0.310   0.941     0.941            0.714     0.714            0.956     0.956
Address (882)          0.439   0.974     0.972            0.793     0.801            0.983     0.984
Telephone (209)        0.870   0.739     0.964            0.575     0.929            0.737     0.962
VAT info (427)         0.960   0.661     0.961            0.352     0.718            0.659     0.962
Date and time (430)    0.973   0.912     0.974            0.387     0.940            0.917     0.976
Line items (1840)      0.419   0.978     0.978            0.813     0.808            0.982     0.985
Unit price (180)       0.404   0.969     0.969            0.022     0.224            0.927     0.929
Total (147)            0.781   0.945     0.924            0.164     0.686            0.933     0.916
VAT included (80)      0.140   0.927     0.933            0.241     0.211            0.876     0.892
Misc (7392)            0.590   0.967     0.986            0.851     0.887            0.968     0.987
Quantity (273)         0.407   0.975     0.978            0.719     0.663            0.975     0.973
Amount (516)           0.584   0.971     0.971            0.071     0.207            0.954     0.956
Table 4. Per-label F1 scores (LyLM-MX + RULE model) for four non-English languages.
Label              Polish   Italian   Spanish   German
Name of shop       0.571    0.522     0.857     0.571
Address            0.884    0.724     1.000     0.582
Telephone          0.000    0.889     0.727     0.600
VAT info           0.000    0.471     0.000     -
Date and Time      0.857    1.000     1.000     0.930
Total              0.857    0.500     0.875     0.833
VAT included       0.000    0.889     0.833     1.000
Line items         0.706    0.741     0.984     0.976
Unit price         0.860    0.400     1.000     1.000
Quantity           0.883    0.400     0.968     0.800
Amount             0.680    0.828     1.000     1.000
Misc               0.732    0.917     0.980     0.940
Weighted average   0.742    0.839     0.970     0.902
Macro-average      0.586    0.690     0.852     0.769
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
