Article

KORIE: A Multi-Task Benchmark for Detection, OCR, and Information Extraction on Korean Retail Receipts

by Mahmoud SalahEldin Kasem 1,2, Mohamed Mahmoud 1,3, Mostafa Farouk Senussi 1,3, Mahmoud Abdalla 1 and Hyun Soo Kang 1,*
1 Department of Information and Communication Engineering, School of Electrical and Computer Engineering, Chungbuk National University, Cheongju-si 28644, Republic of Korea
2 Multimedia Department, Faculty of Computers and Information, Assiut University, Assiut 71526, Egypt
3 Information Technology Department, Faculty of Computers and Information, Assiut University, Assiut 71526, Egypt
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(1), 187; https://doi.org/10.3390/math14010187
Submission received: 8 December 2025 / Revised: 26 December 2025 / Accepted: 29 December 2025 / Published: 4 January 2026

Abstract

We introduce KORIE, a curated benchmark of 748 Korean retail receipts designed to evaluate scene text detection, Optical Character Recognition (OCR), and Information Extraction (IE) under challenging digitization conditions. Unlike existing large-scale repositories, KORIE consists exclusively of receipts digitized via flatbed scanning (HP LaserJet MFP), specifically selected to preserve complex thermal printing artifacts such as ink fading, banding, and mechanical creases. We establish rigorous baselines across three tasks: (1) Detection, comparing Weakly Supervised Object Localization (WSOL) against state-of-the-art fully supervised models (YOLOv9, YOLOv10, YOLOv11, and DINO-DETR); (2) OCR, benchmarking Tesseract, EasyOCR, PaddleOCR, and a custom Attention-based BiGRU; and (3) Information Extraction, evaluating the zero-shot capabilities of Large Language Models (Llama-3, Qwen-2.5) on structured field parsing. Our results identify YOLOv11 as the optimal detector for dense receipt layouts and demonstrate that while PaddleOCR achieves the lowest Character Error Rate (15.84%), standard LLMs struggle in zero-shot settings due to domain mismatch with noisy Korean receipt text, particularly for price-related fields (F1 scores ≈ 25%). We release the dataset, splits, and evaluation code to facilitate reproducible research on degraded Hangul document understanding.

1. Introduction

Retail receipts are compact, visually noisy documents that combine dense, small text, variable layouts, and real-world thermal printing artifacts. In Korean receipts, these difficulties are amplified by Hangul syllable blocks, frequent code-switching with Latin characters and numerals, and domain conventions such as currency symbols and vendor-specific formatting. Despite substantial progress in scene text detection and optical character recognition (OCR), Korean retail receipts remain underrepresented in public benchmarks, limiting reproducible evaluation of both visual front-end quality and downstream information extraction.
Across computer vision and adjacent domains, deep learning has advanced through both architectural innovation and practical training strategies. Work on activation functions and optimization has improved gradient flow and convergence in deep convolutional networks (CNNs) [1], while transfer learning from large-scale pretraining has become a default lever for data-scarce tasks [2,3,4]. Similar methodologies drive impact in medical imaging (diagnosis, detection, and classification) [5,6,7,8], natural language-assisted systems [9,10,11,12], applied machine learning [13,14], customer segmentation [15,16], intrusion detection for IoT [17,18], and occlusion removal [19]. These gains, enabled by powerful DCNNs, scalable GPUs, and improved training curricula, yield strong performance in generic text detection and recognition. However, fine-grained document scenarios such as receipts remain challenging due to dense, small text, mixed scripts, layout irregularities, and the need to map raw text into structured key–value fields and item records.
Public datasets have shaped research on document images, but each covers a different slice of the problem. SROIE standardized text localization, recognition, and key information extraction on scanned English receipts [20], catalyzing evaluation but with relatively clean images and a narrow linguistic scope. CORD broadened coverage with non-English (Indonesian) receipts and richer annotations for post-OCR parsing [21]. Multilingual form datasets such as XFUND target key–value understanding across languages but focus on forms rather than retail receipts [22]. More recently, mobile-captured Japanese receipt corpora emphasize in-the-wild imaging conditions and mixed scripts [23], while AMuRD introduces a large multilingual (Arabic/English) collection oriented toward downstream extraction and item classification [24]. Together, these resources advance linguistic diversity, realism, and information extraction, yet there is still no focused, Korean-language benchmark that jointly supports text detection, OCR, and structured receipt understanding.
Against this backdrop, retail receipts are a particularly exacting testbed for end-to-end receipt understanding: the physical degradation of thermal paper yields low-contrast strokes, significant ink fading, and printer head banding, while mechanical handling introduces creases that disrupt dense layouts interleaving small Hangul, Latin characters, and numerals in tight columns [20,21]. To rigorously measure progress across the stages of a receipt pipeline, we present KORIE, a Korean receipts benchmark that spans text detection, OCR, and information extraction under realistic digitization conditions. We provide bounding-box annotations and transcriptions and benchmark diverse detectors (YOLOv9/10/11, DINO 4/5-scale with ResNet-50, Swin-L, and a WSOL-style baseline) [25,26,27,28,29] paired with OCR engines (Tesseract, pytesseract, EasyOCR, and a CRNN/attention-style BiGRU recognizer) [30,31,32,33,34]. Using standardized metrics—mAP for detection and CER/WER for OCR—we report results both on ground-truth boxes and on detector crops, enabling fair, reproducible comparisons and clear attribution of error to localization versus recognition [35]. On top of these, we define an information extraction track over key–value fields and item-level records, allowing us to study how visual errors propagate to structured outputs.
KORIE comprises 748 Korean retail receipts captured under real-world conditions (thermal banding, folds, skew, blur, glare, compression). It provides bounding-box annotations and transcriptions to quantify performance on detection and OCR, and additionally offers structured annotations for key information extraction. From these receipts, we derive 17,587 box-level text images for OCR experiments and 2886 labeled entities and item records for information extraction, covering merchant names, dates, receipt numbers, item descriptions, quantities, prices, and totals. We evaluate YOLOv9/10/11, DINO (4/5-scale, ResNet-50), Swin-L, and a WSOL baseline alongside OCR engines Tesseract, pytesseract, EasyOCR, and a neural Attention-Gated BiGRU. We report mAP for detection and CER/WER for OCR in two settings—(1) OCR on ground-truth boxes (to isolate recognition quality) and (2) OCR on detector crops (to expose compounding errors)—and use the resulting transcriptions as inputs to the information extraction task.
While KORIE enables end-to-end evaluation, it is designed to keep the stages of detection, OCR, and information extraction analytically separable. By providing fine-grained box-level and field-level annotations, the benchmark allows practitioners to measure progress on the core visual tasks that underpin receipt pipelines and to disentangle visual failures from semantic extraction errors. Our analyses highlight persistent challenges specific to Korean receipts—small-font Hangul recognition, jamo fragmentation under blur, confusions among visually similar glyphs (e.g., 1/|/I), and their downstream impact on key fields such as totals and dates—providing a clear target for future model and augmentation design. We release dataset splits, OCR crops, information extraction labels, and evaluation scripts to facilitate reproducible, apples-to-apples comparisons on this important yet under-served domain.
Our work makes the following contributions:
  • We introduce KORIE, a Korean-language receipt dataset comprising 748 scanned thermal receipts with bounding-box annotations and transcriptions, specifically designed to stress dense, small text; mixed scripts; and real-world imaging artifacts.
  • We define a multi-task benchmark covering three foundational stages of receipt understanding—text detection, OCR, and information extraction. We report mAP for detection, CER/WER for recognition under two settings (OCR on ground-truth boxes and on detector crops), and field-level metrics for extracting key–value fields and item records, enabling clear attribution of error across the pipeline.
  • We provide a large-scale OCR subset with 17,587 box-level text images and an information extraction subset with 2886 annotated entities and item lines, covering merchant names, dates, receipt numbers, line items, quantities, prices, and totals, thereby supporting both visual and semantic evaluation.
  • We benchmark a diverse suite of detectors (YOLOv9/10/11, DINO 4/5-scale with ResNet-50, Swin-L, and WSOL-style baselines) and OCR engines (Tesseract, pytesseract, EasyOCR, and an attention-based BiGRU recognizer), and analyze failure modes specific to Korean receipts—such as small-font Hangul degradation, jamo fragmentation, and visually similar glyph confusions—releasing dataset splits and code to facilitate future work on Hangul-aware detection, recognition, and extraction under realistic acquisition conditions.

2. Related Work

2.1. Receipt and Document Datasets

Public datasets have catalyzed progress in document text detection, OCR, and information extraction, but receipt-specific, non-English resources remain limited. SROIE standardized evaluation on scanned English receipts with tasks for text localization, recognition, and key information extraction [20]. CORD broadened coverage to Indonesian receipts and coupled OCR labels with richer hierarchical annotations aimed at post-OCR parsing and entity-level understanding [21]. WildReceipt provided 1768 in-the-wild English receipts with 25 semantic classes, emphasizing diversity in capture conditions and layouts for both detection and field extraction [36]. Beyond receipts, FUNSD and XFUND target form understanding with entity/relation labels and multimodal signals, but focus on forms rather than retail receipts [22,37]. Large-scale scene-text corpora such as COCO-Text, ICDAR MLT, RCTW-17, ReCTS, and TextOCR have improved robustness for multilingual detection/recognition, yet they are not receipt-centric in structure or artifact distribution [38,39,40,41]. Recent efforts move closer to real deployment conditions: mobile-captured Japanese receipt datasets emphasize mixed scripts and handheld imaging [23], while AMuRD introduces a large Arabic–English dataset oriented to downstream extraction and item classification [24]. Compared with these, there is still a gap for a Korean-language receipt corpus that jointly supports text detection, OCR, and structured information extraction.

2.2. Text Detection for Documents and Receipts

Classical pipelines combined binarization, connected components, and heuristic grouping with OCR engines, but struggled under non-uniform illumination, skew, and clutter. Deep detectors improved robustness to complex backgrounds and small text. Regression-based and segmentation-style methods such as CTPN, EAST, PSENet, DB/DB++, and PAN advanced recall and boundary quality for curved, rotated, or densely packed text [42,43,44,45,46]. Transformer-based detectors (e.g., Deformable-DETR variants) extended receptive fields and multi-scale reasoning; DINO introduced denoising anchors and improved convergence, benefiting small-object scenarios that receipts frequently exhibit [28,47]. In parallel, one-stage general-purpose detectors (YOLO family) remain attractive for receipt settings due to speed and deployment simplicity; recent generations (v9/v10/v11) emphasize better label assignment, auxiliary losses, and latency–accuracy trade-offs that transfer well to dense, small text after domain finetuning [25,26,27]. Backbones like Swin-L yield hierarchical features well-suited to small glyphs and long-range context in narrow receipts [48]. Weakly supervised object localization (WSOL) via CAM/score maps provides a low-annotation-cost alternative, though it typically underperforms compared to fully supervised detectors on tightly packed text [29]. Our baseline suite spans these families (YOLO, DINO with ResNet-50, Swin-L backbones, and WSOL-style baselines) to reflect common deployment choices and research trends.

2.3. OCR for Multilingual and CJK Text

Recognition has progressed from lexicon-driven engines to neural sequence models. Tesseract introduced an open LSTM-based recognizer that remains a strong baseline with appropriate language packs and preprocessing [30]. End-to-end neural recognizers such as CRNN (CTC-based) and attention-decoder models improved robustness to irregular spacing and variable-length text lines [33,34]. Subsequent developments incorporate transformer decoders, language modeling, and iterative refinement (e.g., ABINet/PARSeq-style designs), further reducing errors from character confusions and spacing [49]. For Korean and other CJK scripts, challenges include small font sizes, low-contrast strokes from thermal printing, and jamo-level fragmentation under blur or compression; domain-specific normalization and training on script-balanced corpora remain important. Toolkits such as EasyOCR and wrappers like pytesseract are widely used in practice and provide competitive baselines with minimal engineering [31,32]. In this work we benchmark classical (Tesseract/pytesseract) and toolkit-based (EasyOCR) engines alongside a neural attention-based BiGRU recognizer trained on receipt crops to quantify recognition difficulty under controlled conditions.

2.4. End-to-End, OCR-Free, and Information Extraction Trends

Unified, OCR-free models (e.g., encoder–decoder architectures that map images directly to structured text) and multimodal pretraining (LayoutLM-style, vision–language models) have shown promise for downstream understanding, but typically require significant compute and large-scale annotations, and they often conflate localization, recognition, and parsing in evaluation [50,51]. Receipt-focused pipelines frequently retain a modular design, with separate detection, OCR, and information extraction stages, both for engineering flexibility and for diagnostic evaluation. As a result, many works still report decoupled detection and OCR metrics (mAP; CER/WER) alongside field-level precision/recall/F1 for key–value extraction, and study the “coupling gap” when plugging detector crops and imperfect OCR into extraction models. Our benchmark follows this practice to provide reproducible, apples-to-apples comparisons and clear attribution of error to localization, recognition, and extraction.

2.5. Positioning

Relative to prior datasets and methods, KORIE contributes a Korean-language, scanned receipt benchmark that spans detection, OCR, and information extraction while keeping these stages analytically separable. It covers a spectrum of modern detectors (one-stage and transformer-based) and recognizers (classical, toolkit, and neural), and provides structured labels for key–value fields and item lines. This fills a documented gap in receipt-focused, non-English evaluation and provides a compact yet challenging testbed for studying small-text detection, Hangul-aware recognition, and structured receipt understanding under realistic acquisition artifacts.

3. Dataset

3.1. Ethical Considerations and Data Collection

All receipts in KORIE were collected by the authors from restaurants, supermarkets, and retail stores across South Korea. The dataset combines two acquisition modes. The initial portion of the corpus was digitized using an HP Color LaserJet MFP scanner at 300 dpi; although scanned, these receipts already exhibit substantial real-world degradation inherited from the thermal paper itself: ink fading, head band streaking, friction wear, partial creasing, and locally washed-out characters. These artifacts originate from the physical condition of the receipt rather than the digitization process and therefore remain visible even under flatbed scanning. Figure 1 illustrates some examples of the scanned dataset. In addition, KORIE includes a growing subset of mobile-captured receipts, which introduce genuine in-the-wild imaging variations such as blur, shadows, glare, perspective skew, and uneven illumination. Figure 2 illustrates some examples of the mobile-captured subset. The dataset is designed as a hybrid and continuously expanding benchmark, and all references to “challenging” or “real-world” conditions pertain to the physical thermal artifacts and the mobile-captured portion of the corpus.
To protect privacy, we adopted a four-step de-identification pipeline before any annotation took place. First, annotators conducted a line-by-line review of each receipt image to identify potentially sensitive content (e.g., customer identifiers, full card numbers, or loyalty IDs). Second, such fields were manually obscured or removed at the image level. Third, a second annotator verified that no obvious personal identifiers remained. Finally, an independent cross-check was performed on a random subset of receipts to ensure adherence to the privacy guidelines. Only de-identified images were admitted into the final corpus and used for annotation or experiments.

3.2. Annotation Protocol

We established detailed annotation guidelines tailored to Korean retail receipts, covering merchant names, item descriptions, prices, dates, and other transactional fields. The guidelines included specialized protocols for South Korean conventions, such as local date formats, currency notation in Korean won, and mixed-script merchant names that combine Hangul and Latin characters. Ambiguous cases (e.g., unconventional abbreviations or visually degraded text) were documented in a shared guideline and resolved through consensus among annotators.
All receipts were annotated using the MakeSense platform. Annotators drew bounding boxes around textual regions and exported them in both YOLO and COCO formats to support a wide range of detection pipelines. The YOLO annotations follow the standard (x_center, y_center, w, h) normalized representation, while the COCO annotations store (x, y, w, h) in absolute pixel units. For OCR tasks, we developed a custom annotation system that maintains the positional integrity and reading order of Korean text, ensuring that line- and region-level transcriptions can be reliably associated with their corresponding bounding boxes.
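To make the relation between the two exported formats concrete, the following is a minimal sketch of the conversion between YOLO-style normalized boxes and COCO-style absolute boxes, assuming only that the image width and height are known; it is illustrative and not part of the released annotation tooling.

```python
def yolo_to_coco(xc, yc, w, h, img_w, img_h):
    """YOLO (normalized center, size) -> COCO (absolute top-left x, y, width, height)."""
    abs_w, abs_h = w * img_w, h * img_h
    x = xc * img_w - abs_w / 2.0
    y = yc * img_h - abs_h / 2.0
    return x, y, abs_w, abs_h

def coco_to_yolo(x, y, w, h, img_w, img_h):
    """COCO (absolute top-left, size) -> YOLO (normalized center, size)."""
    xc = (x + w / 2.0) / img_w
    yc = (y + h / 2.0) / img_h
    return xc, yc, w / img_w, h / img_h
```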
Beyond generic text regions, we also defined item-specific and field-specific annotations to capture semantics that are particularly relevant for retail receipts. For each item line, annotators labeled the item name, a coarse item classification, quantity, unit or packaging information, and unit/total prices; brand names were recorded when explicitly present on the receipt. At the header and footer levels, we annotated key fields such as merchant name, merchant address, merchant phone number, receipt number, dates and times, subtotals, taxes, and overall totals. These richer annotations were iteratively refined through feedback loops with domain experts familiar with Korean retail formats and form the basis of the information extraction track in KORIE.

3.3. Statistics and Splits

The KORIE dataset comprises 748 Korean retail receipts. Each receipt typically contains a dense mix of small-font Hangul, Latin characters, and numerals, along with store headers, line-item lists, subtotals/totals, tax information, and footer messages. Real-world artifacts such as folds, skew, blur, glare, and thermal banding are prevalent due to in-the-wild capture and aging of the printed receipts.
For OCR-focused experiments, we extracted text-region crops from the annotated receipts, yielding a total of 17,587 word-level images used as OCR instances. Each crop is paired with its transcription and inherits its spatial coordinates from the underlying detection annotations, enabling evaluation both in a “perfect detection” setting (using ground-truth boxes) and in a realistic setting, where crops come from model-predicted boxes.
For information extraction, we annotated 2886 field and item instances spanning merchant-level attributes (e.g., merchant name, receipt number, merchant phone number), temporal and transactional fields (dates, times, subtotals, taxes, totals), and line-item attributes (item names, quantities, unit prices, line totals, and coarse item categories). These labels are linked to their corresponding text regions and receipts, allowing us to study extraction performance under different combinations of ground-truth versus predicted boxes and OCR outputs.
To support fair and reproducible evaluation across all three tasks, we partition KORIE into non-overlapping train, validation, and test splits at the receipt level, so that no receipt appears in more than one split and no text crop or information extraction instance from a given receipt leaks across splits. All bounding boxes, OCR crops, and IE annotations are assigned to splits according to their parent receipt. We adopt a roughly 60/20/20 split for train/validation/test in our experiments; future work can reuse these splits directly or re-partition the corpus depending on the target application, as long as the receipt-level isolation is preserved.
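The released splits are fixed files, but the receipt-level isolation policy can be summarized with the following sketch; the seed and the exact assignment are assumptions for illustration, not the published partition.

```python
import random

def split_by_receipt(receipt_ids, seed=0, ratios=(0.6, 0.2, 0.2)):
    """Partition receipt IDs into train/val/test at the receipt level (roughly 60/20/20)."""
    ids = sorted(receipt_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(ratios[0] * len(ids))
    n_val = int(ratios[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# Every bounding box, OCR crop, and IE instance is then assigned to the split of its
# parent receipt, so no receipt contributes data to more than one split.
```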

Dataset Scale and Ongoing Expansion

Although the current release of KORIE contains 748 receipts, it provides 17,587 OCR crops and 2886 field- and item-level IE annotations, enabling fine-grained evaluation at the region and entity level. KORIE is an actively expanding benchmark: we are continuously adding newly collected mobile-captured receipts, and future public releases will substantially increase the number and diversity of store types, layouts, and acquisition conditions. Because the benchmark is designed for diagnostic analysis rather than leaderboard-scale ranking, small differences in mAP between closely performing detectors (e.g., YOLOv10 vs. YOLOv11) should not be overinterpreted. Future versions will additionally provide bootstrapped confidence intervals to support more rigorous statistical comparison across models.

4. Methodology

Our receipt–understanding pipeline consists of three sequential stages: (1) text detection and localization; (2) optical character recognition (OCR); (3) information extraction (IE). Given an input receipt image, Stage 1 predicts bounding boxes for all textual regions, Stage 2 transcribes each cropped region into raw text, and Stage 3 maps these transcriptions into a structured representation containing merchant-level fields and item-level attributes. This modular design allows for each stage of the pipeline to be evaluated independently as well as jointly. Figure 3 provides an overview of the full processing workflow.

4.1. Information Detection

The first stage of our pipeline is the localization of text regions on Korean receipts. Given an input receipt image I, the goal is to predict a set of bounding boxes {b_i} that tightly cover textual content (headers, item lines, totals, and footer messages) while remaining robust to blur, skew, and thermal artifacts. On KORIE, we treat detection as a single-class problem (“text”) and evaluate using mean Average Precision (mAP) at IoU thresholds following standard object detection practice.
To study how different modeling paradigms behave on dense, small-text receipts, we consider three families of detectors:
  • Weakly supervised object localization (WSOL) using class activation maps;
  • Transformer-based detection with DINO-DETR;
  • The YOLO evolution of one-stage detectors.

4.1.1. Weakly Supervised Baselines (WSOL)

As a low-annotation-cost reference, we adopt the weakly supervised object localization (WSOL) framework of Choe et al. [29] and its public implementation (https://github.com/clovaai/wsolevaluation (accessed on 14 October 2025)). Rather than relying on a single Class Activation Mapping (CAM) baseline [52], we use several CAM-style WSOL methods implemented in this library, including vanilla CAM, Hide-and-Seek (HaS) [53], Adversarial Complementary Learning (ACoL) [54], Self-Produced Guidance (SPG) [55], Attention-based Dropout Layer (ADL) [56], and CutMix-based variants [57].
Each method is instantiated with the same set of backbones as in the original WSOL benchmark: VGG16, InceptionV3, and ResNet-50. The networks are trained with image-level supervision and produce localization score maps that highlight regions contributing to the “text” prediction. At test time, we upsample these maps to the input resolution, threshold them, apply connected-component analysis, and convert the resulting regions into bounding boxes. Because WSOL does not use box-level supervision during training, the predicted boxes are typically coarser and may merge nearby text instances; nonetheless, this family of methods provides a diverse weakly supervised baseline against which we can compare fully supervised detectors such as DINO and the YOLO variants on KORIE.
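The map-to-boxes post-processing described above can be sketched as follows; the threshold and minimum-area values are illustrative assumptions rather than the exact settings used in the benchmark.

```python
import cv2
import numpy as np

def scoremap_to_boxes(scoremap, img_w, img_h, thresh=0.35, min_area=20):
    """Upsample a WSOL localization score map, threshold it, and turn connected
    components into bounding boxes (x, y, w, h)."""
    heat = cv2.resize(scoremap.astype(np.float32), (img_w, img_h))
    mask = (heat >= thresh).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h >= min_area:          # discard spurious tiny activations
            boxes.append((x, y, w, h))
    return boxes
```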
Rationale for WSOL Inclusion
We emphasize that WSOL is not intended to serve as a competitive alternative to modern supervised text detectors. Its role in KORIE is instead to provide a lightweight, low-annotation baseline representative of scenarios where bounding box supervision is unavailable or costly. As expected, WSOL methods perform substantially worse than supervised detectors on dense, small-text layouts. These results therefore function primarily as a diagnostic lower bound on the difficulty of the dataset under weak supervision. While KORIE could additionally support stronger supervised text spotting models such as DB/DB++ or PSENet, we leave full benchmarking of these architectures to future work as the dataset continues to expand.

4.1.2. Transformer-Based Detection with DINO

To represent the family of transformer-based detectors, we employ DINO-DETR variants [28] with multi-scale deformable attention [47]. Given an input image I, a CNN or vision-transformer backbone (ResNet-50 or Swin-L) extracts multi-scale feature maps, which are fed to a transformer encoder–decoder architecture. A fixed set of object queries iteratively attend to these features to produce bounding boxes and class logits. DINO stabilizes training with denoising queries and improved label assignment, which is particularly beneficial for small, densely packed text instances.
On KORIE, we fine-tune four-scale and five-scale DINO configurations from COCO-pretrained weights, adapting the classification head to a single “text” class. The loss combines a Hungarian-matched set prediction objective with classification, ℓ1 box regression, and generalized IoU terms. We retain multi-scale features at high resolutions to better capture small Hangul glyphs and narrow item lines.

4.1.3. One-Stage Detection with YOLO

For deployment-oriented scenarios, one-stage detectors from the YOLO family remain highly attractive due to their speed and simplicity. We therefore evaluate YOLOv9, YOLOv10, and YOLOv11 [25,26,27] as strong, fully supervised baselines. These models divide the input image into a grid of feature locations and, for each location and anchor (or anchor-free equivalent), directly regress bounding box coordinates and predict class scores.
On KORIE, we fine-tune each YOLO variant from official COCO-pretrained checkpoints after adapting the detection heads to a single “text” class. We use standard compound losses combining classification, objectness, and IoU-based box regression, with minor adjustments to the input resolution and training augmentations (e.g., random scaling, cropping, and color jitter) to reflect the tall, narrow aspect ratios and low-contrast strokes of thermal receipts. Among our detectors, YOLO models serve as fast, high-precision baselines that are straightforward to integrate into real-world receipt processing systems.
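As a minimal sketch of this fine-tuning setup using the Ultralytics API, the snippet below adapts a COCO-pretrained checkpoint to a single-class dataset; the dataset config name "korie.yaml" and the hyperparameter values are assumptions for illustration, not the exact training configuration.

```python
from ultralytics import YOLO

model = YOLO("yolo11s.pt")      # COCO-pretrained checkpoint
model.train(
    data="korie.yaml",          # hypothetical dataset config with a single "text" class
    imgsz=1280,                 # larger input to accommodate tall, narrow receipts
    epochs=100,
    batch=8,
)
metrics = model.val()           # reports Box P, R, mAP@0.50, and mAP@0.50:0.95
```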
Integration into the Pipeline
All three detector types output bounding boxes in the image coordinate system. For the OCR stage, we crop image patches according to either (i) ground-truth boxes (to measure recognition difficulty in isolation) or (ii) detector-predicted boxes (to quantify the impact of localization errors). The resulting transcriptions are then fed into the information extraction component, which maps raw text and positional cues into structured fields and item records. This design allows us to separately analyze the contributions and failure modes of weakly supervised, transformer-based, and YOLO-style detectors within a unified receipt understanding pipeline.

4.2. OCR Methodology

Given detected text regions on a receipt image, the second stage of our pipeline performs optical character recognition (OCR). For each cropped text image x corresponding to a bounding box b, the goal is to predict a character sequence y = (y_1, …, y_T) over a vocabulary that includes Hangul syllables, Latin letters, digits, and common punctuation. On KORIE, each of the 17,587 OCR crops is paired with a ground-truth transcription, enabling evaluation under both (i) “perfect detection” (crops from ground-truth boxes) and (ii) realistic detection (crops from model-predicted boxes). We report character error rate (CER) and word error rate (WER).
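Both metrics are normalized Levenshtein distances, computed over characters for CER and over whitespace-separated tokens for WER; the following sketch shows the computation conceptually (the released evaluation scripts may differ in detail).

```python
def _edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via a rolling DP row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cer(ref, hyp):
    return _edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref, hyp):
    return _edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)
```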
  • Training Protocol and Fairness
To ensure a balanced comparison, all OCR models with trainable parameters were fine-tuned on the KORIE OCR subset. This includes EasyOCR, PaddleOCR, and the attention-based BiGRU recognizer. Tesseract does not support model-level fine-tuning; instead, we optimized preprocessing and single-line configuration for best performance on KORIE. This setup ensures that differences in performance reflect model architecture rather than a mismatch in training regime.
To cover common deployment practices and stronger neural baselines, we evaluate three categories of recognizers:
  • A classical engine (Tesseract/pytesseract);
  • Toolkit-based recognizers (EasyOCR and PaddleOCR), both fine-tuned on KORIE;
  • A neural attention-based BiGRU model trained on KORIE crops.

4.2.1. Classical OCR Engine (Tesseract/Pytesseract)

Tesseract with the official Korean and English language packs [30], accessed via pytesseract [31], serves as our classical baseline. Tesseract does not expose trainable weights, so fine-tuning is not possible. Instead, we optimize the preprocessing pipeline (grayscale conversion, contrast normalization, fixed-height padding) and use single-line recognition mode to reduce segmentation artifacts that are common in narrow receipt lines. This model therefore represents a strong, language-aware, but non-trainable baseline for comparison.
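A minimal sketch of this configuration is shown below: Korean plus English language packs and single-line page segmentation (PSM 7); the preprocessing is simplified relative to the full pipeline described above.

```python
import cv2
import pytesseract

def tesseract_line(crop_bgr):
    """Recognize one receipt line crop with Tesseract in single-line mode."""
    gray = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.normalize(gray, None, 0, 255, cv2.NORM_MINMAX)   # contrast normalization
    return pytesseract.image_to_string(
        gray, lang="kor+eng", config="--psm 7"                   # treat crop as one text line
    ).strip()
```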

4.2.2. Toolkit-Based OCR (EasyOCR)

EasyOCR [32] provides multilingual text recognition with a CNN backbone, sequence modeling layers, and CTC decoding. We fine-tune the Korean–English EasyOCR model on the KORIE OCR subset using the official training API. Crops are resized to a fixed height with preserved aspect ratio and padded laterally as necessary. This fine-tuning allows us to evaluate EasyOCR not only as an off-the-shelf toolkit but as a domain-adapted recognizer for Korean thermal receipts.

4.2.3. Toolkit-Based OCR (PaddleOCR)

PaddleOCR [58] is another widely used multilingual OCR toolkit. We fine-tune the PP-OCRv3 multilingual recognizer on KORIE using the standard PaddleOCR training configuration. The same crop normalization as in EasyOCR is applied. This provides a strong domain-adapted baseline that reflects the capabilities of modern OCR toolkits when trained on receipt-specific data.
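For reference, off-the-shelf inference with both toolkit recognizers looks roughly as follows; fine-tuning uses their official training APIs and configs, which are not shown here, and exact call signatures vary somewhat across library versions.

```python
import easyocr
from paddleocr import PaddleOCR

easy_reader = easyocr.Reader(["ko", "en"])     # Korean + English recognizer
paddle_reader = PaddleOCR(lang="korean")       # PP-OCR multilingual recognizer

def recognize(crop_path):
    easy_out = easy_reader.readtext(crop_path, detail=0)   # list of recognized strings
    paddle_out = paddle_reader.ocr(crop_path)               # nested [box, (text, confidence)]
    return easy_out, paddle_out
```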

4.2.4. Neural Attention-Based BiGRU Recognizer

As a fully neural baseline tailored to Korean receipts, we train an attention-based BiGRU recognizer [33,34] from scratch on the KORIE OCR subset. A convolutional feature extractor maps each input crop to a sequence of feature vectors, which are processed by bidirectional GRU layers. An additive attention decoder generates characters one by one, trained with cross-entropy loss using teacher forcing. We apply mild augmentations (horizontal scaling, small rotation, brightness/contrast jitter) to improve robustness to receipt artifacts such as thermal banding and faded ink. Beam search is used at inference to obtain the most likely sequence.
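The sketch below outlines this recognizer in PyTorch; the CNN stem, layer sizes, and single-layer attention are illustrative assumptions rather than the exact architecture. Training pairs the (B, T, vocab) logits with cross-entropy against the next gold character under teacher forcing.

```python
import torch
import torch.nn as nn

class AttnBiGRURecognizer(nn.Module):
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(                                  # crop -> feature sequence
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((1, None)),                       # collapse the height axis
        )
        self.encoder = nn.GRU(128, hidden, bidirectional=True, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.attn = nn.Linear(2 * hidden + hidden, 1)              # additive attention score
        self.decoder = nn.GRUCell(2 * hidden + hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, images, targets):
        """images: (B, 1, H, W); targets: (B, T) gold char indices (teacher forcing)."""
        feats = self.cnn(images).squeeze(2).permute(0, 2, 1)       # (B, W', 128)
        enc, _ = self.encoder(feats)                               # (B, W', 2H)
        state = enc.new_zeros(images.size(0), self.decoder.hidden_size)
        logits = []
        for t in range(targets.size(1)):
            emb = self.embed(targets[:, t])                        # previous gold character
            q = emb.unsqueeze(1).expand(-1, enc.size(1), -1)       # (B, W', H)
            alpha = torch.softmax(self.attn(torch.cat([enc, q], dim=-1)), dim=1)
            ctx = (alpha * enc).sum(dim=1)                         # attention context (B, 2H)
            state = self.decoder(torch.cat([ctx, emb], dim=-1), state)
            logits.append(self.out(state))
        return torch.stack(logits, dim=1)                          # (B, T, vocab)
```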
Evaluation Settings
Each OCR model is evaluated under two conditions: (1) GT-box OCR, in which models receive crops from ground-truth boxes to isolate recognition difficulty, and (2) detector-crop OCR, in which crops come from predicted boxes produced by the detection models in Section 4.1. The resulting transcriptions—both perfect- and detection-based—are then passed to the information extraction module, allowing us to quantify the propagation of OCR errors into structured downstream tasks.

4.3. Information Extraction Methodology

The final stage of our pipeline maps OCR outputs on a receipt to a structured representation suitable for downstream analytics. Given a set of text lines {(ỹ_i, b_i)}, where ỹ_i is the OCR transcription (from either ground-truth or detector-crop settings in Section 4.2) and b_i is the corresponding bounding box, the information extraction (IE) module predicts a schema-level object containing (i) header fields and (ii) a line-item table.

4.3.1. Task Formulation

For each receipt r, we define a set of header fields
H = {merchant_name, merchant_address, merchant_phone, receipt_number, date, time, subtotal, tax, total},
together with a line-item table L_r in which each item carries attributes such as item name, quantity, unit price, and line total. In total, KORIE provides 2886 annotated header and item instances across 748 receipts. The IE problem is to map the unstructured OCR text {ỹ_i} (and optionally their spatial coordinates {b_i}) into this structured schema.

4.3.2. Input Representation and Linearization

Most IE models operate over text sequences, whereas our annotations are defined over spatially localized lines. To bridge this gap, we first sort all text lines for a receipt r in reading order (top-to-bottom, left-to-right) based on the bounding boxes {b_i}. We then construct a linearized textual context:
C_r = [HEADER] L_1 [SEP] L_2 [SEP] … [SEP] L_{M_r},
where L_i is the normalized text of the i-th line and M_r is the number of lines on receipt r. For models that accept only text, we output C_r as is; for models that can exploit layout, we optionally augment each line with its normalized coordinates (e.g., “L_i (x = 0.18, y = 0.07)”). This linearization preserves coarse layout information while presenting the receipt in a form amenable to language models.
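A simplified sketch of this linearization is given below; it sorts lines by their top-left coordinates, which approximates the reading order used in our pipeline, and appends the optional layout hints.

```python
def linearize(lines, img_w, img_h, with_layout=False):
    """lines: list of (text, (x, y, w, h)) with boxes in absolute pixels."""
    ordered = sorted(lines, key=lambda it: (it[1][1], it[1][0]))   # top-to-bottom, left-to-right
    parts = []
    for text, (x, y, w, h) in ordered:
        if with_layout:
            parts.append(f"{text} (x = {x / img_w:.2f}, y = {y / img_h:.2f})")
        else:
            parts.append(text)
    return "[HEADER] " + " [SEP] ".join(parts)
```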

4.3.3. Instruction-Following IE with Language Models

We instantiate the IE module using instruction-following decoder-only language models that take the linearized receipt context C r as input and generate the target schema as structured text. Concretely, each model receives a prompt of the following form:
You are an information extraction system for Korean retail receipts.
Given the following receipt text, extract the fields {merchant_name, date, total, …} and the list of items, each with {item_name, brand, quantity, unit, unit_price, line_total}.
Return your answer as valid JSON with these keys only.
Receipt: C_r
In this work, all instruction-following language models are evaluated strictly in the zero-shot setting (k = 0). Although the framework supports few-shot prompting, we do not use 1–3 shot examples in our experiments to ensure consistent, comparable evaluation across all models.
Multiple language models of different sizes are evaluated under this protocol; each model produces a JSON string which we parse into a structured prediction (Ĥ_r, L̂_r) for receipt r. Invalid JSON (e.g., due to truncation or hallucinated keys) is handled by a lightweight post-processor that attempts to repair brackets and remove unknown fields; if parsing still fails, the prediction for that receipt is treated as empty.
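A sketch of such a post-processor is shown below; the abbreviated key set and the crude bracket-repair suffixes are illustrative assumptions, not the exact released implementation.

```python
import json

ALLOWED_KEYS = {"merchant_name", "date", "total", "items"}   # abbreviated schema for illustration

def parse_prediction(raw):
    """Extract a JSON object from raw LLM output, drop unknown keys, fall back to empty."""
    start = raw.find("{")
    if start < 0:
        return {}
    snippet = raw[start:]
    for suffix in ("", "}", "]}", "\"}]}"):                   # try simple bracket repairs
        try:
            obj = json.loads(snippet + suffix)
            return {k: v for k, v in obj.items() if k in ALLOWED_KEYS}
        except json.JSONDecodeError:
            continue
    return {}
```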

4.3.4. Value Normalization and Alignment

Before scoring, both gold and predicted values are normalized to reduce spurious mismatches. For textual fields (e.g., merchant names, item names, brands), we lowercase Latin text, strip extra whitespace, and remove non-informative punctuation. For numeric fields (quantities, unit prices, line totals, subtotals, taxes, and totals), we remove thousands separators, normalize decimal points, and, where applicable, strip currency symbols while preserving the numeric value. Dates and times are converted to a canonical format (YYYY-MM-DD and HH:MM) when possible.
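The numeric and date rules can be sketched as follows; the date patterns and the currency-symbol set are assumptions covering common Korean receipt formats, and the full rule set lives in the released evaluation code.

```python
import re
from datetime import datetime

def normalize_number(value):
    """Strip currency symbols, thousands separators, and whitespace from a numeric field."""
    cleaned = re.sub(r"[₩원,\s]", "", str(value))
    try:
        num = float(cleaned)
        return str(int(num)) if num.is_integer() else str(num)
    except ValueError:
        return cleaned

def normalize_date(value, patterns=("%Y-%m-%d", "%Y.%m.%d", "%Y/%m/%d", "%y-%m-%d")):
    """Convert a date string to canonical YYYY-MM-DD when one of the patterns matches."""
    for p in patterns:
        try:
            return datetime.strptime(str(value).strip(), p).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return str(value).strip()
```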
For header fields, each receipt contributes at most one gold value per key, and we align predictions by field name. For line items, we treat each annotated item as a multi-field record and match predicted items to gold items using a greedy string-similarity-based alignment over item names and prices (breaking ties with minimal edit distance). Once aligned, we compare individual attributes (e.g., brand, quantity, unit_price) between matched pairs.
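The greedy alignment can be sketched with a simple best-first matching on name similarity; the similarity threshold is an illustrative assumption, and the actual scripts additionally consider prices and edit-distance tie-breaking.

```python
from difflib import SequenceMatcher

def align_items(pred_items, gold_items, min_sim=0.5):
    """Greedily match predicted to gold line items by item-name similarity."""
    candidates = sorted(
        ((SequenceMatcher(None, p.get("item_name", ""), g.get("item_name", "")).ratio(), i, j)
         for i, p in enumerate(pred_items) for j, g in enumerate(gold_items)),
        reverse=True,
    )
    pairs, used_pred, used_gold = [], set(), set()
    for sim, i, j in candidates:
        if sim < min_sim or i in used_pred or j in used_gold:
            continue
        pairs.append((pred_items[i], gold_items[j]))
        used_pred.add(i)
        used_gold.add(j)
    return pairs
```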

4.3.5. Evaluation Metrics

Following common practice in receipt IE, we report both per-field and overall metrics. For each field type c (e.g., brand, quantity, unit, unit_price, line_total, total), we compute the following:
  • Accuracy: the fraction of receipts (for header fields) or items (for line-item fields) where the normalized predicted value exactly matches the normalized ground truth;
  • F1 score: micro-averaged F1 over field instances, where a prediction is counted as a true positive if the normalized value matches the ground truth, as a false positive if it is present but incorrect, and as a false negative if the field is omitted.
For line items, we additionally report an overall score that micro-averages F1 across all item attributes, reflecting the end-to-end correctness of the extracted item table. These metrics are computed separately under two text conditions: (1) oracle-text IE, where the model consumes ground-truth transcriptions; and (2) OCR-text IE, where the model consumes OCR outputs from the different recognizers in Section 4.2. Comparing these settings allows us to quantify how OCR errors (and upstream detection errors) propagate into field-level and item-level extraction performance on Korean receipts.
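Under the true-positive, false-positive, and false-negative definitions above, the micro-averaged F1 reduces to the sketch below, where field instances are keyed by (instance id, field name) and values are already normalized; this is a conceptual summary rather than the exact evaluation script.

```python
def micro_f1(gold, pred):
    """gold, pred: dicts mapping (instance_id, field) -> normalized value."""
    tp = sum(1 for k, v in pred.items() if gold.get(k) == v)   # exact normalized match
    fp = len(pred) - tp                                        # predicted but incorrect
    fn = sum(1 for k in gold if k not in pred)                 # gold field omitted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```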

4.3.6. Error Propagation Analysis

To quantify how detection accuracy influences downstream stages, all OCR and IE experiments are conducted under two complementary conditions: (1) GT-box OCR/IE, where models receive crops extracted from ground-truth bounding boxes, and (2) detector-crop OCR/IE, where crops are taken from predicted boxes produced by each detector. The GT-box setting isolates recognition difficulty by removing localization noise, while the detector-crop setting exposes the compounded effects of missed, truncated, or merged text regions. Differences in CER/WER and IE F1 between these two settings quantify the propagation of detection errors into transcription and structured extraction.

5. Experiments

In this section we describe the experimental setup, evaluation metrics, and empirical results for the three tasks defined on KORIE: information detection (text localization), OCR, and structured information extraction. Unless otherwise stated, all models are trained and evaluated on the splits described in Section 3 and follow the methodology in Section 4.

5.1. Experimental Setup

All detection and OCR experiments were conducted on a workstation equipped with a single NVIDIA RTX 4090 GPU (24 GB VRAM). Information extraction experiments using instruction-tuned large language models were performed on NVIDIA A40 GPUs (64 GB VRAM).

5.1.1. Tasks and Metrics

We evaluate models on three tasks: information detection (text-region localization), OCR (transcription of localized text), and information extraction (mapping OCR outputs to a structured schema).
For detection, we treat all text as a single class and report standard object detection metrics. YOLO-based models are evaluated using Ultralytics’ built-in validator, which provides Box Precision (P), Recall (R), mAP@0.50, and mAP@0.50:0.95. DINO-based models are evaluated using COCO-style Average Precision (AP) at IoU thresholds from 0.10 to 0.90. WSOL baselines are evaluated using the wsolevaluation metrics, which include classification accuracy, overall localization score, and localization at fixed IoU thresholds.
For OCR, we report character error rate (CER) and word error rate (WER) over the 17,587 text crops, in two settings: OCR on ground-truth boxes (GT-box) and OCR on detector-predicted boxes (detector-crop). For information extraction, we report per-field Accuracy and micro-averaged F1 for header fields and line-item attributes, as well as an overall micro-F1 across all item attributes over the 2886 annotated IE instances.

5.1.2. Baselines

For detection, we compare three families of models:
  • WSOL baselines: CAM, HaS, ACoL, SPG, ADL, and CutMix-style methods with VGG16, InceptionV3, and ResNet-50 backbones, implemented via the WSOL evaluation framework [29];
  • Transformer-based detectors: DINO-DETR variants (4-scale and 5-scale) with ResNet-50 and Swin-L backbones [28,47];
  • YOLO family: YOLOv9, YOLOv10, and YOLOv11 one-stage detectors [25,26,27].
For OCR, we consider the following:
  • Tesseract/pytesseract with Korean and English language packs [30,31];
  • EasyOCR configured for Korean and English [32];
  • PaddleOCR, a multilingual toolkit-based recognizer;
  • Attention-based BiGRU recognizer trained on KORIE OCR crops (GT-box setting), as described in Section 4.2.
For information extraction, we evaluate instruction-following decoder-only language models strictly in the zero-shot setting, as described in Section 4.3. Models consume either ground-truth transcriptions (oracle-text IE) or OCR outputs from the recognizers above (OCR-text IE).

5.1.3. Implementation Details

All detectors are finetuned from publicly available COCO-pretrained checkpoints. For DINO and YOLO models, we adopt default training hyperparameters from the respective repositories and adjust image resolution to better accommodate tall, narrow receipts. WSOL baselines are trained using the original wsolevaluation configuration [29], with minimal changes limited to dataset loading and the number of training epochs.
For the BiGRU OCR model, we resize each crop to a fixed height and variable width (with padding), apply light geometric and photometric augmentations, and train using Adam with early stopping on the validation set. Language models for IE are run in zero-shot mode using a fixed prompt template; all outputs are parsed as JSON and post-processed as described in Section 4.3.

5.2. Detection Results on KORIE

Table 1, Table 2 and Table 3 summarize detection performance on the KORIE test set (119 receipts, 4134 text instances) for YOLO, DINO, and WSOL models, respectively.

5.2.1. YOLO Detectors

Table 1 reports YOLO results. Among fully supervised models, YOLOv11 attains the best overall performance with Box Precision 0.842, Recall 0.852, mAP@0.50 = 0.888, and mAP@0.50:0.95 = 0.762. YOLOv10 offers a strong compact baseline: despite having only 3.0M parameters and 8.1 GFLOPs, it achieves mAP@0.50 = 0.860 and mAP@0.50:0.95 = 0.751, only moderately below YOLOv11. YOLOv9 is slightly weaker (mAP@0.50 = 0.856, mAP@0.50:0.95 = 0.747) but exhibits similar precision/recall (0.832/0.832). Overall, newer YOLO generations provide measurable gains on dense, small-text receipts while retaining the speed and simplicity that make one-stage detectors attractive for deployment.

5.2.2. DINO Detectors

Table 2 presents COCO-style AP for DINO-based detectors. The four-scale ResNet-50 variant reaches AP@0.50 = 0.776 and maintains strong localization up to AP@0.90 = 0.571, indicating that many text boxes are tightly aligned with ground truth. The five-scale ResNet-50 and four-scale Swin-L variants are slightly weaker at AP@0.50 (0.754 and 0.760, respectively) and show similar behavior across IoU thresholds. In practice, DINO yields precise box shapes and handles skewed or rotated text segments well, but has higher computational cost and longer training time than the YOLO family.

5.2.3. WSOL Baselines

Table 3 reports WSOL results for ResNet-50 backbones across CAM, HaS, ACoL, SPG, ADL, and CutMix. Classification accuracies range from 69.1% (HaS) to 91.8% (ACoL), indicating that all methods learn a reasonably effective image-level “text vs. background” signal. However, localization performance is markedly lower. The best configuration (ACoL with ResNet-50) reaches 23.8% overall localization and 6.6% localization at IoU = 0.50, while other methods remain below 5% at IoU = 0.50. Many configurations effectively fail at moderate IoU thresholds, reflecting the difficulty of using image-level supervision alone for dense, small-text layouts such as Korean receipts. Similar trends are observed for VGG16 and InceptionV3 backbones.

5.3. OCR Results on KORIE

Table 4 reports CER and WER for all OCR models on the KORIE OCR subset (17,587 crops) using detector-predicted boxes from YOLOv11. Overall, toolkit-based recognizers perform best, with PaddleOCR achieving the lowest error rates.
Tesseract/pytesseract benefits from explicit Korean language packs but exhibits the highest CER among the compared methods (25.43%), and a WER of 35.26%. EasyOCR substantially improves CER (17.36%) but still suffers from word-level errors (31.43% WER), especially on lines that mix small Hangul with Latin text and digits.
The attention-based BiGRU recognizer, trained exclusively on KORIE crops, achieves a moderate CER (22.69%), lower than Tesseract but higher than the toolkit-based recognizers. However, its WER is noticeably higher (44.47%), indicating that while many individual characters are correct, word segmentation and boundary errors (e.g., merged or split tokens) lead to more frequent word-level mismatches. This highlights the sensitivity of WER to tokenization decisions under noisy OCR conditions.
PaddleOCR provides the strongest overall performance, obtaining the lowest CER (15.84%) and WER (26.73%). This suggests that its more recent multilingual training and architecture transfer well to Korean retail receipts once adapted to receipt-specific data under the fine-tuning protocol of Section 4.2.
As discussed in Section 4, OCR is evaluated under both ground-truth bounding boxes (GT-box) and detector-predicted boxes. Table 5 summarizes this comparison and quantifies how localization errors propagate into transcription quality across different OCR models.
Across models, common failure patterns include confusion among visually similar glyphs (e.g., 1/|/I), fragmentation of Hangul jamo under blur or thermal banding, and misreading of decimal and thousands separators in prices. These errors are particularly impactful for line-item and total fields in the IE task, as discussed next.

5.4. Information Extraction Results

We evaluate instruction-following language models on the item-level information extraction task described in Section 4.3, focusing on five key attributes per line item: item_name, quantity, unit_price, size_unit, and total_price. Models consume linearized receipt text with layout hints and are prompted to output a JSON structure containing these fields for each item. All results reported here use OCR text (from our default OCR configuration) rather than ground-truth transcriptions, reflecting a realistic end-to-end pipeline.
Table 6 summarizes overall Accuracy and micro-averaged Precision/Recall/F1 for different instruction-tuned LLMs, evaluated in the zero-shot setting (k = 0) described in Section 4.3. The overall scores aggregate across all field instances; per-field scores for the strongest models are shown separately in Table 7.

5.4.1. Impact of Model Size

Across both Llama and Qwen families, larger models consistently outperform smaller ones. For example, within the Llama series, overall F1 increases from 8% (Llama-3.2-1B) to 18% (Llama-3.2-3B) and 25% (Llama-3.1-8B). A similar trend holds for Qwen, where Qwen2.5-3B and Qwen2.5-7B substantially improve over Qwen2.5-1.5B. This indicates that larger models are better able to cope with OCR noise, mixed scripts, and the loosely structured nature of receipt text.

5.4.2. Item Fields vs. Numeric Fields

Per-field scores reveal a strong asymmetry between textual and numeric attributes. Table 7 reports Accuracy and F1 for the three strongest models. For Llama-3.1-8B-Instruct, item_name and quantity achieve moderate performance (item_name Accuracy 52.16%, F1 ≈ 55%; quantity Accuracy 62.08%, F1 ≈ 66%), while unit_price, size_unit, and total_price are essentially never predicted correctly (near-zero F1). Qwen2.5-3B-Instruct shows a similar pattern with slightly higher scores on item_name and quantity, and Qwen2.5-7B-Instruct performs best on item_name but remains weak on price fields.
These results indicate that current models are relatively effective at identifying item descriptions and integer-like quantities, but struggle to recover monetary values and units. Crucially, this difficulty stems not only from OCR noise but also from a pronounced domain mismatch: instruction-tuned LLMs are not trained on Korean receipt text, small-font Hangul, thermal-printing artifacts, or Korean currency formatting conventions (e.g., thousands separators and “원”). As a result, even when OCR correctly recognizes digits, LLMs may fail to normalize or align price fields in the required schema. Because our evaluation does not isolate OCR-specific errors for numeric fields, the observed failures should be interpreted as a combination of upstream OCR noise and the lack of domain-adapted Korean training data in current LLMs.

5.4.3. Best-Performing Models and Remaining Gap

By overall F1, the strongest models on KORIE IE are Llama-3.1-8B-Instruct and Qwen2.5-3B-Instruct (both around 25% F1, with overall Accuracy ≈ 22–23%), followed closely by Qwen2.5-7B-Instruct (24% F1). These models provide a useful starting point for receipt IE but still leave substantial headroom, especially on price-related fields that are critical for downstream applications such as expense management and analytics.

5.4.4. Error Characteristics and Pipeline Effects

The low performance on unit_price and total_price reflects a combination of factors: OCR misreading digits or decimal/thousands separators, subtle layout cues (e.g., alignment of price columns) that are lost in linearization, and the need for precise numerical matching in our evaluation. In contrast, item_name and quantity can tolerate some lexical variation and are easier for language models to recover from context.
Comparing these IE results with OCR error rates in Section 5.3 underscores how upstream detection and recognition errors propagate into structured outputs, especially for numerically sensitive fields. Even with relatively strong OCR (e.g., PaddleOCR with CER 15.84%, WER 26.73%), end-to-end item-level F1 remains around 25%, highlighting an open opportunity for tighter integration of visual models, OCR, and structured numerical reasoning. Overall, KORIE exposes a challenging but realistic regime for information extraction from Korean receipts, where even strong instruction-tuned LLMs struggle with price-related attributes under realistic noise.

5.4.5. Role of Domain Mismatch

General-purpose LLMs are typically trained on web-scale text corpora dominated by clean digital Korean, with limited exposure to small-font Hangul, mixed-script retail codes, thermally degraded numeric strings, or receipt-style tabular layouts. This mismatch significantly limits their zero-shot extraction capability on KORIE. Future work will include domain-adapted fine-tuning and an isolated evaluation of numeric OCR accuracy to more precisely characterize error sources.

6. Conclusions and Future Work

We introduced KORIE, a Korean-language receipt benchmark for text detection, OCR, and information extraction under realistic acquisition conditions. The dataset comprises 748 scanned thermal receipts with bounding-box annotations, 17,587 OCR crops, and 2886 item- and field-level IE instances, focusing on dense, small text, mixed Hangul–Latin scripts, and artifacts such as blur, skew, and thermal banding. We benchmarked weakly supervised WSOL methods, transformer-based DINO detectors, and YOLOv9/v10/v11, as well as OCR engines (Tesseract/pytesseract, EasyOCR, PaddleOCR, and an attention-based BiGRU) and zero-shot instruction-tuned LLMs (Llama 3.x and Qwen2.5) for IE. YOLOv11 and PaddleOCR emerge as strong detection/OCR baselines, while zero-shot LLMs achieve only modest item-level F1 (around 25%) and largely fail on price-related fields, reflecting both compounded detection–OCR–IE errors and a strong domain mismatch, as current LLMs are not trained on noisy Korean receipt text or thermal-print numeric conventions.
Our study has several limitations. KORIE is moderate in scale and focused on Korean retail receipts scanned at 300 dpi, so results may not fully transfer to other document types, languages, or more extreme scanning conditions. The baseline suite, while diverse, is not exhaustive, and IE is evaluated only in a zero-shot setting with strict exact matching for numeric fields, which does not credit near-misses or partially correct values.
Future work includes enlarging and diversifying the dataset (e.g., more stores, devices, and degraded receipts), enriching annotations with higher-level semantics (such as vendor categories and discount structures), and exploring few-shot and fully fine-tuned IE models, layout-aware architectures, and OCR-free or multi-task approaches that more tightly couple visual, textual, and numerical reasoning. We release KORIE together with baseline models, OCR crops, IE annotations, and evaluation code to support reproducible research on Korean receipt understanding and to encourage further advances in small-text detection, Hangul-aware recognition, and robust information extraction.

Author Contributions

Conceptualization, M.S.K. and M.M.; Methodology, M.S.K., M.M., and M.A.; Software, M.S.K., M.F.S., and M.A.; Validation, M.S.K. and H.S.K.; Formal analysis, M.S.K. and M.F.S.; Investigation, H.S.K.; Resources, H.S.K.; Data curation, M.S.K.; Writing—original draft, M.S.K.; Writing—review and editing, M.S.K. and H.S.K.; Visualization, H.S.K.; Supervision, H.S.K.; Project administration, H.S.K.; Funding acquisition, H.S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grants funded by the Korean government (Ministry of Science and ICT, MSIT) (RS-2023-NR076833) and partly by the Regional Innovation System & Education (RISE) program through the Chungbuk Regional Innovation System & Education Center, funded by the Ministry of Education (MOE) and Chungcheongbuk-do, Republic of Korea (2025-RISE-11-014-03).

Data Availability Statement

The original data presented in the study are openly available on GitHub at https://github.com/MahmoudSalah/KORIE.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Examples of images in the scanned KORIE dataset.
Figure 2. Examples of images in the mobile-capture KORIE dataset.
Figure 3. Overview of the KORIE processing pipeline. Stage 1 performs text detection and localization (YOLO, DINO, WSOL) to produce bounding boxes and cropped text regions. Stage 2 applies OCR engines (Tesseract, EasyOCR, PaddleOCR, BiGRU) to generate raw text strings. Stage 3 uses instruction-tuned or fine-tuned language models to extract structured information such as merchant fields, dates, totals, and item-level records.
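As a rough illustration of how the three stages in Figure 3 fit together, the sketch below chains hypothetical detect_text_regions, recognize_crop, and extract_fields helpers standing in for the detector, OCR engine, and LLM-based IE model; none of these names come from the released code, and a PIL-style image interface is assumed.

```python
# Illustrative glue code for the three-stage pipeline in Figure 3. The helper functions
# passed in (detect_text_regions, recognize_crop, extract_fields) are hypothetical stand-ins
# for a detector such as YOLOv11, an OCR engine such as PaddleOCR, and an instruction-tuned
# LLM; a PIL-style image with a .crop() method is assumed.

def run_pipeline(receipt_image, detect_text_regions, recognize_crop, extract_fields):
    # Stage 1: detection -> bounding boxes, then cropped text regions
    boxes = detect_text_regions(receipt_image)
    crops = [receipt_image.crop(box) for box in boxes]

    # Stage 2: OCR -> one raw text string per crop
    lines = [recognize_crop(crop) for crop in crops]

    # Stage 3: IE -> structured fields (merchant, date, totals, item-level records)
    return extract_fields("\n".join(lines))
```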
Table 1. YOLO detection performance on the KORIE test set (119 receipts, 4134 text instances). Metrics are Box Precision (P), Recall (R), mAP@0.50, and mAP@0.50:0.95 reported by the Ultralytics validator.

Model     P      R      mAP@0.50   mAP@0.50:0.95
YOLOv9    0.832  0.832  0.856      0.747
YOLOv10   0.837  0.836  0.860      0.751
YOLOv11   0.842  0.852  0.888      0.762
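For reference, a minimal sketch of obtaining these metrics with the Ultralytics validator is shown below; the checkpoint path and dataset YAML name are placeholders rather than files shipped with KORIE.

```python
# Minimal sketch of running the Ultralytics validator used for Table 1.
# The checkpoint path and the dataset YAML name are assumed placeholders.
from ultralytics import YOLO

model = YOLO("runs/korie_yolo11/weights/best.pt")      # trained KORIE detector (assumed path)
metrics = model.val(data="korie.yaml", split="test")   # YAML pointing at the 119-receipt test split

print(f"P={metrics.box.mp:.3f}  R={metrics.box.mr:.3f}  "
      f"mAP@0.50={metrics.box.map50:.3f}  mAP@0.50:0.95={metrics.box.map:.3f}")
```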
Table 2. DINO detection performance on the KORIE test set. We report COCO-style Average Precision (AP) at different IoU thresholds.

Model                     AP@0.10  AP@0.20  AP@0.30  AP@0.40  AP@0.50  AP@0.60  AP@0.70  AP@0.80  AP@0.90
DINO 4-scale, ResNet-50   0.815    0.797    0.787    0.776    0.776    0.772    0.771    0.757    0.571
DINO 5-scale, ResNet-50   0.799    0.779    0.768    0.756    0.754    0.753    0.749    0.731    0.535
DINO 4-scale, Swin-L      0.799    0.781    0.774    0.763    0.760    0.757    0.755    0.733    0.613
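The per-threshold AP values above follow COCO-style evaluation; a sketch of reproducing such numbers with pycocotools is given below, with the annotation and prediction file names as placeholders.

```python
# Sketch of computing COCO-style AP at individual IoU thresholds (as in Table 2)
# with pycocotools; the annotation and prediction file names are placeholders.
import numpy as np
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("korie_test_coco.json")               # ground-truth boxes in COCO format
coco_dt = coco_gt.loadRes("dino_predictions.json")   # detector outputs in COCO result format

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.params.iouThrs = np.linspace(0.10, 0.90, 9)  # the nine thresholds in Table 2
evaluator.evaluate()
evaluator.accumulate()

# precision has shape [iouThrs, recThrs, classes, areaRngs, maxDets];
# averaging valid entries over recall points and classes gives AP per IoU threshold.
precision = evaluator.eval["precision"][:, :, :, 0, -1]
for thr, prec in zip(evaluator.params.iouThrs, precision):
    valid = prec[prec > -1]
    print(f"AP@{thr:.2f} = {valid.mean():.3f}" if valid.size else f"AP@{thr:.2f} = n/a")
```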
Table 3. WSOL detection performance on KORIE using the wsolevaluation framework with a ResNet-50 backbone. “Cls” is classification accuracy (%), “Loc” is overall localization score (%), and Loc@IoU-τ is localization at threshold τ (%).

Method   Cls (%)  Loc (%)  Loc@IoU-0.10  Loc@IoU-0.20  Loc@IoU-0.50
CAM      86.5     13.5     52.5          31.1          1.6
HaS      69.1     9.0      39.3          18.0          3.3
ACoL     91.8     23.8     72.1          57.4          6.6
SPG      89.4     16.4     59.0          39.3          3.3
ADL      76.7     15.0     57.4          29.5          3.3
CutMix   72.8     11.3     45.9          23.0          1.6
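The Loc@IoU-τ columns count a localization as correct when the estimated box reaches an IoU of at least τ with the ground truth. The simplified single-box sketch below illustrates the idea; the full wsolevaluation protocol involves additional details (e.g., score-map thresholding) not shown here.

```python
# Simplified illustration of the Loc@IoU-τ style metric in Table 3: a localization
# counts as correct when the estimated box overlaps the ground-truth box with IoU >= τ.
# This is a single-box sketch, not the full wsolevaluation protocol.

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def loc_at_tau(pred_boxes, gt_boxes, tau):
    """Percentage of images whose predicted box reaches IoU >= tau with the ground truth."""
    hits = sum(iou(p, g) >= tau for p, g in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / max(len(gt_boxes), 1)
```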
Table 4. OCR performance on the KORIE OCR subset (17,587 crops) using detector-predicted boxes from YOLOv11. Metrics are character error rate (CER) and word error rate (WER); lower is better.

Model                   CER (%)  WER (%)
Tesseract/pytesseract   25.43    35.26
EasyOCR                 17.36    31.43
Attention-based BiGRU   22.69    44.47
PaddleOCR               15.84    26.73
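CER and WER in Tables 4 and 5 are edit-distance-based rates; a self-contained sketch is shown below, noting that the tokenization and normalization details of the released evaluation code may differ.

```python
# Sketch of CER/WER as used in Tables 4 and 5: CER is the character-level Levenshtein
# distance divided by the reference length, and WER is the same rate over whitespace
# tokens. Normalization details of the released evaluation code may differ.

def edit_distance(ref, hyp):
    """Dynamic-programming Levenshtein distance between two sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (r != h))    # substitution (0 cost if symbols match)
            prev = cur
    return dp[-1]

def cer(reference, hypothesis):
    return 100.0 * edit_distance(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    ref_tokens, hyp_tokens = reference.split(), hypothesis.split()
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)

if __name__ == "__main__":
    print(cer("아메리카노 2,500원", "아메리카노 2.500원"))  # one substituted character
```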
Table 5. Error propagation analysis: comparison of OCR performance under ground-truth bounding boxes (GT-box) versus detector-predicted boxes (YOLOv11). Lower CER/WER indicates better performance.

Model       GT-Box CER (%)  GT-Box WER (%)  Detector-Crop CER (%)  Detector-Crop WER (%)
Tesseract   18.92           27.14           25.43                  35.26
EasyOCR     11.45           20.03           17.36                  31.43
BiGRU       16.88           29.52           22.69                  44.47
PaddleOCR   11.72           20.85           15.84                  26.73
Table 6. Information extraction performance on KORIE line-item fields using OCR text. We report overall Accuracy and micro-averaged Precision, Recall, and F1 across all item attributes.

Model                   Accuracy (%)  Precision (%)  Recall (%)  F1 (%)
Llama-3.2-1B-Instruct   5.58          17.0           6.0         8.0
Llama-3.2-3B-Instruct   16.86         19.0           17.0        18.0
Qwen2.5-1.5B-Instruct   17.20         18.0           17.0        18.0
Qwen2.5-7B-Instruct     21.70         26.0           22.0        24.0
Llama-3.1-8B-Instruct   22.86         28.0           23.0        25.0
Qwen2.5-3B-Instruct     23.16         27.0           23.0        25.0
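Micro-averaging here pools true positives, false positives, and false negatives over all item attributes before computing precision, recall, and F1. The sketch below illustrates one such convention, assuming exact string matching and a simple treatment of empty fields; the released evaluation code may differ in these details.

```python
# Sketch of micro-averaged precision/recall/F1 over item attributes (Table 6). A predicted
# value counts as a true positive only if it exactly matches the gold value; the precise
# handling of empty or missing fields is an assumption for illustration.

def micro_prf(pred_items, gold_items, fields):
    tp = fp = fn = 0
    for pred, gold in zip(pred_items, gold_items):
        for field in fields:
            p = str(pred.get(field, "")).strip()
            g = str(gold.get(field, "")).strip()
            if p and p == g:
                tp += 1
            else:
                if p:
                    fp += 1   # predicted but wrong or spurious
                if g:
                    fn += 1   # gold value missed or mispredicted
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1
```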
Table 7. Per-field IE performance (Accuracy/F1) for the three strongest models on KORIE. F1 values are approximated from the reported scores (e.g., 0.55 → 55%).

Field         Llama-3.1-8B  Qwen2.5-3B  Qwen2.5-7B
item_name     52.16/55      53.11/53    61.41/62
quantity      62.08/66      62.63/63    47.05/50
unit_price    0.08/0        0.00/0      0.04/0
size_unit     0.00/0        0.00/0      0.00/0
total_price   0.00/0        0.08/0      0.00/0
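As a rough illustration of the zero-shot setup behind Tables 6 and 7, the sketch below prompts an instruction-tuned model through the Hugging Face transformers text-generation pipeline. The prompt wording, model identifier, and decoding settings are assumptions for illustration, not the exact configuration used in the paper.

```python
# Sketch of zero-shot line-item extraction with an instruction-tuned LLM. The prompt text,
# model identifier, and decoding settings are illustrative assumptions only.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-3B-Instruct")

PROMPT = (
    "Extract the line items from the following Korean receipt text. "
    "Return a JSON list in which each item has item_name, quantity, unit_price, "
    "size_unit, and total_price.\n\nReceipt text:\n{ocr_text}\n\nJSON:"
)

def extract_items(ocr_text, max_new_tokens=256):
    output = generator(
        PROMPT.format(ocr_text=ocr_text),
        max_new_tokens=max_new_tokens,
        return_full_text=False,   # return only the generated continuation, not the prompt
    )
    return output[0]["generated_text"]
```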
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
