1. Introduction
Retail receipts are compact, visually noisy documents that mix dense, small text with variable layouts and real-world thermal-printing artifacts. In Korean receipts, these difficulties are amplified by Hangul syllable blocks, frequent code-switching with Latin characters and numerals, and domain conventions such as currency symbols and vendor-specific formatting. Despite substantial progress in scene text detection and optical character recognition (OCR), Korean retail receipts remain underrepresented in public benchmarks, limiting reproducible evaluation of both visual front-end quality and downstream information extraction.
Across computer vision and adjacent domains, deep learning has advanced through both architectural innovation and practical training strategies. Work on activation functions and optimization has improved gradient flow and convergence in deep convolutional neural networks (CNNs) [1], while transfer learning from large-scale pretraining has become a default lever for data-scarce tasks [2,3,4]. Similar methodologies drive impact in medical imaging (diagnosis, detection, and classification) [5,6,7,8], natural language-assisted systems [9,10,11,12], applied machine learning [13,14], customer segmentation [15,16], intrusion detection for IoT [17,18], and occlusion removal [19]. These gains, enabled by powerful deep CNNs, scalable GPUs, and improved training curricula, yield strong performance in generic text detection and recognition. However, fine-grained document scenarios such as receipts remain challenging due to dense, small text; mixed scripts; layout irregularities; and the need to map raw text into structured key–value fields and item records.
Public datasets have shaped research on document images, but each covers a different slice of the problem.
SROIE standardized text localization, recognition, and key information extraction on scanned English receipts [20], catalyzing evaluation but with relatively clean images and a narrow linguistic scope. CORD broadened coverage with non-English (Indonesian) receipts and richer annotations for post-OCR parsing [21]. Multilingual form datasets such as XFUND target key–value understanding across languages but focus on forms rather than retail receipts [22]. More recently, mobile-captured Japanese receipt corpora emphasize in-the-wild imaging conditions and mixed scripts [23], while AMuRD introduces a large multilingual (Arabic/English) collection oriented toward downstream extraction and item classification [24]. Together, these resources advance linguistic diversity, realism, and information extraction, yet there is still no focused Korean-language benchmark that jointly supports text detection, OCR, and structured receipt understanding.
Against this backdrop, retail receipts are a particularly exacting testbed for end-to-end receipt understanding: the physical degradation of thermal paper yields low-contrast strokes, significant ink fading, and printer head banding, while mechanical handling introduces creases that disrupt dense layouts interleaving small Hangul, Latin characters, and numerals in tight columns [20,21]. To rigorously measure progress across the stages of a receipt pipeline, we present KORIE, a Korean receipt benchmark that spans text detection, OCR, and information extraction under realistic digitization conditions. We provide bounding-box annotations and transcriptions and benchmark diverse detectors (YOLOv9/10/11, DINO 4/5-scale with ResNet-50, Swin-L, and a WSOL-style baseline) [25,26,27,28,29] paired with OCR engines (Tesseract, pytesseract, EasyOCR, and a CRNN/attention-style BiGRU recognizer) [30,31,32,33,34]. Using standardized metrics (mAP for detection and CER/WER for OCR), we report results both on ground-truth boxes and on detector crops, enabling fair, reproducible comparisons and clear attribution of error to localization versus recognition [35]. On top of these, we define an information extraction track over key–value fields and item-level records, allowing us to study how visual errors propagate to structured outputs.
KORIE comprises 748 Korean retail receipts captured under real-world conditions (thermal banding, folds, skew, blur, glare, compression). It provides bounding-box annotations and transcriptions to quantify performance on detection and OCR, and additionally offers structured annotations for key information extraction. From these receipts, we derive 17,587 box-level text images for OCR experiments and 2886 labeled entities and item records for information extraction, covering merchant names, dates, receipt numbers, item descriptions, quantities, prices, and totals. We evaluate YOLOv9/10/11, DINO (4/5-scale, ResNet-50), Swin-L, and a WSOL baseline alongside OCR engines Tesseract, pytesseract, EasyOCR, and a neural Attention-Gated BiGRU. We report mAP for detection and CER/WER for OCR in two settings—(1) OCR on ground-truth boxes (to isolate recognition quality) and (2) OCR on detector crops (to expose compounding errors)—and use the resulting transcriptions as inputs to the information extraction task.
While KORIE enables end-to-end evaluation, it is designed to keep the stages of detection, OCR, and information extraction analytically separable. By providing fine-grained box-level and field-level annotations, the benchmark allows practitioners to measure progress on the core visual tasks that underpin receipt pipelines and to disentangle visual failures from semantic extraction errors. Our analyses highlight persistent challenges specific to Korean receipts—small-font Hangul recognition, jamo fragmentation under blur, confusions among visually similar glyphs (e.g., 1/|/I), and their downstream impact on key fields such as totals and dates—providing a clear target for future model and augmentation design. We release dataset splits, OCR crops, information extraction labels, and evaluation scripts to facilitate reproducible, apples-to-apples comparisons on this important yet under-served domain.
Our work makes the following contributions:
We introduce KORIE, a Korean-language receipt dataset comprising 748 scanned thermal receipts with bounding-box annotations and transcriptions, specifically designed to stress dense, small text; mixed scripts; and real-world imaging artifacts.
We define a multi-task benchmark covering three foundational stages of receipt understanding—text detection, OCR, and information extraction. We report mAP for detection, CER/WER for recognition under two settings (OCR on ground-truth boxes and on detector crops), and field-level metrics for extracting key–value fields and item records, enabling clear attribution of error across the pipeline.
We provide a large-scale OCR subset with 17,587 box-level text images and an information extraction subset with 2886 annotated entities and item lines, covering merchant names, dates, receipt numbers, line items, quantities, prices, and totals, thereby supporting both visual and semantic evaluation.
We benchmark a diverse suite of detectors (YOLOv9/10/11, DINO 4/5-scale with ResNet-50, Swin-L, and WSOL-style baselines) and OCR engines (Tesseract, pytesseract, EasyOCR, and an attention-based BiGRU recognizer), and analyze failure modes specific to Korean receipts—such as small-font Hangul degradation, jamo fragmentation, and visually similar glyph confusions—releasing dataset splits and code to facilitate future work on Hangul-aware detection, recognition, and extraction under realistic acquisition conditions.
3. Dataset
3.1. Ethical Considerations and Data Collection
All receipts in KORIE were collected by the authors from restaurants, supermarkets, and retail stores across South Korea. The dataset combines two acquisition modes. The initial portion of the corpus was digitized using an HP Color LaserJet MFP scanner at 300 dpi; although scanned, these receipts already exhibit substantial real-world degradation inherited from the thermal paper itself: ink fading, head band streaking, friction wear, partial creasing, and locally washed-out characters. These artifacts originate from the physical condition of the receipt rather than the digitization process and therefore remain visible even under flatbed scanning.
Figure 1 illustrates some examples of the scanned subset. In addition, KORIE includes a growing subset of mobile-captured receipts, which introduce genuine in-the-wild imaging variations such as blur, shadows, glare, perspective skew, and uneven illumination; Figure 2 illustrates some examples of this mobile-captured portion. The dataset is designed as a hybrid and continuously expanding benchmark, and all references to “challenging” or “real-world” conditions pertain to the physical thermal artifacts and the mobile-captured portion of the corpus.
To protect privacy, we adopted a four-step de-identification pipeline before any annotation took place. First, annotators conducted a line-by-line review of each receipt image to identify potentially sensitive content (e.g., customer identifiers, full card numbers, or loyalty IDs). Second, such fields were manually obscured or removed at the image level. Third, a second annotator verified that no obvious personal identifiers remained. Finally, an independent cross-check was performed on a random subset of receipts to ensure adherence to the privacy guidelines. Only de-identified images were admitted into the final corpus and used for annotation or experiments.
3.2. Annotation Protocol
We established detailed annotation guidelines tailored to Korean retail receipts, covering merchant names, item descriptions, prices, dates, and other transactional fields. The guidelines included specialized protocols for South Korean conventions, such as local date formats, currency notation in Korean won, and mixed-script merchant names that combine Hangul and Latin characters. Ambiguous cases (e.g., unconventional abbreviations or visually degraded text) were documented in a shared guideline and resolved through consensus among annotators.
All receipts were annotated using the MakeSense platform. Annotators drew bounding boxes around textual regions and exported them in both YOLO and COCO formats to support a wide range of detection pipelines. The YOLO annotations follow the standard normalized representation, while the COCO annotations store coordinates in absolute pixel units. For OCR tasks, we developed a custom annotation system that maintains the positional integrity and reading order of Korean text, ensuring that line- and region-level transcriptions can be reliably associated with their corresponding bounding boxes.
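To make the two export formats concrete, the following sketch converts a YOLO-normalized box into COCO absolute pixel coordinates; the function and example values are illustrative rather than part of our annotation tooling.

```python
def yolo_to_coco(x_c, y_c, w, h, img_w, img_h):
    """Convert a YOLO box (normalized center x/y, width, height)
    to a COCO box (absolute top-left x/y, width, height)."""
    abs_w, abs_h = w * img_w, h * img_h
    x_min = x_c * img_w - abs_w / 2.0
    y_min = y_c * img_h - abs_h / 2.0
    return [x_min, y_min, abs_w, abs_h]

# Example: a text line annotated on a hypothetical 600x1400 receipt scan.
print(yolo_to_coco(0.50, 0.07, 0.80, 0.02, img_w=600, img_h=1400))
# -> [60.0, 84.0, 480.0, 28.0]
```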
Beyond generic text regions, we also defined item-specific and field-specific annotations to capture semantics that are particularly relevant for retail receipts. For each item line, annotators labeled the item name, a coarse item classification, quantity, unit or packaging information, and unit/total prices; brand names were recorded when explicitly present on the receipt. At the header and footer levels, we annotated key fields such as merchant name, merchant address, merchant phone number, receipt number, dates and times, subtotals, taxes, and overall totals. These richer annotations were iteratively refined through feedback loops with domain experts familiar with Korean retail formats and form the basis of the information extraction track in KORIE.
3.3. Statistics and Splits
The KORIE dataset comprises 748 Korean retail receipts. Each receipt typically contains a dense mix of small-font Hangul, Latin characters, and numerals, along with store headers, line-item lists, subtotals/totals, tax information, and footer messages. Real-world artifacts such as folds, skew, blur, glare, and thermal banding are prevalent due to in-the-wild capture and aging of the printed receipts.
For OCR-focused experiments, we extracted text-region crops from the annotated receipts, yielding a total of 17,587 word-level images used as OCR instances. Each crop is paired with its transcription and inherits its spatial coordinates from the underlying detection annotations, enabling evaluation both in a “perfect detection” setting (using ground-truth boxes) and in a realistic setting, where crops come from model-predicted boxes.
For information extraction, we annotated 2886 field and item instances spanning merchant-level attributes (e.g., merchant name, receipt number, merchant phone number), temporal and transactional fields (dates, times, subtotals, taxes, totals), and line-item attributes (item names, quantities, unit prices, line totals, and coarse item categories). These labels are linked to their corresponding text regions and receipts, allowing us to study extraction performance under different combinations of ground-truth versus predicted boxes and OCR outputs.
To support fair and reproducible evaluation across all three tasks, we partition KORIE into non-overlapping train, validation, and test splits at the receipt level, so that no receipt appears in more than one split and no text crop or information extraction instance from a given receipt leaks across splits. All bounding boxes, OCR crops, and IE annotations are assigned to splits according to their parent receipt. We adopt a roughly 60/20/20 split for train/validation/test in our experiments; future work can reuse these splits directly or re-partition the corpus depending on the target application, as long as the receipt-level isolation is preserved.
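A minimal sketch of the receipt-level partitioning described above; the 60/20/20 ratios follow the text, while the random seed and helper name are illustrative.

```python
import random

def split_by_receipt(receipt_ids, ratios=(0.6, 0.2, 0.2), seed=0):
    """Partition receipt IDs into train/val/test so that every crop or
    IE instance inherits the split of its parent receipt (no leakage)."""
    ids = sorted(receipt_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }

splits = split_by_receipt(range(748))
# Every box, OCR crop, and IE record is then assigned to the split of its parent receipt.
```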
Dataset Scale and Ongoing Expansion
Although the current release of KORIE contains 748 receipts, it provides 17,587 OCR crops and 2886 field- and item-level IE annotations, enabling fine-grained evaluation at the region and entity level. KORIE is an actively expanding benchmark: we are continuously adding newly collected mobile-captured receipts, and future public releases will substantially increase the number and diversity of store types, layouts, and acquisition conditions. Because the benchmark is designed for diagnostic analysis rather than leaderboard-scale ranking, small differences in mAP between closely performing detectors (e.g., YOLOv10 vs. YOLOv11) should not be overinterpreted. Future versions will additionally provide bootstrapped confidence intervals to support more rigorous statistical comparison across models.
4. Methodology
Our receipt-understanding pipeline consists of three sequential stages: (1) text detection and localization; (2) optical character recognition (OCR); and (3) information extraction (IE). Given an input receipt image, Stage 1 predicts bounding boxes for all textual regions, Stage 2 transcribes each cropped region into raw text, and Stage 3 maps these transcriptions into a structured representation containing merchant-level fields and item-level attributes. This modular design allows each stage of the pipeline to be evaluated independently as well as jointly.
Figure 3 provides an overview of the full processing workflow.
4.1. Information Detection
The first stage of our pipeline is the localization of text regions on Korean receipts. Given an input receipt image I, the goal is to predict a set of bounding boxes that tightly cover textual content (headers, item lines, totals, and footer messages) while remaining robust to blur, skew, and thermal artifacts. On KORIE, we treat detection as a single-class problem (“text”) and evaluate using mean Average Precision (mAP) at multiple IoU thresholds, following standard object detection practice.
To study how different modeling paradigms behave on dense, small-text receipts, we consider three families of detectors:
Weakly supervised object localization (WSOL) using class activation maps;
Transformer-based detection with DINO-DETR;
The YOLO evolution of one-stage detectors.
4.1.1. Weakly Supervised Baselines (WSOL)
As a low-annotation-cost reference, we adopt the weakly supervised object localization (WSOL) framework of Choe et al. [29] and its public implementation (https://github.com/clovaai/wsolevaluation (accessed on 14 October 2025)). Rather than relying on a single Class Activation Mapping (CAM) baseline [52], we use several CAM-style WSOL methods implemented in this library, including vanilla CAM, Hide-and-Seek (HaS) [53], Adversarial Complementary Learning (ACoL) [54], Self-Produced Guidance (SPG) [55], Attention-based Dropout Layer (ADL) [56], and CutMix-based variants [57].
Each method is instantiated with the same set of backbones as in the original WSOL benchmark: VGG16, InceptionV3, and ResNet-50. The networks are trained with image-level supervision and produce localization score maps that highlight regions contributing to the “text” prediction. At test time, we upsample these maps to the input resolution, threshold them, apply connected-component analysis, and convert the resulting regions into bounding boxes. Because WSOL does not use box-level supervision during training, the predicted boxes are typically coarser and may merge nearby text instances; nonetheless, this family of methods provides a diverse weakly supervised baseline against which we can compare fully supervised detectors such as DINO and the YOLO variants on KORIE.
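The map-to-box conversion can be sketched as follows, assuming a CAM-style score map from one of the WSOL methods; the threshold value and the OpenCV-based connected-component step are illustrative choices rather than the exact wsolevaluation settings.

```python
import cv2
import numpy as np

def score_map_to_boxes(score_map, image_size, threshold=0.3):
    """Upsample a WSOL localization score map, threshold it, and return
    connected-component bounding boxes in (x, y, w, h) pixel coordinates."""
    h, w = image_size
    heat = cv2.resize(score_map.astype(np.float32), (w, h))
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    mask = (heat >= threshold).astype(np.uint8)
    num, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    # stats[0] is the background component; keep the rest as candidate text boxes.
    return [tuple(stats[i, :4]) for i in range(1, num)]
```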
Rationale for WSOL Inclusion
We emphasize that WSOL is not intended to serve as a competitive alternative to modern supervised text detectors. Its role in KORIE is instead to provide a lightweight, low-annotation baseline representative of scenarios where bounding box supervision is unavailable or costly. As expected, WSOL methods perform substantially worse than supervised detectors on dense, small-text layouts. These results therefore function primarily as a diagnostic lower bound on the difficulty of the dataset under weak supervision. While KORIE could additionally support stronger supervised text spotting models such as DB/DB++ or PSENet, we leave full benchmarking of these architectures to future work as the dataset continues to expand.
4.1.2. Transformer-Based Detection with DINO
To represent the family of transformer-based detectors, we employ DINO-DETR variants [28] with multi-scale deformable attention [47]. Given an input image I, a CNN or vision-transformer backbone (ResNet-50 or Swin-L) extracts multi-scale feature maps, which are fed to a transformer encoder–decoder architecture. A fixed set of object queries iteratively attends to these features to produce bounding boxes and class logits. DINO stabilizes training with denoising queries and improved label assignment, which is particularly beneficial for small, densely packed text instances.
On KORIE, we fine-tune four-scale and five-scale DINO configurations from COCO-pretrained weights, adapting the classification head to a single “text” class. The loss combines a Hungarian-matched set prediction objective with classification, box regression, and generalized IoU terms. We retain multi-scale features at high resolutions to better capture small Hangul glyphs and narrow item lines.
4.1.3. One-Stage Detection with YOLO
For deployment-oriented scenarios, one-stage detectors from the YOLO family remain highly attractive due to their speed and simplicity. We therefore evaluate YOLOv9, YOLOv10, and YOLOv11 [25,26,27] as strong, fully supervised baselines. These models divide the input image into a grid of feature locations and, for each location and anchor (or anchor-free equivalent), directly regress bounding box coordinates and predict class scores.
On KORIE, we fine-tune each YOLO variant from official COCO-pretrained checkpoints after adapting the detection heads to a single “text” class. We use standard compound losses combining classification, objectness, and IoU-based box regression, with minor adjustments to the input resolution and training augmentations (e.g., random scaling, cropping, and color jitter) to reflect the tall, narrow aspect ratios and low-contrast strokes of thermal receipts. Among our detectors, YOLO models serve as fast, high-precision baselines that are straightforward to integrate into real-world receipt processing systems.
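With the Ultralytics API, fine-tuning a YOLO variant on KORIE reduces to a few lines; the dataset YAML name, image size, and epoch count below are placeholders rather than our exact settings.

```python
from ultralytics import YOLO

# Start from a COCO-pretrained checkpoint; the head is adapted to a single
# "text" class via the dataset YAML (e.g., names: ["text"]).
model = YOLO("yolo11s.pt")

model.train(
    data="korie.yaml",   # hypothetical dataset config with train/val image paths
    imgsz=1280,          # larger input side to preserve small Hangul strokes
    epochs=100,
    batch=8,
)

metrics = model.val()    # reports Box P, R, mAP@0.50, and mAP@0.50:0.95
```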
Integration into the Pipeline
All three detector types output bounding boxes in the image coordinate system. For the OCR stage, we crop image patches according to either (i) ground-truth boxes (to measure recognition difficulty in isolation) or (ii) detector-predicted boxes (to quantify the impact of localization errors). The resulting transcriptions are then fed into the information extraction component, which maps raw text and positional cues into structured fields and item records. This design allows us to separately analyze the contributions and failure modes of weakly supervised, transformer-based, and YOLO-style detectors within a unified receipt understanding pipeline.
4.2. OCR Methodology
Given detected text regions on a receipt image, the second stage of our pipeline performs optical character recognition (OCR). For each cropped text image x corresponding to a bounding box b, the goal is to predict a character sequence over a vocabulary that includes Hangul syllables, Latin letters, digits, and common punctuation. On KORIE, each of the 17,587 OCR crops is paired with a ground-truth transcription, enabling evaluation under both (i) “perfect detection” (crops from ground-truth boxes) and (ii) realistic detection (crops from model-predicted boxes). We report character error rate (CER) and word error rate (WER).
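For reference, CER and WER can be computed from a standard Levenshtein edit distance; the sketch below is a minimal implementation consistent with how these metrics are defined, not our exact evaluation script.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (characters or tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edits per reference character."""
    return levenshtein(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: edits per reference token (whitespace tokenization)."""
    return levenshtein(ref.split(), hyp.split()) / max(len(ref.split()), 1)
```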
To ensure a balanced comparison, all OCR models with trainable parameters were fine-tuned on the KORIE OCR subset. This includes EasyOCR, PaddleOCR, and the attention-based BiGRU recognizer. Tesseract does not support model-level fine-tuning; instead, we optimized preprocessing and single-line configuration for best performance on KORIE. This setup ensures that differences in performance reflect model architecture rather than a mismatch in training regime.
To cover common deployment practices and stronger neural baselines, we evaluate three categories of recognizers:
A classical engine (Tesseract/pytesseract);
Toolkit-based recognizers (EasyOCR and PaddleOCR), both fine-tuned on KORIE;
A neural attention-based BiGRU model trained on KORIE crops.
4.2.1. Classical OCR Engine (Tesseract/Pytesseract)
Tesseract with the official Korean and English language packs [30], accessed via pytesseract [31], serves as our classical baseline. Tesseract does not expose trainable weights, so fine-tuning is not possible. Instead, we optimize the preprocessing pipeline (grayscale conversion, contrast normalization, fixed-height padding) and use single-line recognition mode to reduce segmentation artifacts that are common in narrow receipt lines. This model therefore represents a strong, language-aware, but non-trainable baseline for comparison.
4.2.2. Toolkit-Based OCR (EasyOCR)
EasyOCR [32] provides multilingual text recognition with a CNN backbone, sequence modeling layers, and CTC decoding. We fine-tune the Korean–English EasyOCR model on the KORIE OCR subset using the official training API. Crops are resized to a fixed height with preserved aspect ratio and padded laterally as necessary. This fine-tuning allows us to evaluate EasyOCR not only as an off-the-shelf toolkit but as a domain-adapted recognizer for Korean thermal receipts.
4.2.3. Toolkit-Based OCR (PaddleOCR)
PaddleOCR [58] is another widely used multilingual OCR toolkit. We fine-tune the PP-OCRv3 multilingual recognizer on KORIE using the standard PaddleOCR training configuration. The same crop normalization as in EasyOCR is applied. This provides a strong domain-adapted baseline that reflects the capabilities of modern OCR toolkits when trained on receipt-specific data.
4.2.4. Neural Attention-Based BiGRU Recognizer
As a fully neural baseline tailored to Korean receipts, we train an attention-based BiGRU recognizer [33,34] from scratch on the KORIE OCR subset. A convolutional feature extractor maps each input crop to a sequence of feature vectors, which are processed by bidirectional GRU layers. An additive attention decoder generates characters one by one, trained with cross-entropy loss using teacher forcing. We apply mild augmentations (horizontal scaling, small rotation, brightness/contrast jitter) to improve robustness to receipt artifacts such as thermal banding and faded ink. Beam search is used at inference to obtain the most likely sequence.
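The recognizer can be summarized by the following PyTorch sketch: a small CNN yields a feature sequence, a BiGRU encodes it, and an additive-attention GRU decoder emits characters under teacher forcing. Layer sizes and the vocabulary handling are illustrative, not our exact configuration.

```python
import torch
import torch.nn as nn

class AttnBiGRURecognizer(nn.Module):
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        # CNN feature extractor: collapses height, keeps width as the time axis.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((1, None)),          # -> (B, 128, 1, W/4)
        )
        self.encoder = nn.GRU(128, hidden, bidirectional=True, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        # Additive (Bahdanau-style) attention over encoder states.
        self.attn_enc = nn.Linear(2 * hidden, hidden)
        self.attn_dec = nn.Linear(hidden, hidden)
        self.attn_v = nn.Linear(hidden, 1)
        self.decoder = nn.GRUCell(hidden + 2 * hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, images, targets):
        """images: (B, 1, H, W); targets: (B, T) teacher-forced character ids."""
        feats = self.cnn(images).squeeze(2).transpose(1, 2)   # (B, W', 128)
        enc, _ = self.encoder(feats)                          # (B, W', 2*hidden)
        B, T = targets.shape
        h = enc.new_zeros(B, self.decoder.hidden_size)
        logits = []
        for t in range(T):
            # Attention scores over encoder positions for the current decoder state.
            scores = self.attn_v(torch.tanh(
                self.attn_enc(enc) + self.attn_dec(h).unsqueeze(1))).squeeze(-1)
            context = (scores.softmax(dim=1).unsqueeze(-1) * enc).sum(dim=1)
            h = self.decoder(torch.cat([self.embed(targets[:, t]), context], dim=1), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                     # (B, T, vocab_size)
```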
Evaluation Settings
Each OCR model is evaluated under two conditions: (1) GT-box OCR, in which models receive crops from ground-truth boxes to isolate recognition difficulty, and (2) detector-crop OCR, in which crops come from predicted boxes produced by the detection models in Section 4.1. The resulting transcriptions, both perfect- and detection-based, are then passed to the information extraction module, allowing us to quantify the propagation of OCR errors into structured downstream tasks.
4.3. Information Extraction Methodology
The final stage of our pipeline maps OCR outputs on a receipt to a structured representation suitable for downstream analytics. Given a set of text lines $\{(t_i, b_i)\}_{i=1}^{N}$, where $t_i$ is the OCR transcription (from either the ground-truth or the detector-crop setting in Section 4.2) and $b_i$ is the corresponding bounding box, the information extraction (IE) module predicts a schema-level object containing (i) header fields and (ii) a line-item table.
4.3.1. Task Formulation
For each receipt r, we define a set of header fields (merchant name, merchant address, merchant phone number, receipt number, date, time, subtotal, tax, and total) together with a table of line items, each carrying attributes such as item name, quantity, unit price, and line total. In total, KORIE provides 2886 annotated header and item instances across 748 receipts. The IE problem is to map the unstructured OCR text $\{t_i\}$ (and optionally the spatial coordinates $\{b_i\}$) into this structured schema.
4.3.2. Input Representation and Linearization
Most IE models operate over text sequences, whereas our annotations are defined over spatially localized lines. To bridge this gap, we first sort all text lines for a receipt r in reading order (top-to-bottom, left-to-right) based on the bounding boxes $b_i$. We then construct a linearized textual context
$$C_r = \tilde{t}_1 \,\|\, \tilde{t}_2 \,\|\, \dots \,\|\, \tilde{t}_N,$$
where $\tilde{t}_i$ is the normalized text of the i-th line. For models that accept only text, we output $C_r$ as is; for models that can exploit layout, we optionally augment each line with its normalized coordinates (e.g., “(x = 0.18, y = 0.07)”). This linearization preserves coarse layout information while presenting the receipt in a form amenable to language models.
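A minimal sketch of this linearization, assuming each line is paired with normalized box coordinates; the sorting heuristic and layout-hint format mirror the description above but are simplified.

```python
def linearize_receipt(lines, with_layout=False):
    """lines: list of (text, (x, y, w, h)) with coordinates normalized to [0, 1].
    Returns the context C_r as a single string in reading order."""
    # Top-to-bottom, then left-to-right; rounding y groups lines on the same row.
    ordered = sorted(lines, key=lambda item: (round(item[1][1], 2), item[1][0]))
    parts = []
    for text, (x, y, _, _) in ordered:
        if with_layout:
            parts.append(f"{text} (x = {x:.2f}, y = {y:.2f})")
        else:
            parts.append(text)
    return "\n".join(parts)
```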
4.3.3. Instruction-Following IE with Language Models
We instantiate the IE module using instruction-following decoder-only language models that take the linearized receipt context as input and generate the target schema as structured text. Concretely, each model receives a prompt of the following form:
You are an information extraction system for Korean retail receipts.
Given the following receipt text, extract the fields {merchant_name, date, total, …} and the list of items, each with {item_name, brand, quantity, unit, unit_price, line_total}.
Return your answer as valid JSON with these keys only.
Receipt: C_r
In this work, all instruction-following language models are evaluated strictly in the zero-shot setting. Although the framework supports few-shot prompting, we do not use 1–3 shot examples in our experiments, in order to ensure consistent, comparable evaluation across all models.
Multiple language models of different sizes are evaluated under this protocol; each model produces a JSON string which we parse into a structured prediction for receipt r. Invalid JSON (e.g., due to truncation or hallucinated keys) is handled by a lightweight post-processor that attempts to repair brackets and remove unknown fields; if parsing still fails, the prediction for that receipt is treated as empty.
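The lightweight post-processor can be sketched as follows; the bracket-repair heuristic and the allowed-key list are simplified stand-ins for our implementation.

```python
import json

ALLOWED_KEYS = {"merchant_name", "date", "total", "items"}  # illustrative schema subset

def parse_prediction(raw):
    """Parse an LLM output into a dict; repair trivially broken JSON,
    drop unknown keys, and fall back to an empty prediction on failure."""
    text = raw.strip()
    start = text.find("{")
    if start != -1:
        text = text[start:]
    # Naive bracket repair for truncated generations (close lists, then objects).
    text += "]" * max(0, text.count("[") - text.count("]"))
    text += "}" * max(0, text.count("{") - text.count("}"))
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return {}
    return {k: v for k, v in obj.items() if k in ALLOWED_KEYS}
```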
4.3.4. Value Normalization and Alignment
Before scoring, both gold and predicted values are normalized to reduce spurious mismatches. For textual fields (e.g., merchant names, item names, brands), we lowercase Latin text, strip extra whitespace, and remove non-informative punctuation. For numeric fields (quantities, unit prices, line totals, subtotals, taxes, and totals), we remove thousands separators, normalize decimal points, and, where applicable, strip currency symbols while preserving the numeric value. Dates and times are converted to a canonical format (YYYY-MM-DD and HH:MM) when possible.
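A condensed sketch of these normalization rules, assuming Korean-won amounts with optional “₩”/“원” markers and common date separators; the rule set used in our scripts is more extensive.

```python
import re
from datetime import datetime

def normalize_number(value):
    """Strip currency symbols and thousands separators, keeping the numeric value."""
    cleaned = re.sub(r"[₩원,\s]", "", str(value))
    return cleaned if re.fullmatch(r"-?\d+(\.\d+)?", cleaned) else None

def normalize_date(value):
    """Convert common receipt date strings to the canonical YYYY-MM-DD form."""
    for fmt in ("%Y-%m-%d", "%Y.%m.%d", "%Y/%m/%d", "%y.%m.%d"):
        try:
            return datetime.strptime(str(value).strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

def normalize_text(value):
    """Lowercase Latin text, drop uninformative punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", str(value).lower())
    return re.sub(r"\s+", " ", cleaned).strip()
```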
For header fields, each receipt contributes at most one gold value per key, and we align predictions by field name. For line items, we treat each annotated item as a multi-field record and match predicted items to gold items using a greedy string-similarity-based alignment over item names and prices (breaking ties with minimal edit distance). Once aligned, we compare individual attributes (e.g., brand, quantity, unit_price) between matched pairs.
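The greedy alignment can be sketched with a simple string-similarity score over item names; the similarity threshold and tie-breaking are illustrative simplifications.

```python
from difflib import SequenceMatcher

def align_items(pred_items, gold_items, min_sim=0.5):
    """Greedily match predicted and gold line items by item-name similarity.
    Each item is matched at most once; returns (pred_idx, gold_idx) pairs."""
    candidates = []
    for i, p in enumerate(pred_items):
        for j, g in enumerate(gold_items):
            sim = SequenceMatcher(None, p.get("item_name", ""),
                                  g.get("item_name", "")).ratio()
            candidates.append((sim, i, j))
    candidates.sort(reverse=True)           # consider the best matches first
    matched_p, matched_g, pairs = set(), set(), []
    for sim, i, j in candidates:
        if sim >= min_sim and i not in matched_p and j not in matched_g:
            matched_p.add(i)
            matched_g.add(j)
            pairs.append((i, j))
    return pairs
```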
4.3.5. Evaluation Metrics
Following common practice in receipt IE, we report both per-field and overall metrics. For each field type c (e.g., brand, quantity, unit, unit_price, line_total, total), we compute the following:
Accuracy: the fraction of receipts (for header fields) or items (for line-item fields) where the normalized predicted value exactly matches the normalized ground truth;
F1 score: micro-averaged F1 over field instances, where a prediction is counted as a true positive if the normalized value matches the ground truth, as a false positive if it is present but incorrect, and as a false negative if the field is omitted.
For line items, we additionally report an overall score that micro-averages F1 across all item attributes, reflecting the end-to-end correctness of the extracted item table. These metrics are computed separately under two text conditions: (1) oracle-text IE, where the model consumes ground-truth transcriptions; and (2) OCR-text IE, where the model consumes OCR outputs from the different recognizers in Section 4.2. Comparing these settings allows us to quantify how OCR errors (and upstream detection errors) propagate into field-level and item-level extraction performance on Korean receipts.
4.3.6. Error Propagation Analysis
To quantify how detection accuracy influences downstream stages, all OCR and IE experiments are conducted under two complementary conditions: (1) GT-box OCR/IE, where models receive crops extracted from ground-truth bounding boxes, and (2) detector-crop OCR/IE, where crops are taken from predicted boxes produced by each detector. The GT-box setting isolates recognition difficulty by removing localization noise, while the detector-crop setting exposes the compounded effects of missed, truncated, or merged text regions. Differences in CER/WER and IE F1 between these two settings quantify the propagation of detection errors into transcription and structured extraction.
5. Experiments
In this section, we describe the experimental setup, evaluation metrics, and empirical results for the three tasks defined on KORIE: information detection (text localization), OCR, and structured information extraction. Unless otherwise stated, all models are trained and evaluated on the splits described in Section 3 and follow the methodology in Section 4.
5.1. Experimental Setup
All detection and OCR experiments were conducted on a workstation equipped with a single NVIDIA RTX 4090 GPU (24 GB VRAM). Information extraction experiments using instruction-tuned large language models were performed on NVIDIA A40 GPUs (64 GB VRAM).
5.1.1. Tasks and Metrics
We evaluate models on three tasks: information detection (text-region localization), OCR (transcription of localized text), and information extraction (mapping OCR outputs to a structured schema).
For detection, we treat all text as a single class and report standard object detection metrics. YOLO-based models are evaluated using Ultralytics’ built-in validator, which provides Box Precision (P), Recall (R), mAP@0.50, and mAP@0.50:0.95. DINO-based models are evaluated using COCO-style Average Precision (AP) at IoU thresholds from 0.10 to 0.90. WSOL baselines are evaluated using the wsolevaluation metrics, which include classification accuracy, overall localization score, and localization at fixed IoU thresholds.
For OCR, we report character error rate (CER) and word error rate (WER) over the 17,587 text crops, in two settings: OCR on ground-truth boxes (GT-box) and OCR on detector-predicted boxes (detector-crop). For information extraction, we report per-field Accuracy and micro-averaged F1 for header fields and line-item attributes, as well as an overall micro-F1 across all item attributes over the 2886 annotated IE instances.
5.1.2. Baselines
For detection, we compare three families of models:
WSOL baselines: CAM, HaS, ACoL, SPG, ADL, and CutMix-style methods with VGG16, InceptionV3, and ResNet-50 backbones, implemented via the WSOL evaluation framework [29];
Transformer-based detectors: DINO-DETR variants (4-scale and 5-scale) with ResNet-50 and Swin-L backbones [28,47];
YOLO family: YOLOv9, YOLOv10, and YOLOv11 one-stage detectors [25,26,27].
For OCR, we consider the classical Tesseract/pytesseract engine, the toolkit-based EasyOCR and PaddleOCR recognizers (both fine-tuned on KORIE), and the attention-based BiGRU model trained on KORIE crops, as described in Section 4.2.
For information extraction, we evaluate instruction-following decoder-only language models strictly in the zero-shot setting, as described in Section 4.3. Models consume either ground-truth transcriptions (oracle-text IE) or OCR outputs from the recognizers above (OCR-text IE).
5.1.3. Implementation Details
All detectors are fine-tuned from publicly available COCO-pretrained checkpoints. For DINO and YOLO models, we adopt default training hyperparameters from the respective repositories and adjust image resolution to better accommodate tall, narrow receipts. WSOL baselines are trained using the original wsolevaluation configuration [29], with minimal changes limited to dataset loading and the number of training epochs.
For the BiGRU OCR model, we resize each crop to a fixed height and variable width (with padding), apply light geometric and photometric augmentations, and train using Adam with early stopping on the validation set. Language models for IE are run in zero-shot mode using a fixed prompt template; all outputs are parsed as JSON and post-processed as described in Section 4.3.
5.2. Detection Results on KORIE
Table 1, Table 2, and Table 3 summarize detection performance on the KORIE test set (119 receipts, 4134 text instances) for YOLO, DINO, and WSOL models, respectively.
5.2.1. YOLO Detectors
Table 1 reports YOLO results. Among fully supervised models, YOLOv11 attains the best overall performance with Box Precision 0.842, Recall 0.852, mAP@0.50 = 0.888, and mAP@0.50:0.95 = 0.762. YOLOv10 offers a strong compact baseline: despite having only 3.0M parameters and 8.1 GFLOPs, it achieves mAP@0.50 = 0.860 and mAP@0.50:0.95 = 0.751, only moderately below YOLOv11. YOLOv9 is slightly weaker (mAP@0.50 = 0.856, mAP@0.50:0.95 = 0.747) but exhibits similar precision/recall (0.832/0.832). Overall, newer YOLO generations provide measurable gains on dense, small-text receipts while retaining the speed and simplicity that make one-stage detectors attractive for deployment.
5.2.2. DINO Detectors
Table 2 presents COCO-style AP for DINO-based detectors. The four-scale ResNet-50 variant reaches AP@0.50 = 0.776 and maintains strong localization up to AP@0.90 = 0.571, indicating that many text boxes are tightly aligned with ground truth. The five-scale ResNet-50 and four-scale Swin-L variants are slightly weaker at AP@0.50 (0.754 and 0.760, respectively) and show similar behavior across IoU thresholds. In practice, DINO yields precise box shapes and handles skewed or rotated text segments well, but has higher computational cost and longer training time than the YOLO family.
5.2.3. WSOL Baselines
Table 3 reports WSOL results for ResNet-50 backbones across CAM, HaS, ACoL, SPG, ADL, and CutMix. Classification accuracies range from 69.1% (HaS) to 91.8% (ACoL), indicating that all methods learn a reasonably effective image-level “text vs. background” signal. However, localization performance is markedly lower. The best configuration (ACoL with ResNet-50) reaches 23.8% overall localization and 6.6% localization at IoU = 0.50, while other methods remain below 5% at IoU = 0.50. Many configurations effectively fail at moderate IoU thresholds, reflecting the difficulty of using image-level supervision alone for dense, small-text layouts such as Korean receipts. Similar trends are observed for VGG16 and InceptionV3 backbones.
5.3. OCR Results on KORIE
Table 4 reports CER and WER for all OCR models on the KORIE OCR subset (17,587 crops) using detector-predicted boxes from YOLOv11. Overall, toolkit-based recognizers perform best, with PaddleOCR achieving the lowest error rates.
Tesseract/pytesseract benefits from explicit Korean language packs but exhibits the highest CER among the compared methods (25.43%), and a WER of 35.26%. EasyOCR substantially improves CER (17.36%) but still suffers from word-level errors (31.43% WER), especially on lines that mix small Hangul with Latin text and digits.
The attention-based BiGRU recognizer, trained exclusively on KORIE crops, achieves a moderate CER (22.69%), lower than Tesseract but higher than the toolkit-based recognizers. However, its WER is noticeably higher (44.47%), indicating that while many individual characters are correct, word segmentation and boundary errors (e.g., merged or split tokens) lead to more frequent word-level mismatches. This highlights the sensitivity of WER to tokenization decisions under noisy OCR conditions.
PaddleOCR provides the strongest overall performance, obtaining the lowest CER (15.84%) and WER (26.73%). This suggests that its recent multilingual pretraining and architecture transfer well to Korean retail receipts once fine-tuned on receipt-specific data.
As discussed in Section 4, OCR is evaluated under both ground-truth bounding boxes (GT-box) and detector-predicted boxes. Table 5 summarizes this comparison and quantifies how localization errors propagate into transcription quality across different OCR models.
Across models, common failure patterns include confusion among visually similar glyphs (e.g., 1/|/I), fragmentation of Hangul jamo under blur or thermal banding, and misreading of decimal and thousands separators in prices. These errors are particularly impactful for line-item and total fields in the IE task, as discussed next.
5.4. Information Extraction Results
We evaluate instruction-following language models on the item-level information extraction task described in Section 4.3, focusing on five key attributes per line item: item_name, quantity, unit_price, size_unit, and total_price. Models consume linearized receipt text with layout hints and are prompted to output a JSON structure containing these fields for each item. All results reported here use OCR text (from our default OCR configuration) rather than ground-truth transcriptions, reflecting a realistic end-to-end pipeline.
Table 6 summarizes overall Accuracy and micro-averaged Precision/Recall/F1 for different instruction-tuned LLMs. The overall scores aggregate across all field instances; per-field scores for the strongest models are shown separately in Table 7. All results are obtained in the zero-shot setting described in Section 4.3.
5.4.1. Impact of Model Size
Across both Llama and Qwen families, larger models consistently outperform smaller ones. For example, within the Llama series, overall F1 increases from 8% (Llama-3.2-1B) to 18% (Llama-3.2-3B) and 25% (Llama-3.1-8B). A similar trend holds for Qwen, where Qwen2.5-3B and Qwen2.5-7B substantially improve over Qwen2.5-1.5B. This indicates that larger models are better able to cope with OCR noise, mixed scripts, and the loosely structured nature of receipt text.
5.4.2. Item Fields vs. Numeric Fields
Per-field scores reveal a strong asymmetry between textual and numeric attributes.
Table 7 reports Accuracy and F1 for the three strongest models. For Llama-3.1-8B-Instruct, item_name and quantity achieve moderate performance (item_name Accuracy 52.16%, F1 ≈ 55%; quantity Accuracy 62.08%, F1 ≈ 66%), while unit_price, size_unit, and total_price are essentially never predicted correctly (near-zero F1). Qwen2.5-3B-Instruct shows a similar pattern with slightly higher scores on item_name and quantity, and Qwen2.5-7B-Instruct performs best on item_name but remains weak on price fields.
These results indicate that current models are relatively effective at identifying item descriptions and integer-like quantities, but struggle to recover monetary values and units. Crucially, this difficulty stems not only from OCR noise but also from a pronounced domain mismatch: instruction-tuned LLMs are not trained on Korean receipt text, small-font Hangul, thermal-printing artifacts, or Korean currency formatting conventions (e.g., thousands separators and “원”). As a result, even when OCR correctly recognizes digits, LLMs may fail to normalize or align price fields in the required schema. Because our evaluation does not isolate OCR-specific errors for numeric fields, the observed failures should be interpreted as a combination of upstream OCR noise and the lack of domain-adapted Korean training data in current LLMs.
5.4.3. Best-Performing Models and Remaining Gap
By overall F1, the strongest models on KORIE IE are Llama-3.1-8B-Instruct and Qwen2.5-3B-Instruct (both around 25% F1, with overall Accuracy ≈ 22–23%), followed closely by Qwen2.5-7B-Instruct (24% F1). These models provide a useful starting point for receipt IE but still leave substantial headroom, especially on price-related fields that are critical for downstream applications such as expense management and analytics.
5.4.4. Error Characteristics and Pipeline Effects
The low performance on unit_price and total_price reflects a combination of factors: OCR misreading digits or decimal/thousands separators, subtle layout cues (e.g., alignment of price columns) that are lost in linearization, and the need for precise numerical matching in our evaluation. In contrast, item_name and quantity can tolerate some lexical variation and are easier for language models to recover from context.
Comparing these IE results with the OCR error rates in Section 5.3 underscores how upstream detection and recognition errors propagate into structured outputs, especially for numerically sensitive fields. Even with relatively strong OCR (e.g., PaddleOCR with CER 15.84% and WER 26.73%), end-to-end item-level F1 remains around 25%, highlighting an open opportunity for tighter integration of visual models, OCR, and structured numerical reasoning. Overall, KORIE exposes a challenging but realistic regime for information extraction from Korean receipts, where even strong instruction-tuned LLMs struggle with price-related attributes under realistic noise.
5.4.5. Role of Domain Mismatch
General-purpose LLMs are typically trained on web-scale text corpora dominated by clean digital Korean, with limited exposure to small-font Hangul, mixed-script retail codes, thermally degraded numeric strings, or receipt-style tabular layouts. This mismatch significantly limits their zero-shot extraction capability on KORIE. Future work will include domain-adapted fine-tuning and an isolated evaluation of numeric OCR accuracy to more precisely characterize error sources.
6. Conclusions and Future Work
We introduced KORIE, a Korean-language receipt benchmark for text detection, OCR, and information extraction under realistic acquisition conditions. The dataset comprises 748 scanned thermal receipts with bounding-box annotations, 17,587 OCR crops, and 2886 item- and field-level IE instances, focusing on dense, small text, mixed Hangul–Latin scripts, and artifacts such as blur, skew, and thermal banding. We benchmarked weakly supervised WSOL methods, transformer-based DINO detectors, and YOLOv9/v10/v11, as well as OCR engines (Tesseract/pytesseract, EasyOCR, PaddleOCR, and an attention-based BiGRU) and zero-shot instruction-tuned LLMs (Llama 3.x and Qwen2.5) for IE. YOLOv11 and PaddleOCR emerge as strong detection/OCR baselines, while zero-shot LLMs achieve only modest item-level F1 (around 25%) and largely fail on price-related fields, reflecting both compounded detection–OCR–IE errors and a strong domain mismatch, as current LLMs are not trained on noisy Korean receipt text or thermal-print numeric conventions.
Our study has several limitations. KORIE is moderate in scale and focused on Korean retail receipts scanned at 300 dpi, so results may not fully transfer to other document types, languages, or more extreme scanning conditions. The baseline suite, while diverse, is not exhaustive, and IE is evaluated only in a zero-shot setting with strict exact matching for numeric fields, which does not credit near-misses or partially correct values.
Future work includes enlarging and diversifying the dataset (e.g., more stores, devices, and degraded receipts), enriching annotations with higher-level semantics (such as vendor categories and discount structures), and exploring few-shot and fully fine-tuned IE models, layout-aware architectures, and OCR-free or multi-task approaches that more tightly couple visual, textual, and numerical reasoning. We release KORIE together with baseline models, OCR crops, IE annotations, and evaluation code to support reproducible research on Korean receipt understanding and to encourage further advances in small-text detection, Hangul-aware recognition, and robust information extraction.