Article

Enhancing Keyword Spotting via NLP-Based Re-Ranking: Leveraging Semantic Relevance Feedback in the Handwritten Domain

by Stergios Papazis, Angelos P. Giotis *,† and Christophoros Nikou
Department of Computer Science and Engineering (CSE), University of Ioannina, 45110 Ioannina, Greece
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Electronics 2025, 14(14), 2900; https://doi.org/10.3390/electronics14142900
Submission received: 29 May 2025 / Revised: 12 July 2025 / Accepted: 18 July 2025 / Published: 20 July 2025
(This article belongs to the Special Issue AI Synergy: Vision, Language, and Modality)

Abstract

Handwritten Keyword Spotting (KWS) remains a challenging task, particularly in segmentation-free scenarios where word images must be retrieved and ranked based on their similarity to a query without relying on prior page-level segmentation. Traditional KWS methods primarily focus on visual similarity, often overlooking the underlying semantic relationships between words. In this work, we propose a novel NLP-driven re-ranking approach that refines the initial ranked lists produced by state-of-the-art KWS models. By leveraging semantic embeddings from pre-trained BERT-like Large Language Models (LLMs, e.g., RoBERTa, MPNet, and MiniLM), we introduce a relevance feedback mechanism that improves both verbatim and semantic keyword spotting. Our framework operates in two stages: (1) projecting retrieved word image transcriptions into a semantic space via LLMs and (2) re-ranking the retrieval list using a weighted combination of semantic and exact relevance scores based on pairwise similarities with the query. We evaluate our approach on the widely used George Washington (GW) and IAM collections using two cutting-edge segmentation-free KWS models, which are further integrated into our proposed pipeline. Our results show consistent gains in Mean Average Precision (mAP), with improvements of up to 2.3% (from 94.3% to 96.6%) on GW and 3% (from 79.15% to 82.12%) on IAM. Even when mAP gains are smaller, qualitative improvements emerge: semantically relevant but inexact matches are retrieved more frequently without compromising exact match recall. We further examine the effect of fine-tuning transformer-based OCR (TrOCR) models on historical GW data to align textual and visual features more effectively. Overall, our findings suggest that semantic feedback can enhance retrieval effectiveness in KWS pipelines, paving the way for lightweight hybrid vision-language approaches in handwritten document analysis.

1. Introduction

In recent years, the demand for efficient information retrieval from large-scale historical document collections has spurred significant interest in recognition-free document indexing. Traditional Optical Character Recognition (OCR) systems often fall short when faced with degraded, cursive, or stylistically inconsistent handwriting found in archival material. To overcome these limitations, Keyword Spotting (KWS) [1] has emerged over the past two decades as a powerful alternative. It enables the retrieval of instances of a given word from document images without requiring full transcription. KWS systems bypass full recognition either by comparing image representations of words with the representation of a reference query word image (Query-by-Example, QbE) or by matching textual embeddings, typically latent representations that link the query string to the visual features of its potential instances (Query-by-String, QbS).
Among various categorizations of the KWS paradigm, we can distinguish two primary directions. Segmentation-based approaches assume document page segmentation at the line or word image level prior to feature extraction and representation. By contrast, segmentation-free approaches operate directly on full document pages. While segmentation-based techniques [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] benefit from a more controlled setting, they rely heavily on accurate preprocessing and annotations, limiting their scalability. In contrast, segmentation-free methods [24,25,26,27,28,29,30,31,32,33] offer a more robust and scalable alternative, especially in zero-shot or low-supervision regimes.
Despite their advancements, current KWS systems have approached a saturation point in leveraging the visual domain alone. Deep neural architectures have significantly improved word image representation. However, they often struggle to capture semantically related words that are visually dissimilar within the document space. This motivates the integration of Natural Language Processing (NLP) techniques as a complementary semantic layer on top of the visual representations.

1.1. Motivation

The central motivation of this work lies in the observation that many top-ranked instances retrieved by state-of-the-art KWS models are either visually misleading or semantically weak matches. As such, there is an increasing need to amplify KWS performance by embedding language-aware mechanisms that can exploit semantic proximity between the query and the retrieved results. Recent advancements in language models, such as Word2Vec [34], FastText [35], and BERT [36], allow for rich contextual representations that encode semantic similarity beyond surface-level matching.
In this work, we introduce an unsupervised semantic relevance feedback mechanism that operates in the latent space of word meanings. By projecting the transcription of each retrieved instance into a high-dimensional semantic space, we evaluate its relevance with respect to the query and re-rank the list accordingly. This re-ranking is driven by a hybrid similarity score that incorporates both the original visual similarity (verbatim) and the semantic proximity in the language domain using their weighted sum (Equation (6)). Our aim is to improve the Mean Average Precision (mAP) of string-based queries without requiring manual annotation or supervised fine-tuning.

1.2. Contribution

The key contributions of this work can be summarized as follows.
  • We present a novel framework for semantic-aware re-ranking in handwritten document image KWS by integrating cutting-edge NLP techniques into the retrieval pipeline. To the best of our knowledge, this is the first attempt to incorporate transformer-based Large Language Models (LLMs) for uncovering semantic relationships with top-ranked query instances, enabling post-retrieval re-ranking that enhances overall KWS performance.
  • We explore two distinct decoding strategies for transcribing word image instances from the initial ranked list: (i) TrOCR [37], a transformer-based vision-to-text model that maps visual input directly to character sequences, and (ii) the character counting and Connectionist Temporal Classification (CTC)-based re-scoring mechanism introduced in [33], a compact Convolutional Neural Network (CNN)-based segmentation-free approach that scores query matches based on dense character probability maps and sequence alignment. These transcriptions are subsequently embedded using state-of-the-art language models (e.g., RoBERTa [38]), projecting each word into a semantically meaningful vector space for downstream re-ranking.
  • We propose a new ranking scheme where each word instance is assigned a composite score: a weighted sum of its verbatim (visual-based) similarity and its semantic similarity to the query. This allows semantically relevant words, possibly missed in the initial visual ranking, to be spotted higher in the final ordering.
  • We perform extensive ablation studies on the George Washington (GW) and IAM datasets using two cutting-edge reference KWS systems that perform directly at page level, namely, KWS-Simplified [33] and WordRetrievalNet [29], evaluating improvements in mAP across different embedding strategies and similarity metrics.
  • Finally, our results illustrate that NLP-based re-ranking not only improves standard KWS performance but also opens the path toward semantically aware information retrieval systems. These systems bridge the gap between visually dissimilar yet semantically related word instances while maintaining a plug-and-play design that is both dataset-agnostic and independent of baseline model retraining. This generalization capability is further supported by the low variability in re-ranking effectiveness across cross-validation folds. Throughout our experiments, we observe consistently improved mAP and low standard deviation. Taken together, our approach pushes the frontier of recognition-free document understanding by integrating semantic reasoning into post-retrieval analysis.
Our work is organized as follows: Section 2 reviews the related work on document-level keyword spotting, focusing on segmentation-free approaches where processing is performed directly on the page image without requiring explicit word or line segmentation. In addition, we cover related attempts within the relevance feedback domain and re-ranking schemes that have inspired our proposed methodology. Section 3 details the baseline KWS reference systems and their integration with the NLP-based techniques we use to extract semantic similarities from the initial exact keyword matches, which can improve overall performance. We also introduce a novel re-ranking scheme that balances verbatim and semantic KWS in a unified framework. Section 4 presents extensive ablation studies validating the impact of semantic relevance on retrieval performance, followed by a deeper discussion and interpretation of our findings, which confirm our initial intuition in this respect. Finally, Section 5 summarizes our contributions and outlines potential directions for future research.

2. Related Works

2.1. Segmentation-Free Keyword Spotting: Key Approaches and Trends

One of the major issues of the preprocessing stage is that segmentation errors regularly propagate to the spotting phase. In particular, accurate word segmentations are difficult to obtain in handwritten and degraded documents. For this reason, several segmentation-free word spotting techniques have emerged.
Early segmentation-free KWS methods addressed the problem of avoiding explicit word or line segmentation by analyzing entire document pages directly. This direction was initially dominated by hand-crafted feature extraction and region-of-interest proposals. Leydier et al. [39] and Zhang and Tan [40] used local keypoints and gradient-based descriptors or Heat Kernel Signatures (HKSs), matched through elastic or manifold-based similarity metrics. However, these techniques incurred high computational costs and did not scale well.
A more scalable direction emerged with patch-based sliding-window frameworks [24,41,42,43], where descriptors like Scale-Invariant Feature Transform (SIFT), Histogram of Gradients (HoG), or pixel densities are extracted over image regions. In this line, Rusiñol et al. [24] enhanced retrieval effectiveness via a Latent Semantic Indexing (LSI) projection, while Rothacker et al. [42] used a Bag-of-Features Hidden Markov Model (BoF-HMM) formulation for robust query modeling.
Graph- and component-based techniques [26,44,45] modeled spatial or structural properties of documents using connected components (CCs) or grapheme graphs, typically matched using graph edit distances or geometric constraints. Kovalchuk et al. [26], winner of the segmentation-free track in the ICFHR 2014 KWS competition, proposed a fast and simple CC-based matching scheme. Other approaches such as [44,45] explored contextual or geometric graph structures, often leveraging binarized document representations. While segmentation-free in design, these methods were sensitive to noise and relied on handcrafted heuristics for constructing word hypotheses.
These early systems, while pioneering, were progressively replaced by deep learning-based methods, which enabled the end-to-end learning of discriminative features and more efficient document-wide inference. Early deep approaches often relied on region proposal mechanisms to locate candidate word regions, but more recent work has shifted toward fully dense or proposal-free retrieval architectures.
Ghosh and Valveny [46] combined region-proposal CNNs with attribute-based deep embeddings (PHOCNet representations [3]) to aggregate features across word-like regions. Wilkinson et al. [28] introduced Ctrl-F-Net, an end-to-end architecture that employs a ResNet34 backbone, Region Proposal Networks (RPNs), and word string embeddings such as the Pyramidal Histogram of Characters (PHOC) [2] and Discrete Cosine Transform of Words (DCToW) [47], to enable robust QbS retrieval. Rothacker et al. [48] further enhanced region detection under uncertainty by incorporating extremal region proposals and class activation maps into the word spotting pipeline.
At the multi-task and multi-scale learning frontier, Zhao et al. [29] integrated a Feature Pyramid Network (FPN) into a CNN architecture, jointly training for pixel classification, bounding box regression, and visual-to-textual embedding learning (via DCToW). These contributions have advanced the segmentation-free paradigm toward dense, discriminative, and scalable retrieval pipelines. Complementary efforts by Wilkinson and Nettelblad [30] introduced a weakly supervised bootstrapping strategy using HMM-based alignment, while Retsinas et al. [33] proposed an efficient proposal-free approach based on character counting and CTC-based re-scoring.
While most prior works in this area focus on visual representations and similarity, recent trends point toward bridging the gap between visual and semantic domains using embeddings that encode language-aware properties [49]. However, such semantic alignment has remained under-explored in segmentation-free KWS settings. This motivates our work, which aims to enhance KWS effectiveness by introducing relevance-aware re-ranking based on language models and semantic embeddings of word image transcriptions.
In what follows, we review relevant literature in retrieval enhancement, particularly focusing on relevance feedback and re-ranking schemes that aim to improve the final ranked list beyond raw visual similarity. This includes both supervised and unsupervised paradigms and sets the foundation for our proposed embedding-based re-ranking method that leverages NLP-driven semantic proximity in the re-ranking process.

2.2. Retrieval Enhancement via Relevance Feedback and Re-Ranking

A prominent challenge in KWS systems is the presence of false positives within the top-ranked retrieval results. To address this, several techniques for refining the ranked list have been proposed, commonly grouped under the umbrella of relevance feedback. These can be broadly classified into supervised and unsupervised methods.

2.2.1. Supervised Relevance Feedback

In supervised relevance feedback (SRF), the user labels a small subset of the retrieval results as relevant or irrelevant, enabling the system to reformulate the query representation or re-rank the results accordingly. Rocchio-style vector updates [50] have been adapted for KWS by Rusiñol et al. [51], combining query refinement with score re-weighting. More recently, Wolf et al. [52] introduced a CNN-based confidence estimation pipeline, employing dropout-based uncertainty, surrogate meta-classifiers, and sigmoid-activated trust scores to identify and suppress false positives in segmentation-free word spotting.
Despite the rapid evolution of relevance feedback techniques in vision-language retrieval, such as CLIP-based interactive search [53], approaches of this type have not yet been adopted by the document analysis and recognition community. In fact, to the best of our knowledge, no new supervised feedback strategies have been introduced in segmentation-free KWS settings since 2020. This highlights a critical research gap and further motivates our embedding-based re-ranking proposal.

2.2.2. Unsupervised Feedback and Re-Ranking

The main benefit of supervised relevance feedback lies in leveraging user judgments to guide ranking refinement. However, this process can be costly and subjective, particularly in degraded or historical documents. This limitation has motivated the adoption of unsupervised alternatives, such as pseudo-relevance feedback (PRF). Almazán et al. [43] introduce a two-stage ranking, combining fast approximate retrieval with Fisher vector-based re-ranking and iterative query expansion. Similar approaches have been employed by Ghosh and Valveny [54] and Shekhar and Jawahar [55], who incorporate spatial pyramid refinements. Vats and Fornés [56] propose a local query expansion strategy based on confidence thresholds and keypoint matching, repeated across document pages to enhance robustness.
While these methods have advanced re-ranking pipelines in segmentation-free KWS, they typically operate on low-level visual similarity. Hence, they fail to capture deeper semantic relations, which are particularly important when dealing with ambiguous visual forms or out-of-vocabulary (OOV) terms. This motivates a shift toward semantic reasoning via language-aware models.
Recent advancements in general information retrieval (IR), including query expansion via neural embeddings, transformer-based re-ranking, and interactive CLIP-style search, have demonstrated substantial gains in retrieval quality. However, such methods often target textual corpora, document-level retrieval, or long natural language queries. Therefore, they are not directly applicable to the constrained setting of segmentation-free word image retrieval. Notable examples include re-ranking in Question Answering (QA) pipelines [57], semantic query expansion with conditional Variational Autoencoders (VAEs) [58], and surveys of deep Query Expansion (QE) techniques [59], all of which focus on textual or structured data rather than single-word visual spotting tasks.
For this reason, and to maintain conceptual coherence, we restrict our focus to visual-domain methods that have been explicitly applied within the keyword spotting literature. We next propose an alternative, embedding-based re-ranking scheme that introduces language-informed semantic priors into the unsupervised refinement process.

2.3. Semantic Knowledge Transfer

In a traditional KWS system, document regions are represented using descriptors such as PHOC or DCToW and matched against a query based solely on visual appearance and character-level similarity. However, words are more than sequences of characters, since they also carry semantic meaning. Even humans may struggle to interpret handwritten text and therefore often rely on the broader context to disambiguate visually similar words. Figure 1 showcases an example of this peculiarity, where a human transcriber must interpret letters as forming a valid word (i.e., ten) that also fits logically within the surrounding sentence and passage. This intuition motivates the integration of semantic reasoning into word spotting systems.
Semantic KWS was first introduced by Wilkinson et al. [47] as a method to enrich document image search with language-level knowledge. Their approach retrieved exact matches and expanded them using semantically related terms (e.g., synonyms, inflected forms, or categorical relationships). This form of semantic expansion (expressed as inexact or partial matching) has proved particularly useful for dealing with hyphenation, word splits across line breaks, or approximate matches.
Semantic retrieval methods can be broadly divided into ontology-based and context-based techniques [60]. Ontology-based approaches [61,62] use resources such as WordNet [63] to identify categorical or lexical similarities.
In contrast, context-based approaches are grounded in the distributional hypothesis [64], which posits that words appearing in similar contexts tend to share similar meanings. Recent advancements in NLP have produced a wide range of text embedding techniques, including Word2Vec, GloVe, FastText, and transformer-based models like BERT, which generate dense vector representations that encode semantic similarity beyond surface-level string matching. These embeddings have opened new opportunities for mapping visual content, such as handwritten words, into semantically meaningful spaces.
To bring this semantic capability into keyword spotting, two general strategies have emerged: (1) learning visual-to-semantic mappings directly in an end-to-end, recognition-free manner, and (2) transcribing word images first, followed by text-based embedding to obtain semantic representations.
End-to-end approaches are particularly common in segmentation-based settings, where they circumvent explicit recognition, a process known to yield irrecoverable errors when its output is mapped directly to embedding spaces. This paradigm was introduced by Wilkinson et al. [47], who trained a two-stage CNN with a cosine embedding loss to project word images into a pre-trained semantic space. Subsequent work by Krishnan et al. [65] extended this approach using the HWNet architecture, which recently evolved into a joint embedding framework (HWNet v3) for recognition and retrieval [19] that jointly learns syntactic (PHOC or DCToW) and semantic (e.g., FastText) representations. Tüselmann et al. [49] further evaluated the impact of different embeddings (including FastText, GloVe, and BERT) using the same architecture and explored combinations thereof for document-level semantic understanding.
These learned embeddings have enabled downstream tasks such as Named Entity Recognition (NER), Visual Question Answering (VQA), and Named Entity Linking (NEL) directly on image documents. However, direct embedding methods remain limited in practice: they require large amounts of annotated training data, and no publicly available pre-trained models currently exist for generic semantic KWS. For example, Wilkinson et al. [47] suggest training on all 40 volumes of The Writings of George Washington from the Original Manuscript Sources, 1745–1799 [66] to adequately capture corpus-level semantics. Other works [19,49] rely on large-scale synthetic datasets of handwritten word images rendered from frequent English words.
Despite these efforts, recent analysis [60] reveals that visual–semantic embeddings often retain mostly syntactic characteristics. This suggests a gap between visual and semantic domains and highlights the underutilized potential of modern pre-trained language models in keyword spotting. To address this, we propose an alternative path: leveraging pre-trained transformer-based language models (RoBERTa, MPNet [67], and MiniLM [68]) to inject semantic reasoning into the KWS pipeline through post-hoc re-ranking. By embedding transcriptions of retrieved word images and comparing them to the query in a semantic space, we enable relevance-aware refinement that complements visual similarity. This lightweight, flexible mechanism enhances both exact and semantic KWS, bridging the gap between vision and language without requiring end-to-end retraining.

3. Materials and Methods

In this section, we introduce a post-processing, semantically-aware relevance feedback mechanism aimed at enhancing the retrieval performance of KWS systems. We begin by describing two state-of-the-art KWS systems in Section 3.1 and Section 3.2, each embodying a distinct approach to traditional segmentation-free KWS, with their ranked lists serving as the basis for our re-ranking. To conclude, we present our proposed framework in Section 3.3.
An overview of the proposed architecture is illustrated in Figure 2. Given a query, the KWS system retrieves the top-k most visually similar image regions from the document collection. Each retrieved word image is decoded, and an LLM projects both its transcription and the query in a shared subspace to compute their semantic similarity. The final ranking is determined by combining both similarity measures.
Additionally, the re-ranking capability of the proposed pipeline extends beyond conventional KWS objectives. As shown in Figure 2, visually similar but irrelevant terms (e.g., “letter”, “understand”, “better”) are replaced with semantically relevant military terms (e.g., “sergeant”, “regiment”). Although this could yield more meaningful retrieval results for users, such qualitative improvements are not reflected in verbatim KWS metrics.

3.1. WordRetrievalNet

The WordRetrievalNet is a state-of-the-art segmentation-free QbS KWS system introduced in [29]. It operates in two stages:
  • Offline stage: A deep neural network (DNN) is trained to generate a database of candidate bounding boxes along with their representations in a latent space.
  • Online stage: A query is matched against the database, returning a ranked list of bounding boxes with the highest cosine similarity to the query representation.
Its end-to-end design avoids the complex pre- and post-processing steps required by traditional KWS methods. Additionally, the FPN-based architecture [69,70] extracts multi-scale features directly from document images, which are processed by three prediction heads:
  • A classification head that identifies pixels belonging to a positive word area;
  • A regression head that predicts the offsets between each pixel of a positive word area and the corresponding bounding box containing it;
  • An embedding head that maps word areas into the latent space (DCToW, PHOC, etc.).
The network is trained in a supervised manner, minimizing the loss function:
$\mathcal{L}_{\mathrm{all}} = \mathcal{L}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{bbox}} + \mathcal{L}_{\mathrm{embed}}. \quad (1)$
For the classification task, a loss based on the Dice similarity coefficient [70,71] is used:
$\mathcal{L}_{\mathrm{cls}} = 1 - \dfrac{2 \sum_{i,j} \hat{y}_{\mathrm{cls}}^{\,i,j} \cdot y_{\mathrm{cls}}^{\,i,j}}{\sum_{i,j} \big(\hat{y}_{\mathrm{cls}}^{\,i,j}\big)^2 + \sum_{i,j} \big(y_{\mathrm{cls}}^{\,i,j}\big)^2}, \quad (2)$
where $\hat{y}_{\mathrm{cls}}^{\,i,j}$ and $y_{\mathrm{cls}}^{\,i,j}$ denote the values of pixel $(i,j)$ in the word classification prediction $\hat{y}_{\mathrm{cls}}$ and the ground truth $y_{\mathrm{cls}}$, respectively. This loss function counteracts the bias introduced by the imbalance between word pixels and background pixels.
For the bounding box regression, a modified Intersection over Union (IoU) loss, called the distance-IoU loss ($D_{\mathrm{IoU}}$) [72], is employed, as expressed by Equation (3):
$\mathcal{L}_{\mathrm{bbox}} = \dfrac{1}{|C|} \sum_{i \in C} D_{\mathrm{IoU}}\big(\hat{y}_{\mathrm{bbox}}, y_{\mathrm{bbox}}\big), \quad (3)$
where $C$ represents the collection of positive elements in the word pixel classification, $\hat{y}_{\mathrm{bbox}}$ is the predicted bounding box, and $y_{\mathrm{bbox}}$ is the ground truth box.
For the word embedding, the cosine loss $\mathcal{L}_{\mathrm{embed}} = 1 - \cos\big(\hat{y}_{\mathrm{embed}}, y_{\mathrm{embed}}\big)$ is used, which penalizes the dissimilarity between the representations $\hat{y}_{\mathrm{embed}}$ and $y_{\mathrm{embed}}$.
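To make these objectives concrete, below is a minimal PyTorch sketch of the three loss terms in Equations (1)–(3). It is an illustration under assumed tensor shapes rather than the authors' released implementation; the $D_{\mathrm{IoU}}$ term is delegated to torchvision's distance_box_iou_loss.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import distance_box_iou_loss

def dice_loss(pred_map: torch.Tensor, gt_map: torch.Tensor) -> torch.Tensor:
    # Equation (2): 1 - 2*sum(p*g) / (sum(p^2) + sum(g^2))
    inter = (pred_map * gt_map).sum()
    denom = (pred_map ** 2).sum() + (gt_map ** 2).sum()
    return 1.0 - 2.0 * inter / denom.clamp(min=1e-8)

def embedding_loss(pred_embed: torch.Tensor, gt_embed: torch.Tensor) -> torch.Tensor:
    # 1 - cosine similarity between predicted and target word embeddings
    return 1.0 - F.cosine_similarity(pred_embed, gt_embed, dim=-1).mean()

def total_loss(pred_map, gt_map, pred_boxes, gt_boxes, pred_embed, gt_embed):
    # Equation (1): unweighted sum of the three task-specific losses
    l_cls = dice_loss(pred_map, gt_map)
    # Equation (3): boxes as (x1, y1, x2, y2), averaged over positive pixels
    l_bbox = distance_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    l_embed = embedding_loss(pred_embed, gt_embed)
    return l_cls + l_bbox + l_embed
```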
During inference, the set of positive word pixels and their offsets are combined to construct bounding boxes for the candidate word areas. A Non-Maximum Suppression (NMS) filter is applied to reduce the density of predictions. The embedding of each bounding box is computed as the mean embedding of the pixels it contains. Figure 3 summarizes the above-mentioned key model components.
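At query time, ranking then amounts to cosine similarity between the query embedding and each candidate box embedding. The following is a simplified sketch of this step, assuming a per-pixel embedding map and omitting NMS:

```python
import torch
import torch.nn.functional as F

def rank_boxes(pixel_embeds: torch.Tensor, boxes: list, query_embed: torch.Tensor):
    """pixel_embeds: (H, W, D) per-pixel embedding map; boxes: (x1, y1, x2, y2)
    tuples; query_embed: (D,) embedding of the query string (e.g., DCToW)."""
    scored = []
    for (x1, y1, x2, y2) in boxes:
        # box embedding = mean embedding of the pixels the box contains
        box_embed = pixel_embeds[y1:y2, x1:x2].reshape(-1, pixel_embeds.shape[-1]).mean(dim=0)
        sim = F.cosine_similarity(box_embed, query_embed, dim=0).item()
        scored.append(((x1, y1, x2, y2), sim))
    return sorted(scored, key=lambda t: t[1], reverse=True)  # highest similarity first
```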

3.2. Segmentation-Free KWS Simplified

Retsinas et al. [33] introduce a segmentation-free QbS KWS system (KWS-Simplified) that formulates KWS as a character counting problem. Their goal is to identify image regions containing the same character histogram as the query. Unlike WordRetrievalNet, its lightweight architecture eliminates the need for synthetic training data. However, to refine the initial predictions and compute their similarity with the query, several post-processing steps are applied. These steps include (1) Pyramidal Counting, where a descriptor similar to PHOC is constructed and compared against the query descriptor, and (2) CTC-based re-scoring via forced alignment. This alignment approach is used to improve the bounding box estimation and refine the ranking of candidates, in line with recent developments in CTC alignment and scoring [73]. NMS is further applied to reduce overlapping predictions.
The system consists of a ResNet [74] backbone with two prediction heads: (1) a decoder head that estimates the probability distribution of each character occurrence across a document image and (2) a scaler head that predicts the character scale at each image location. Figure 4 depicts this architecture.
For a given character $c$, let $F(i, j, c)$ denote the feature probability distribution output by the decoder head at image coordinates $(i, j)$, and let $S(i, j)$ represent the predicted scale factor at the same coordinates. The number of occurrences of character $c$ within a bounding box spanning from $(s_1, s_2)$ to $(e_1, e_2)$ is given by Equation (4):
$y_c = \sum_{i=s_1}^{e_1} \sum_{j=s_2}^{e_2} F(i, j, c) \cdot S(i, j). \quad (4)$
The counting loss is defined as $\mathcal{L}_{\mathrm{count}} = \lVert y_c - t_c \rVert^2$, where $t_c$ is the target count histogram.
During the training of the network, the loss function employed combines the counting loss with the CTC loss [75]. Following the approach of Retsinas et al. [33], a weighting factor of 10 is applied to the counting loss to balance its contribution with the CTC objective:
$\mathcal{L} = \mathcal{L}_{\mathrm{CTC}} + 10 \cdot \mathcal{L}_{\mathrm{count}}. \quad (5)$
Herein, the feature map produced by the decoder head undergoes column-wise max-pooling before being fed into the CTC loss. Figure 5 overviews these post-processing scoring steps.
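The combined objective of Equations (4) and (5) can be sketched in PyTorch as follows; the tensor shapes and the blank index are assumptions made for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)  # blank index assumed to be 0

def counting_loss(F_map, S_map, box, target_hist):
    # Equation (4): soft character counts inside a box from (s1, s2) to (e1, e2)
    s1, s2, e1, e2 = box
    y = (F_map[s1:e1, s2:e2, :] * S_map[s1:e1, s2:e2, None]).sum(dim=(0, 1))
    return ((y - target_hist) ** 2).sum()

def total_loss(F_map, S_map, box, target_hist, targets, target_lengths):
    # Column-wise max-pooling turns the 2D map (H, W, C) into a 1D sequence for CTC
    seq = F_map.max(dim=0).values                 # (W, C)
    log_probs = seq.log_softmax(-1).unsqueeze(1)  # (T, N=1, C), as nn.CTCLoss expects
    input_lengths = torch.tensor([seq.shape[0]])
    l_ctc = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    # Equation (5): counting term weighted by a factor of 10
    return l_ctc + 10.0 * counting_loss(F_map, S_map, box, target_hist)
```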

3.3. Proposed Framework

We conclude this section by presenting our proposed framework: an unsupervised relevance feedback pipeline enabled by the incorporation of semantic information during the re-ranking process. Its goal is to enhance retrieval performance through the suppression of false positive results that appear high in the initial ranking, while simultaneously promoting instances whose original rank underestimates their actual relevance.
The proposed framework operates through three stages. Initially, a verbatim retrieval step takes place: a KWS system generates a ranked list of candidate image regions, ordered by their visual resemblance to a given query. Then, a decoder transcribes each image region (retrieved query instance) from the list. Subsequently, a semantically aware LLM [38] embeds these transcriptions into a semantic space, where spatial proximity reflects semantic relatedness. Finally, the candidates are reordered based on their combined verbatim and semantic similarity.
We examine multiple variations of the framework:
  • Two distinct decoder architectures;
  • Three alternative state-of-the-art semantic LLMs;
  • Two late fusion strategies.
The first decoder is adapted from the KWS-Simplified network. Notably, its character probability prediction head can operate as an independent module that generates transcriptions when a softmax function is applied to its output. For the second decoder, we utilize TrOCR, a state-of-the-art text recognition model. It integrates a Vision Transformer (ViT) encoder [76] with a transformer-based text generator (typically initialized using either RoBERTa or MiniLM). The combined model has been pre-trained on large-scale synthetic textual data and can be fine-tuned on both machine-printed and handwritten document collections.
For the generation of semantic embeddings, we use three state-of-the-art BERT-like LLMs derived from RoBERTa, MPNet, and MiniLM, respectively. Each of these models has been specifically adapted for the task of semantic search through supervised contrastive learning [38]. This paradigm trains models to differentiate between semantically related sentence pairs and randomly sampled negatives. In order to handle OOV terms, while maintaining fixed-dimensional embeddings, the WordPiece [36] subword tokenization technique is used. The final embedding is obtained by mean pooling all token embeddings. Semantic similarity is quantified as the cosine similarity between the generated embedding vectors. Mean pooling is chosen for its increased robustness against transcription errors in individual subtokens.
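As an illustration, the semantic similarity between a query and the decoded transcriptions can be computed with the SentenceTransformers library in a few lines. The snippet below is a sketch of this stage only; the model names are those listed in Section 4.3, and the example words are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

# Any of the three backends can be plugged in here, e.g., "stsb-roberta-base",
# "all-mpnet-base-v2", or "all-MiniLM-L12-v2".
model = SentenceTransformer("all-mpnet-base-v2")

query = "soldiers"
transcriptions = ["sergeant", "regiment", "letter", "soldiers"]

# WordPiece tokenization and mean pooling over token embeddings happen inside
# the model, yielding one fixed-dimensional vector per word.
query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(transcriptions, convert_to_tensor=True)

sim_semantic = util.cos_sim(query_emb, cand_embs)[0]  # cosine similarities
for word, score in zip(transcriptions, sim_semantic.tolist()):
    print(f"{word}: {score:.3f}")
```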
Finally, we evaluate two strategies that fuse visual and semantic relevance: weighted combination of similarities and semantic pruning. In the weighted combination strategy, we compute a weighted average of the verbatim and semantic similarity scores using Equation (6):
$\mathrm{sim}_{\mathrm{combined}} = a \cdot \mathrm{sim}_{\mathrm{semantic}} + (1 - a) \cdot \mathrm{sim}_{\mathrm{verbatim}}, \quad (6)$
where $a \in [0, 1]$ is a hyperparameter controlling the relative importance of semantic similarity in the final score. The resulting values are used to re-rank the candidate list, incorporating both semantic and verbatim relevance.
On the other hand, semantic pruning refers to the process of filtering candidate items by discarding those with a semantic similarity below a predefined threshold. The remaining candidates are then re-ranked based on their verbatim similarity.
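Both fusion strategies reduce to a few lines of array manipulation, sketched below with NumPy. The scores are assumed to be already normalized to a common scale (cf. Section 4.3), and the default values of a and the threshold are illustrative:

```python
import numpy as np

def weighted_rerank(sim_verbatim, sim_semantic, a=0.2):
    # Equation (6): convex combination of semantic and verbatim relevance
    combined = a * sim_semantic + (1.0 - a) * sim_verbatim
    order = np.argsort(-combined)  # indices sorted by descending combined score
    return order, combined[order]

def semantic_pruning(sim_verbatim, sim_semantic, threshold=0.3):
    # Keep only candidates whose semantic similarity exceeds the threshold,
    # then re-rank the survivors by their verbatim similarity alone.
    keep = np.flatnonzero(sim_semantic >= threshold)
    return keep[np.argsort(-sim_verbatim[keep])]
```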

4. Experimental Evaluation

4.1. Datasets

We conducted our experiments on two standard benchmarks for evaluating and comparing KWS systems [1]: the George Washington (GW) dataset [77] and the IAM Handwriting Database (IAM) [78].
The GW database comprises 20 handwritten letters from George Washington’s Papers at the Library of Congress [79]. These 18th-century documents, written in historical English by Washington and his aides, contain 4860 annotated words with corresponding bounding boxes. Due to the minimal variation in the writing style, it can be characterized as a single-writer dataset.
Unlike segmentation-based KWS approaches that employ the standardized partition of Almazán et al. [2] for the GW collection, no official partition exists for the evaluation of segmentation-free methods. Instead, we follow the established experimental practices used in prior works [27,28,29,33]. Given its limited size, it is customary to adopt a four-fold cross-validation scheme, where each fold consists of five pages. Thus, four experimental iterations are conducted. During each iteration, one fold is reserved as the test set, while the remaining three serve as the training set. Additionally, one page from the training set is set aside for validation. Test queries are extracted from the unique transcriptions of the test pages by removing punctuation and lowercasing, whereas stopwords are retained as queries. Table 1 presents the exact partition of the dataset used in our experiments (the partition was obtained by shuffling the page indices 0–19 using NumPy with seed 0).
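The fold assignment can be reproduced with a few lines of NumPy; the exact permutation call shown here is our assumption, since only the seed is stated above:

```python
import numpy as np

np.random.seed(0)                   # seed 0, as used for the reported partition
pages = np.random.permutation(20)   # shuffle the page indices 0-19
folds = pages.reshape(4, 5)         # four folds of five pages each
for i, fold in enumerate(folds):
    print(f"fold {i}: pages {sorted(fold.tolist())}")
```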
The IAM database contains 1539 pages of modern cursive handwritten English text produced by 657 writers. The pages are segmented and annotated, comprising a total of 115,320 words. The variability introduced by the multi-writer setting is a principal factor contributing to the difficulty of the dataset. We use the official partition of the database, as is common practice in the literature [28,33]; however, unlike GW, no cross-validation is performed. The test queries comprise all unique, lowercased transcriptions from the test set, excluding words that contain non-alphanumeric characters, punctuation marks, erroneous annotations, or words appearing in the official stopword list [2,3].

4.2. Evaluation Protocol

We follow the evaluation procedure outlined in [27], which extends the standard Almazán protocol [2] for assessing the performance of segmentation-free QbS KWS methods [28,29,33].
To evaluate the ranking of retrieved document regions, we compute the Mean Average Precision (mAP), a commonly used metric in KWS evaluation. As the name suggests, mAP is calculated as the mean of the Average Precision (AP) across a set of given test queries. The AP for a given query is defined in Equation (7):
$AP = \dfrac{1}{R} \sum_{k=1}^{n} P@k \cdot \mathrm{rel}(k), \quad (7)$
where $P@k$ is the precision of the top $k$ retrieved results (i.e., the fraction of the top $k$ retrieved instances that are relevant), $\mathrm{rel}(k)$ is an indicator function that returns 1 if the retrieved instance at index $k$ is relevant and 0 otherwise, and $R$ denotes the total number of relevant instances for the query. Finally, in the segmentation-free KWS setting, a retrieved region is considered relevant if it sufficiently overlaps with an annotated region in the ground truth. Overlap is measured using the Intersection over Union (IoU) criterion, which is satisfied when it exceeds a predefined threshold. Commonly used thresholds, which we also employ in our study, are 25% (mAP@25) and 50% (mAP@50). If multiple retrieved regions overlap with a single ground truth region, only one is considered relevant, typically the one with the largest overlap.
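A direct implementation of Equation (7) is shown below; relevance is passed in as a binary vector, with the IoU matching against ground truth handled by a helper such as the one sketched here:

```python
import numpy as np

def iou(box_a, box_b):
    # boxes given as (x1, y1, x2, y2)
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter)

def average_precision(rel, R):
    """rel: binary relevance flags of the ranked results (1 if the IoU with an
    unmatched ground-truth box exceeds the threshold, e.g., 0.25 or 0.5);
    R: total number of relevant instances for the query."""
    rel = np.asarray(rel, dtype=float)
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / R)

print(average_precision([1, 0, 1, 1], R=3))  # 0.806
```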

4.3. Implementation Details

Both WordRetrievalNet and KWS-Simplified offer open-source implementations, as well as pre-trained models. However, while KWS-Simplified includes training on the IAM dataset, neither system has been pre-trained on GW.
In order to enable a direct comparison between these architectures and to establish a rigorous baseline against which we evaluate the relative gains of our framework, we trained both systems on GW using the typical four-fold cross-validation scheme described in Section 4.1 and the partition shown in Table 1. For each cross-validation iteration, we trained a WordRetrievalNet model for 120 epochs and a KWS-Simplified model for 200 epochs, evaluating mAP@25 performance on the validation set every 10 epochs and retaining the best-performing model as the final baseline. The training configuration of each system (e.g., optimizer selection, hyperparameter values) followed the settings reported by Zhao et al. [29] and Retsinas et al. [33] as those yielding the highest mAP performance. The reproduced and reported mAP scores are recorded in Table 2.
Similarly, WordRetrievalNet was trained on the official IAM split for 80 epochs, achieving an mAP@25 of 79.15% and an mAP@50 of 72.85%. It is worth noting that Zhao et al. [29] do not report results for this dataset.
While our reproduced results deviate from prior works, our goal is to establish a baseline reference system, and therefore, such observed divergences are to be expected and can be attributed to several factors.
First and foremost, the original implementations used different partitions of GW. In the case of WordRetrievalNet, an additional source of variability arises from the use of randomly generated synthetic training data.
Second, we note minor implementation-specific differences in the training data preparation pipelines. Since both systems are segmentation-free, they employ a detection phase in which the model identifies image regions that are likely to contain words. These candidate regions are then refined to match the query and produce the final retrieval results. To train such a detector effectively, it is not sufficient to use only the tight bounding boxes of the ground truth word images. Instead, it is beneficial to enlarge these regions so that the detector learns to segment words more reliably. Overestimating the bounding box is generally preferable to underestimating it, as missing a word entirely would hinder recall.
In the training of KWS-Simplified on GW, the authors generate training images by extending the ground truth bounding boxes by a constant factor. In our implementation, we enlarged the word image areas by a factor of 2.5. We observed that this larger context window helped the model learn more robust representations corresponding to the segmented word regions, which in turn led to a substantial improvement in mAP@50.
Furthermore, the implementation of WordRetrievalNet uses training data extracted as image crops that cover regions larger than the ground truth word bounding boxes, often including multiple word instances. We note here that recovering the exact parameterization that yielded the optimal result in the original work was not possible. Therefore, we modified the patch cropping algorithm used for extracting positive examples during the training of WordRetrievalNet, aiming to improve the computational efficiency of the training loop. The original algorithm repeatedly sampled image patches until no word within a patch crossed its boundary or until a limit of 1000 iterations was reached. In practice, this limit was frequently exhausted, creating an artificial bottleneck that significantly slowed down training and rendered the computation CPU-bound. Our modified approach addresses this issue by retaining only the words for which at least 70% of the bounding box falls within the sampled patch. We considered this trade-off between computational efficiency and training quality worthwhile, although its impact is most noticeable in the mAP@50 metric.
Finally, some variability is inevitably introduced by the stochastic nature of neural network training, such as the random weight initialization and the random sampling during optimization.
As discussed earlier in Section 3.3, we evaluate two distinct decoder architectures for transcribing retrieved image regions. Each KWS-Simplified-based decoder inherently shares weights with the corresponding KWS-Simplified backbone trained on the same partition, thereby requiring no additional training before deployment. For the TrOCR-based alternative, we initialized the model in each cross-validation iteration of GW using weights from a handwritten text recognition (HTR) model [80] pre-trained on the IAM database. Next, the model was fine-tuned for 20 epochs using the set of word images from the training set of the current iteration, while the queries from one page were used for validation. We employed the AdamW optimization algorithm [81] with an initial learning rate of $5 \times 10^{-5}$, which increased linearly during a short warm-up period and then decayed linearly for the remaining epochs. The fine-tuning reduced the average Character Error Rate (CER, a standard metric in OCR and HTR that measures the number of character-level errors in a predicted transcription compared to the ground truth) on the validation sets of GW from 26.76% to 11.05%. Although the strong 3.42% CER performance reported by TrOCR on IAM [37] suggests that there is room for improvement on GW, accurate retrieval does not necessarily require explicit transcription. We report CER solely to support reproducibility. Moreover, given this performance, we used the decoder in our experiments on IAM without further training.
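The fine-tuning configuration can be expressed with standard HuggingFace components. The sketch below assumes the public microsoft/trocr-base-handwritten checkpoint and a 10% warm-up fraction; the actual initialization used the pre-trained HTR weights of [80], and steps_per_epoch depends on the fold size:

```python
import torch
from transformers import (TrOCRProcessor, VisionEncoderDecoderModel,
                          get_linear_schedule_with_warmup)

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
num_epochs, steps_per_epoch = 20, 500          # steps_per_epoch is illustrative
total_steps = num_epochs * steps_per_epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),   # short linear warm-up (assumed fraction)
    num_training_steps=total_steps)            # followed by linear decay

# Per batch of word images and their transcriptions:
#   pixel_values = processor(images, return_tensors="pt").pixel_values
#   labels = processor.tokenizer(texts, return_tensors="pt", padding=True).input_ids
#   loss = model(pixel_values=pixel_values, labels=labels).loss
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```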
To embed the decoded transcriptions into a semantic space, we employed three pre-trained models from the SentenceTransformers library [82]: (1) stsb-roberta-base [83] (the RoBERTa architecture fine-tuned on the Semantic Textual Similarity benchmark [84]); (2) all-mpnet-base-v2 [85] (the MPNet architecture fine-tuned on a corpus of diverse datasets comprising over one billion sentence pairs); and (3) all-MiniLM-L12-v2 [86] (the MiniLM architecture fine-tuned on the same diverse corpus).
Finally, both WordRetrievalNet and the three semantic embedding models utilize cosine distance to compute the verbatim and semantic similarities, respectively. In comparison, the KWS-Simplified method relies on a similarity measure derived from CTC scores. To ensure that these values are on a comparable scale when combined, we normalize the CTC-based scores to the range $[-1, 1]$ using the minimum and maximum values observed in the training set.
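For instance, this min-max rescaling of the CTC-based scores onto the cosine range can be written as follows (lo and hi denote the minimum and maximum scores observed on the training set):

```python
import numpy as np

def normalize_ctc_scores(scores, lo, hi):
    # min-max rescaling of CTC-based scores to [-1, 1], making them directly
    # comparable with cosine similarities during fusion
    return 2.0 * (np.asarray(scores, dtype=float) - lo) / (hi - lo) - 1.0
```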

4.4. Ablation Experiments on Fusion Strategy and Decoder Choice

This section presents ablation studies conducted to evaluate the impact of different configurations on the proposed pipeline. These experiments aim to isolate the effects of the re-ranking strategy and assess the generalization ability of our method across different initial retrieval systems. Table 3 and Table 4 present the numerical results obtained after re-ranking the initial WordRetrievalNet and KWS-Simplified ranking lists using the proposed weighted combination strategy. These results confirm our intuition that the combination of verbatim and semantic information actually enhances the performance. In contrast, Table 5 shows the results for the alternative strategy based on semantic pruning. Despite its aim of assisting retrieval by discarding semantically irrelevant instances, this strategy is overly simplistic and effectively hampers system performance. The performance on GW is reported as the average mAP@25 and mAP@50 across the experimental iterations, along with the corresponding standard deviations (SDs) across the four trials. The contribution of each pipeline component is discussed in detail below.

4.4.1. Impact of the Baseline KWS Model

The effect of our semantic re-ranking pipeline varies depending on the baseline KWS model. On the one hand, WordRetrievalNet exhibits stronger improvements compared to the KWS-Simplified variant. As shown in Table 3 and Figure 6, the gains on the GW dataset reach 2.3% in mAP@25 (from 94.31% to 96.59%) and 0.9% in mAP@50 (from 88.29% to 90.17%). Even more noticeable are the improvements on the IAM dataset, where mAP@25 rises by 3% (from 79.15% to 82.12%) and mAP@50 by 2.6% (from 72.85% to 75.43%).
On the other hand, KWS-Simplified achieves smaller gains, as reported in Table 4 and Figure 7. On GW, the re-ranking yields an increase of 1.2% in mAP@25 (from 89.74% to 90.94%) and 0.7% in mAP@50 (from 72.29% to 72.97%). On IAM, the respective gains are 1.9% (mAP@25: 86.40% to 88.25%) and 1.2% (mAP@50: 63.73% to 64.88%).
These differences can be attributed to the design of each backbone. WordRetrievalNet produces a larger and richer set of candidate detections for each query, since it performs query-independent spotting by precomputing visual features (e.g., DCToW) across the document. This allows the re-ranking module to refine the initial results more effectively, as it has access to more potential matches, including those ranked low in the initial retrieval.
Conversely, KWS-Simplified retrieves a very limited set of candidates, typically just a handful per query. This restricts the impact of re-ranking, as relevant instances not retrieved initially cannot be recovered in later stages. Therefore, KWS-Simplified may benefit from integrating query expansion techniques aimed at enlarging the initial candidate pool, making semantic re-ranking more effective.

4.4.2. Impact of the Decoder

The decoder module directly influences the effectiveness of the re-ranking process, as reflected in the performance variations across datasets and architectures. On GW, the KWS-Simplified decoder yields higher mAP gains than TrOCR, improving mAP@25 by 2.3% (from 94.31% to 96.59%) and mAP@50 by 0.9% (from 88.29% to 90.17%). In contrast, TrOCR shows smaller gains: 0.6% for mAP@25 (from 89.74% to 90.38%) and 0.3% for mAP@50 (from 72.29% to 72.57%).
This pattern reverses on IAM, where the TrOCR decoder, when paired with WordRetrievalNet, achieves stronger improvements: 3% (mAP@25: from 79.15% to 82.16%) and 2.6% (mAP@50: from 72.85% to 75.43%). The KWS-Simplified decoder on the same dataset shows slightly lower gains: 2% for mAP@25 and 1.6% for mAP@50. These trends are summarized in Table 3.
Such results are consistent with prior work in KWS: the final performance is shaped not only by the quality of the decoder but also by the accuracy of the initial bounding boxes [87]. Even a strong decoder like TrOCR may struggle on GW due to limited candidate diversity, whereas it benefits more on IAM, where segmentation is more fine-grained and the linguistic space is richer. This highlights the interplay between decoder expressiveness and the underlying retrieval quality.
Ultimately, these findings emphasize the importance of holistic system design. Decoder selection should be based not only on CER or transcription quality but also on how well it complements the retrieval front end. While our current strategy employs a simple late-fusion scheme, it still provides meaningful gains with no additional supervision and minimal computational overhead.

4.4.3. Semantic Embedding Models

We expected that the more diversely trained semantic embedding models (all-MiniLM-L12-v2, all-mpnet-base-v2) would outperform stsb-roberta-base. However, across all model and decoder combinations, we observed at most a 0.5% difference in both mAP@25 and mAP@50, with performance typically being indistinguishable across the GW and IAM benchmarks. This can be attributed to factors such as transcription errors, the reliance on word-level context alone, and dataset limitations (i.e., the small size of GW), all of which prevent the more expressive models from demonstrating their full semantic representational power.

4.4.4. Fusion Strategy: Weighted Combination

The weighted combination of verbatim and semantic relevance scores leads to consistent performance gains, as showcased in Table 3 and Table 4, along with Figure 6 and Figure 7. Nonetheless, the extent of the improvement and the importance of the semantic relevance level depend on the performance of the underlying keyword spotter and the decoder architecture.
On the GW dataset, the already high performance of WordRetrievalNet shifts the optimal semantic relevance weight toward lower values, favoring verbatim relevance, for both decoders. This effect is less pronounced for the KWS-Simplified decoder, where optimal performance is achieved when the semantic weight lies within [0.2, 0.5]. In contrast, for the TrOCR-based decoder, as well as WordRetrievalNet on IAM, the best performance occurs when the semantic importance is minimal, typically within [0.1, 0.2]. Outside these ranges, and particularly when relying solely on semantic similarity, performance deteriorates. This decline indicates that recognition errors introduced during transcription propagate into the semantic space as well. Additionally, it highlights the importance of leveraging both verbatim and semantic signals for effective re-ranking. In each of these cases, the performance drops sharply as the level of semantic importance increases. For instance, in the TrOCR-decoded pipeline, mAP@25 decreases from 94.31% to approximately 65%, and mAP@50 drops from 88.29% to about 62% on average over all choices of semantic embeddings.
By comparison, when re-ranking the obtained results from the KWS-Simplified backbone, as shown in Table 4, the initial ranking is less accurate, and in turn, the best results are skewed toward the semantic end. In this case, the re-ranking consistently outperforms the baseline for the KWS-Simplified decoder but remains on par with its TrOCR counterpart. Notably, there appears to be a synergy between the KWS-Simplified decoder and the baseline model. Since the decoder is an integral part of the baseline, the initially predicted bounding boxes are optimally aligned for the decoder to use. Consequently, the incorporation of semantic relevance offers clear advantages for this model.

4.4.5. Fusion Strategy: Semantic Pruning

Thus far, we have presented results based on the weighted combination strategy, which consistently delivers the best performance among the tested fusion methods. In contrast, semantic pruning, a simpler alternative, performs noticeably worse across all configurations, as shown in Table 5 and Figure 8.
The key limitation of semantic pruning lies in its rigid filtering rule, which accepts only candidates exceeding a predefined similarity threshold. While this may boost semantic purity, it often leads to valid instances being discarded due to minor recognition or decoding errors. As a result, retrieval performance deteriorates, especially under stricter thresholds.
Despite its suboptimal results in verbatim keyword spotting, this strategy serves an illustrative role: it highlights the value of combining semantic and lexical signals rather than treating them in isolation. In future applications, semantic pruning might be better suited for tasks where conceptual alignment or query expansion is more critical than exact string matches.

4.4.6. Qualitative Analysis

Previous sections evaluated the quantitative performance of the proposed framework through mAP metrics. We now examine its qualitative benefits: the system successfully augments retrieval with semantic matches while preserving exact lexical matching capabilities. Visual examples demonstrate these enhancements, which transcend standard evaluation indices of verbatim KWS.
Figure 9 shows three top ten ranked lists for the query “forgot”: the first ranked by conventional KWS (verbatim) similarity, the second ranked by semantic similarity, and the last ranked by the combined similarity of our proposed system. For each instance, the top-left corner displays the global rankings using different colors: purple for verbatim, yellow for semantic, and blue for combined similarity. The bottom-right corner shows the similarity scores with the query, following the same color scheme. The ground-truth bounding box is highlighted in green.
The relevant instance corresponding to the query “forgot” is initially ranked tenth, following several instances of “fort”, due to a suboptimal bounding box prediction. Both the semantic ranking and the combined re-ranking successfully identify its true relevance, boosting its score while reducing the scores of visually similar false positives corresponding to “fort”. However, visually similar results still dominate the combined ranking due to a number of factors, including the limited size and variability of GW, as well as significant class imbalance: a considerable number of words have only a few semantically relevant instances in the dataset, or none at all. Although the re-ranking process successfully moves the correct instance to the first position, its bounding box remains unchanged. As a result, under stricter overlap thresholds, the correction is effectively ignored, which explains the higher performance improvement in mAP@25 over mAP@50.
Another qualitatively interesting result of the proposed pipeline, potentially useful in recognition-free word-image retrieval, is illustrated in Figure 10. In a similar fashion to the previous example, we present a ranked list for the query “soldiers”. While the vocabulary of the GW dataset generally lacks semantic depth due to its limited size, this effect is less prevalent in certain areas, such as military terminology. In such a scenario, the qualitative benefit is clear: when a user searches for military-related terms (e.g., “soldiers”), the system is more likely to retrieve other relevant military terms than unrelated results. Additionally, note that there is only one instance of the word “soldiers” in the dataset. The second instance in the initial verbatim retrieval task is an erroneous duplication of the first, possibly due to suboptimal NMS. The final re-ranking correctly discards this duplicate, a qualitative improvement not effectively measured by mAP.

4.5. Discussion

In a nutshell, our semantic re-ranking framework operates as a modular, plug-and-play extension to existing KWS systems, requiring neither retraining nor dataset-specific adaptation. In contrast with approaches that rely on fine-tuned embeddings or corpus-dependent training, our method is able to generalize effectively across datasets. This generalization capability is further supported by consistent performance gains observed across both datasets. To ensure robustness of the GW results despite the limited size of the dataset, we reported mAP scores averaged over four experimental trials. The consistently low standard deviation observed for all experiments showcases the reliability and stability of the proposed semantic augmentation pipeline. This robustness is further evidenced by the overall effectiveness of the method considering different semantic embedding models, which highlights its adaptability to varying NLP backends.
The broader candidate pool and richer semantic space of IAM allow the semantic re-ranking to manifest more clearly, facilitating its ability to exploit higher-level meaning when more linguistic diversity is available. These results validate the generalization capacity of our method to adapt to heterogeneous data scenarios. Notably, WordRetrievalNet features improvements of 3% in mAP@25 (from 79.15% to 82.12%) and 2.6% in mAP@50 (from 72.85% to 75.43%), while KWS-Simplified achieves +1.85% (from 86.40% to 88.25%) and +1.15% (from 63.73% to 64.88%), respectively.
In our numerical results over GW, we observe consistent improvements for both baseline models when semantic relevance is incorporated. This behavior holds across both mAP@25 and mAP@50 metrics, regardless of the decoder architecture. WordRetrievalNet achieves the most substantial gain, with mAP@25 increasing by 2.3% (from 94.31% to 96.59%), while KWS-Simplified improves by 1.2% (from 89.74% to 90.94%).
Although the relative improvements in mAP@50 are more modest (typically below 1%), this metric does not fully capture the value of our method. The high baseline performance suggests that the remaining false negatives are inherently difficult cases, often involving retrieved instances that fall below the IoU threshold due to imperfect bounding box predictions by the underlying word spotter. As a result, even when re-ranking elevates a semantically relevant instance, it may not be counted as correct under the strict overlap criterion. This evaluation limitation underscores the need for post-retrieval refinements that can adjust the predicted bounding boxes. Future extensions such as query expansion [43], late fusion [88], or joint re-ranking and refinement modules could alleviate this issue.
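The effect is easy to reproduce numerically. In the toy case below (with invented coordinates), a correctly re-ranked instance whose predicted box covers only part of the ground-truth word passes the 25% overlap criterion but fails the 50% one, so it counts as a hit for mAP@25 while remaining a miss for mAP@50. The iou helper is the same as in the NMS sketch above.

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format; same helper as in the NMS sketch."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

predicted = (10, 10, 60, 40)       # truncated box from the word spotter
ground_truth = (10, 10, 130, 40)   # full extent of the handwritten word
overlap = iou(predicted, ground_truth)     # ~0.42 for these boxes
print(overlap >= 0.25, overlap >= 0.50)    # True False: counted at mAP@25, rejected at mAP@50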
A natural direction for future exploration involves moving beyond late fusion toward fully trainable semantic re-ranking modules. Integrating fusion architectures inspired by recent vision-language models, such as CLIP-based re-rankers [89] or BLIP-2 [90], could enable richer cross-modal interactions between query and candidate embeddings. While such models introduce additional training complexity, they offer the potential to jointly optimize retrieval, decoding, and semantic alignment in a single unified pipeline.

5. Conclusions

This work introduced a lightweight, modular re-ranking pipeline for segmentation-free keyword spotting (KWS) that incorporates semantic relevance signals derived from large pre-trained language models. Motivated by the observation that many top-ranked results from visual-only KWS systems are semantically misaligned with the user’s intent, we proposed a simple yet effective mechanism that blends visual and semantic similarity scores to re-order retrieved word instances without retraining any component of the baseline system.
Experimental results on both the George Washington and IAM datasets validate our initial hypothesis: semantic feedback can systematically enhance precision while maintaining recall. Considering two representative segmentation-free baselines (WordRetrievalNet and KWS-Simplified) and different decoding strategies, we observed consistent improvements in Mean Average Precision (mAP), most notably an absolute gain of up to ∼2.3% in mAP@25 for WordRetrievalNet on GW, as well as a ∼3% improvement on IAM when WordRetrievalNet is paired with TrOCR. This confirms that integrating contextual word embeddings into the retrieval loop improves retrieval quality even in recognition-free setups.
The robustness of our framework is further supported by its low variance across all cross-validation folds and its insensitivity to the choice of semantic embedding model. The optimal fusion weights suggest that a balanced contribution of verbatim and semantic signals yields the best performance. Furthermore, qualitative examples show that semantically relevant but visually dissimilar terms can be effectively elevated in the ranking, demonstrating the ability of the method to go beyond mere pattern matching.
Looking ahead, several promising research directions emerge:
  • Exploring transformer-based late fusion strategies that support end-to-end trainable re-ranking, inspired by cross-modal architectures such as CLIP-based re-rankers and BLIP-2;
  • Incorporating dynamic query expansion mechanisms via LLMs to improve semantic coverage and reduce reliance on exact phrasing (a minimal sketch of this idea follows the list);
  • Integrating joint optimization objectives with vision–language models to enhance alignment between visual content and textual intent.
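As a hint of how the second direction could look in practice, the sketch below expands a query with its nearest semantic neighbors drawn from the collection vocabulary; each expanded term would then be spotted separately and the resulting ranked lists fused. This is a speculative outline, not part of the current pipeline: the vocabulary variable, the model choice, and the top-k value are illustrative assumptions.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L12-v2")

def expand_query(query, vocabulary, top_k=3):
    """Return the query plus its top-k semantic neighbors from the vocabulary."""
    q_emb = model.encode(query, convert_to_tensor=True)
    v_embs = model.encode(vocabulary, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, v_embs, top_k=top_k)[0]
    return [query] + [vocabulary[h["corpus_id"]] for h in hits]

# Hypothetical outcome: expand_query("soldiers", gw_vocabulary) might yield
# ["soldiers", "troops", "regiment", "officers"], each spotted separately
# before fusing the ranked lists.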
We also acknowledge the emergence of hybrid architectures in document layout analysis that integrate object detectors with Vision Transformers. A notable example is the Vision Grid Transformer (VGT) proposed in [91], which adopts a two-stream framework combining visual and grid-based textual features, achieving state-of-the-art results across multiple benchmarks. Despite their impressive performance, such models typically rely on large-scale annotated datasets, a requirement that is often infeasible to meet in historical manuscript collections. This limitation reinforces the relevance of our segmentation-free KWS approach, which is designed to operate effectively even under limited supervision.
Ultimately, our findings encourage a shift in perspective wherein deep language models need not be confined to transcription but can actively contribute to the semantic understanding and retrieval of handwritten documents. We hope this work paves the way toward more intelligent, generalizable, and semantically aware document analysis systems.

Author Contributions

Conceptualization, A.P.G. and C.N.; methodology, A.P.G. and S.P.; software, S.P.; validation, A.P.G. and S.P.; formal analysis, A.P.G. and S.P.; investigation, A.P.G. and S.P.; resources, A.P.G. and S.P.; data curation, S.P.; writing—original draft preparation, A.P.G. and S.P.; writing—review and editing, A.P.G. and C.N.; visualization, S.P.; supervision, A.P.G. and C.N.; project administration, C.N.; funding acquisition, A.P.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been co-financed by the European Union–NextGenerationEU through the National Recovery and Resilience Plan “Greece 2.0”, under the Action “Clusters of Research Excellence (CREs)” (Project: DEVIATE, Project Code: YP3TA-0560419, MIS: 5180519), implemented by the Executive Structure of the NSRF of the Ministry of Education, Religious Affairs and Sports.

Data Availability Statement

The data and code presented in this study are publicly available at https://github.com/stevepapazis/kws-semantic-reranking (accessed on 28 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Giotis, A.P.; Sfikas, G.; Gatos, B.; Nikou, C. A survey of document image word spotting techniques. Pattern Recognit. 2017, 68, 310–332. [Google Scholar] [CrossRef]
  2. Almazán, J.; Gordo, A.; Fornes, A.; Valveny, E. Word Spotting and Recognition with Embedded Attributes. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2552–2566. [Google Scholar] [CrossRef] [PubMed]
  3. Sudholt, S.; Fink, G.A. Evaluating Word String Embeddings and Loss Functions for CNN-Based Word Spotting. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 493–498. [Google Scholar] [CrossRef]
  4. Wei, H.; Zhang, J.; Liu, K. A Hybrid Representation of Word Images for Keyword Spotting. In Proceedings of the Neural Information Processing, ICONIP 2020, Bangkok, Thailand, 18–22 November 2020; Yang, H., Pasupa, K., Leung, A., Kwok, J., Chan, J., King, I., Eds.; Communications in Computer and Information Science; Springer: Cham, Switzerland, 2020; Volume 1332, pp. 3–15. [Google Scholar] [CrossRef]
  5. Wolf, F.; Brandenbusch, K.; Fink, G.A. Improving Handwritten Word Synthesis for Annotation-free Word Spotting. In Proceedings of the 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), Dortmund, Germany, 8–10 September 2020; pp. 61–66. [Google Scholar] [CrossRef]
  6. Marcelli, A.; De Gregorio, G.; Santoro, A. A Model for Evaluating the Performance of a Multiple Keywords Spotting System for the Transcription of Historical Handwritten Documents. J. Imaging 2020, 6, 117. [Google Scholar] [CrossRef] [PubMed]
  7. Parziale, A.; Capriolo, G.; Marcelli, A. One Step Is Not Enough: A Multi-Step Procedure for Building the Training Set of a Query by String Keyword Spotting System to Assist the Transcription of Historical Document. J. Imaging 2020, 6, 109. [Google Scholar] [CrossRef] [PubMed]
  8. Cheikhrouhou, A.; Kessentini, Y.; Kanoun, S. Multi-task learning for simultaneous script identification and keyword spotting in document images. Pattern Recognit. 2021, 113, 107832. [Google Scholar] [CrossRef]
  9. Daraee, F.; Mozaffari, S.; Razavi, S.M. Handwritten keyword spotting using deep neural networks and certainty prediction. Comput. Electr. Eng. 2021, 92, 107111. [Google Scholar] [CrossRef]
  10. Wolf, F.; Fischer, A.; Fink, G. Graph Convolutional Neural Networks for Learning Attribute Representations for Word Spotting. In Proceedings of the Document Analysis and Recognition–ICDAR 2021, Lausanne, Switzerland, 5–10 September 2021; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2021; Volume 12821, pp. 51–66. [Google Scholar] [CrossRef]
  11. Retsinas, G.; Sfikas, G.; Nikou, C.; Maragos, P. From Seq2Seq Recognition to Handwritten Word Embeddings. In Proceedings of the 32nd British Machine Vision Conference (BMVC), Online, 22–25 November 2021; pp. 1–14. Available online: https://www.bmvc2021-virtualconference.com/assets/papers/1481.pdf (accessed on 4 May 2025).
  12. Kundu, S.; Malakar, S.; Geem, Z.; Moon, Y.; Singh, P.; Sarkar, R. Hough Transform-Based Angular Features for Learning-Free Handwritten Keyword Spotting. Sensors 2021, 21, 4648. [Google Scholar] [CrossRef]
  13. Majumder, S.; Ghosh, S.; Malakar, S.; Sarkar, R.; Nasipuri, M. A voting-based technique for word spotting in handwritten document images. Multimed. Tools Appl. 2021, 80, 12411–12434. [Google Scholar] [CrossRef]
  14. De Gregorio, G.; Biswas, S.; Souibgui, M.A.; Bensalah, A.; Lladós, J.; Fornés, A.; Marcelli, A. A Few Shot Multi-representation Approach for N-Gram Spotting in Historical Manuscripts. In Proceedings of the Frontiers in Handwriting Recognition, ICFHR 2022, Hyderabad, India, 4–7 December 2022; Porwal, U., Fornés, A., Shafait, F., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; Volume 13639, pp. 3–17. [Google Scholar] [CrossRef]
  15. Ghilas, H.; Gagaoua, M.; Tari, A.; Cheriet, M. Spatial Distribution of Ink at Keypoints (SDIK): A Novel Feature for Word Spotting in Arabic Documents. Int. J. Image Graph. 2022, 22, 2250035. [Google Scholar] [CrossRef]
  16. Gongidi, S.; Jawahar, C. Handwritten Text Retrieval from Unlabeled Collections. In Computer Vision and Image Processing (CVIP 2021), Rupnagar, India, 3–5 December 2021; Communications in Computer and Information Science; Springer: Cham, Switzerland, 2022; Volume 1568, pp. 3–13. [Google Scholar] [CrossRef]
  17. Giotis, A.P.; Sfikas, G.; Nikou, C. Adversarial Deep Features for Weakly Supervised Document Image Keyword Spotting. In Proceedings of the 2022 IEEE 14th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), Nafplio, Greece, 26–29 June 2022; pp. 1–5. [Google Scholar] [CrossRef]
  18. Banerjee, D.; Bhowal, P.; Malakar, S.; Cuevas, E.; Pérez-Cisneros, M.; Sarkar, R. Z-Transform-Based Profile Matching to Develop a Learning-Free Keyword Spotting Method for Handwritten Document Images. Int. J. Comput. Intell. Syst. 2022, 15, 93. [Google Scholar] [CrossRef]
  19. Krishnan, P.; Dutta, K.; Jawahar, C. HWNet v3: A joint embedding framework for recognition and retrieval of handwritten text. Int. J. Doc. Anal. Recognit. (IJDAR) 2023, 26, 401–417. [Google Scholar] [CrossRef]
  20. Vidal, E.; Toselli, A.; Puigcerver, J. Lexicon-based probabilistic indexing of handwritten text images. Neural Comput. Appl. 2023, 35, 17501–17520. [Google Scholar] [CrossRef]
  21. Wolf, F.; Fink, G.A. Self-training for handwritten word recognition and retrieval. Int. J. Doc. Anal. Recognit. (IJDAR) 2024, 27, 225–244. [Google Scholar] [CrossRef]
  22. Matos, A.; Almeida, P.; Correia, P.; Pacheco, O. iForal: Automated Handwritten Text Transcription for Historical Medieval Manuscripts. J. Imaging 2025, 11, 36. [Google Scholar] [CrossRef] [PubMed]
  23. Khamekhem Jemni, S.; Ammar, S.; Souibgui, M.A.; Kessentini, Y.; Cheddad, A. ST-KeyS: Self-supervised Transformer for Keyword Spotting in historical handwritten documents. Pattern Recognit. 2026, 170, 112036. [Google Scholar] [CrossRef]
  24. Rusiñol, M.; Aldavert, D.; Toledo, R.; Llados, J. Browsing Heterogeneous Document Collections by a Segmentation-Free Word Spotting Method. In Proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR), Beijing, China, 18–21 September 2011; pp. 63–67. [Google Scholar]
  25. Almazán, J.; Gordo, A.; Fornes, A.; Valveny, E. Efficient Exemplar Word Spotting. In Proceedings of the 23rd British Machine Vision Conference (BMVC), Guildford, UK, 3–7 September 2012; pp. 67.1–67.11. [Google Scholar]
  26. Kovalchuk, A.; Wolf, L.; Dershowitz, N. A Simple and Fast Word Spotting Method. In Proceedings of the 14th International Conference on Frontiers in Handwriting Recognition (ICFHR), Crete, Greece, 1–4 September 2014; pp. 3–8. [Google Scholar]
  27. Rothacker, L.; Fink, G.A. Segmentation-free Query-by-String Word Spotting with Bag-of-Features HMMs. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 661–665. [Google Scholar]
  28. Wilkinson, T.; Lindström, J.; Brun, A. Neural Ctrl-F: Segmentation-Free Query-by-String Word Spotting in Handwritten Manuscript Collections. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4443–4452. [Google Scholar] [CrossRef]
  29. Zhao, P.; Xue, W.; Li, Q.; Cai, S. Query by Strings and Return Ranking Word Regions with Only One Look. In Proceedings of the Asian Conference on Computer Vision (ACCV), Kyoto, Japan, 30 November–4 December 2020; Ishikawa, H., Liu, C.L., Pajdla, T., Shi, J., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2021; Volume 12622, pp. 3–18. [Google Scholar] [CrossRef]
  30. Wilkinson, T.; Nettelblad, C. Bootstrapping Weakly Supervised Segmentation-free Word Spotting through HMM-based Alignment. In Proceedings of the 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), Dortmund, Germany, 8–10 September 2020; pp. 49–54. [Google Scholar] [CrossRef]
  31. Prabhakar, C. Segmentation-Free Word Spotting in Handwritten Documents Using Scale Space Co-HoG Feature Descriptors. In Applications of Advanced Machine Intelligence in Computer Vision and Object Recognition: Emerging Research and Opportunities; Chakraborty, S., Mali, K., Eds.; IGI Global: Hershey, PA, USA, 2020; pp. 219–247. [Google Scholar] [CrossRef]
  32. Das, S.; Mandal, S. Segmentation-free word spotting in historical Bangla handwritten document using Wave Kernel Signature. Pattern Anal. Appl. 2020, 23, 593–610. [Google Scholar] [CrossRef]
  33. Retsinas, G.; Sfikas, G.; Nikou, C. Keyword Spotting Simplified: A Segmentation-Free Approach Using Character Counting and CTC Re-scoring. In Proceedings of the International Conference on Document Analysis and Recognition, San Jose, CA, USA, 21–26 August 2023; Springer: Cham, Switzerland, 2023; pp. 446–464. [Google Scholar]
  34. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the Workshop at the International Conference on Learning Representations (ICLR), Scottsdale, AZ, USA, 2–4 May 2013. [Google Scholar]
  35. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. [Google Scholar] [CrossRef]
  36. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  37. Li, M.; Lv, T.; Chen, J.; Cui, L.; Lu, Y.; Florencio, D.; Zhang, C.; Li, Z.; Wei, F. TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 13094–13102. [Google Scholar]
  38. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3982–3992. [Google Scholar] [CrossRef]
  39. Leydier, Y.; Ouji, A.; LeBourgeois, F.; Emptoz, H. Towards an Omnilingual Word Retrieval System for Ancient Manuscripts. Pattern Recognit. 2009, 42, 2089–2105. [Google Scholar] [CrossRef]
  40. Zhang, X.; Tan, C.L. Handwritten word image matching based on Heat Kernel Signature. Pattern Recognit. 2015, 48, 3346–3356. [Google Scholar] [CrossRef]
  41. Gatos, B.; Pratikakis, I. Segmentation-free word spotting in historical printed documents. In Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR), Barcelona, Spain, 26–29 July 2009; pp. 271–275. [Google Scholar]
  42. Rothacker, L.; Rusiñol, M.; Fink, G.A. Bag-of-Features HMMs for Segmentation-Free Word Spotting in Handwritten Documents. In Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), Washington, DC, USA, 25–28 August 2013; pp. 1305–1309. [Google Scholar]
  43. Almazán, J.; Gordo, A.; Fornes, A.; Valveny, E. Segmentation-free word spotting with exemplar SVMs. Pattern Recognit. 2014, 47, 3967–3978. [Google Scholar] [CrossRef]
  44. Riba, P.; Llados, J.; Fornes, A. Handwritten Word Spotting by Inexact Matching of Grapheme Graphs. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 781–785. [Google Scholar]
  45. Zagoris, K.; Pratikakis, I.; Gatos, B. Unsupervised Word Spotting in Historical Handwritten Document Images Using Document-Oriented Local Features. IEEE Trans. Image Process. 2017, 26, 4032–4041. [Google Scholar] [CrossRef]
  46. Ghosh, S.K.; Valveny, E. R-PHOC: Segmentation-Free Word Spotting Using CNN. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 801–806. [Google Scholar] [CrossRef]
  47. Wilkinson, T.; Brun, A. Semantic and Verbatim Word Spotting Using Deep Neural Networks. In Proceedings of the 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), Shenzhen, China, 23–26 October 2016; pp. 307–312. [Google Scholar] [CrossRef]
  48. Rothacker, L.; Sudholt, S.; Rusakov, E.; Kasperidus, M.; Fink, G.A. Word Hypotheses for Segmentation-Free Word Spotting in Historic Document Images. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 1174–1179. [Google Scholar] [CrossRef]
  49. Tüselmann, O.; Fink, G.A. Exploring semantic word representations for recognition-free NLP on handwritten document images. In Proceedings of the International Conference on Document Analysis and Recognition, San Jose, CA, USA, 21–26 August 2023; Springer: Cham, Switzerland, 2023; pp. 85–100. [Google Scholar]
  50. He, B. Rocchio’s Formula. In Encyclopedia of Database Systems; Springer: New York, NY, USA, 2009; p. 2447. [Google Scholar]
  51. Rusiñol, M.; Llados, J. Boosting the handwritten word spotting experience by including the user in the loop. Pattern Recognit. 2014, 47, 1063–1072. [Google Scholar] [CrossRef]
  52. Wolf, F.; Oberdiek, P.; Fink, G. Exploring Confidence Measures for Word Spotting in Heterogeneous Datasets. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, NSW, Australia, 20–25 September 2019; pp. 583–588. [Google Scholar] [CrossRef]
  53. Nara, R.; Yamaguchi, S.; Ito, K.; Yoshie, O. Revisiting Relevance Feedback for CLIP-Based Interactive Image Retrieval. In Proceedings of the Computer Vision–ECCV 2024 Workshops, Milan, Italy, 29 September–4 October 2024; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15639, pp. 3–20. [Google Scholar] [CrossRef]
  54. Ghosh, S.; Valveny, E. A Sliding Window Framework for Word Spotting Based on Word Attributes. In Proceedings of the 7th Iberian Conference on Pattern Recognition and Image Analysis (PRAI), Santiago de Compostela, Spain, 17–19 June 2015; pp. 652–661. [Google Scholar]
  55. Shekhar, R.; Jawahar, C. Word Image Retrieval Using Bag of Visual Words. In Proceedings of the 10th IAPR International Workshop on Document Analysis Systems (DAS), Gold Coast, QLD, Australia, 27–29 March 2012; pp. 297–301. [Google Scholar]
  56. Vats, E.; Hast, A.; Fornés, A. Training-Free and Segmentation-Free Word Spotting using Feature Matching and Query Expansion. In Proceedings of the 15th International Conference on Document Analysis and Recognition (ICDAR), Sydney, NSW, Australia, 20–25 September 2019; pp. 1294–1299. [Google Scholar] [CrossRef]
  57. Chuang, Y.S.; Fang, W.; Li, S.W.; Yih, W.-t.; Glass, J. Expand, Rerank, and Retrieve: Query Reranking for Open-Domain Question Answering. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 12131–12147. [Google Scholar]
  58. Ou, W.; Huynh, V.N. Conditional variational autoencoder for query expansion in ad-hoc information retrieval. Inf. Sci. 2024, 652, 119764. [Google Scholar] [CrossRef]
  59. Djoudi, K.; Alimazighi, Z.; Hedjazi, B.D. Information retrieval with query expansion and re-ranking: A survey. In Proceedings of the 2nd International Conference on Emerging Trends and Applications in Artificial Intelligence (ICETAI 2024), Baghdad, Iraq, 2–3 October 2024; The Institution of Engineering and Technology: Stevenage, UK, 2025; Volume 2024, pp. 114–119. [Google Scholar] [CrossRef]
  60. Tüselmann, O.; Wolf, F.; Fink, G.A. Identifying and tackling key challenges in semantic word spotting. In Proceedings of the 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), Dortmund, Germany, 8–10 September 2020; pp. 55–60. [Google Scholar]
  61. Krishnan, P.; Jawahar, C. Bringing Semantics in Word Image Retrieval. In Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), Washington, DC, USA, 25–28 August 2013; pp. 733–737. [Google Scholar]
  62. Gordo, A.; Almazán, J.; Murray, N.; Perronin, F. LEWIS: Latent Embeddings for Word Images and Their Semantics. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1242–1250. [Google Scholar]
  63. Miller, G.A. WordNet: A lexical database for English. Commun. ACM 1995, 38, 39–41. [Google Scholar] [CrossRef]
  64. Sahlgren, M. The distributional hypothesis. Ital. J. Linguist. 2008, 20, 33–53. [Google Scholar]
  65. Krishnan, P.; Jawahar, C. Bringing semantics into word image representation. Pattern Recognit. 2020, 108, 107542. [Google Scholar] [CrossRef]
  66. Washington, G. The Writings of George Washington from the Original Manuscript Sources, 1745–1799; Fitzpatrick, J., Ed.; U.S. Government Printing Office: Washington, DC, USA, 1931.
  67. Song, K.; Tan, X.; Qin, T.; Lu, J.; Liu, T.Y. MPNet: Masked and Permuted Pre-training for Language Understanding. Adv. Neural Inf. Process. Syst. 2020, 33, 16857–16867. [Google Scholar]
  68. Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. Adv. Neural Inf. Process. Syst. 2020, 33, 5776–5788. [Google Scholar]
  69. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  70. Wang, W.; Xie, E.; Li, X.; Hou, W.; Lu, T.; Yu, G.; Shao, S. Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9336–9345. [Google Scholar]
  71. Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  72. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  73. Tian, J.; Yan, B.; Yu, J.; Weng, C.; Yu, D.; Watanabe, S. Bayes Risk CTC: Controllable CTC Alignment in Sequence-to-Sequence Tasks. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023; Available online: https://openreview.net/forum?id=Bd7GueaTxUz (accessed on 19 June 2025).
  74. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  75. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
  76. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 4 May 2021; Available online: https://openreview.net/forum?id=YicbFdNTTy (accessed on 13 May 2025).
  77. Lavrenko, V.; Rath, T.M.; Manmatha, R. Holistic word recognition for handwritten historical documents. In Proceedings of the 1st International Workshop on Document Image Analysis for Libraries, Palo Alto, CA, USA, 23–24 January 2004; pp. 278–287. [Google Scholar]
  78. Marti, U.V.; Bunke, H. The IAM-database: An English sentence database for offline handwriting recognition. Int. J. Doc. Anal. Recognit. 2002, 5, 39–46. [Google Scholar] [CrossRef]
  79. Washington, G. George Washington Papers, Series 2, Letterbooks 1754–1799: Letterbook 1, Aug. 11, 1754–Dec. 25, 1755; 1755; pp. 270–279, 300–309. Manuscript/Mixed Material. Available online: https://www.loc.gov/item/mgw2.001/ (accessed on 30 April 2025).
  80. Model Card: TrOCR. Available online: https://huggingface.co/microsoft/trocr-base-handwritten (accessed on 28 May 2025).
  81. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019; Available online: https://openreview.net/forum?id=Bkg6RiCqY7 (accessed on 15 May 2025).
  82. SentenceTransformers v5.0. Available online: https://www.sbert.net/ (accessed on 28 May 2025).
  83. Model Card: Stsb-Roberta-Base. Available online: https://huggingface.co/sentence-transformers/stsb-roberta-base (accessed on 28 May 2025).
  84. Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; Specia, L. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, BC, Canada, 3–4 August 2017; Bethard, S., Carpuat, M., Apidianaki, M., Mohammad, S.M., Cer, D., Jurgens, D., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 1–14. [Google Scholar] [CrossRef]
  85. Model Card: All-Mpnet-Base-v2. Available online: https://huggingface.co/sentence-transformers/all-mpnet-base-v2 (accessed on 28 May 2025).
  86. Model Card: All-MiniLM-L12-v2. Available online: https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 (accessed on 28 May 2025).
  87. Dey, S.; Nicolaou, A.; Lladós, J.; Pal, U. Evaluation of word spotting under improper segmentation scenario. Int. J. Doc. Anal. Recognit. (IJDAR) 2019, 22, 361–374. [Google Scholar] [CrossRef]
  88. Saad-Falcon, J.; Khattab, O.; Santhanam, K.; Florian, R.; Franz, M.; Roukos, S.; Sil, A.; Sultan, M.; Potts, C. UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 11265–11279. [Google Scholar] [CrossRef]
  89. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR, Virtual, 18–24 July 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  90. Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models. In Proceedings of the 40th International Conference on Machine Learning (ICML), ICML’23, Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 19730–19742. [Google Scholar]
  91. Da, C.; Luo, C.; Zheng, Q.; Yao, C. Vision Grid Transformer for Document Layout Analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 19462–19471. [Google Scholar]
Figure 1. Sometimes, relying solely on visual information can lead to ambiguity. (a) The presented word could be interpreted as “tcw”, “ton”, or “ten” based on visual cues alone. With word-level semantic context, “tcw” is rejected as meaningless. (b) With sentence-level semantic context, the word “ten” is correctly identified.
Figure 2. Proposed semantic relevance framework integrating LLM-based contextual similarities into the re-ranking of candidate word instances in segmentation-free KWS ranking lists.
Figure 3. The WordRetrievalNet architecture.
Figure 4. An overview of the KWS-Simplified network.
Figure 5. The post-processing stages of the KWS-Simplified pipeline.
Figure 6. mAP@25 and mAP@50 curves for the WordRetrievalNet baseline paired with the weighted combination strategy reflecting the results from Table 3 on the GW (top) and IAM (bottom) benchmarks.
Figure 7. mAP@25 and mAP@50 curves for the KWS-Simplified baseline paired with the weighted combination strategy reflecting the results from Table 4 on the GW (top) and IAM (bottom) benchmarks.
Figure 8. mAP@25 and mAP@50 curves for the semantic pruning approach corresponding to Table 5 on GW, for the WordRetrievalNet (top) and KWS-Simplified (bottom) baselines, respectively.
Figure 9. Top ten ranked lists for the query “forgot”: verbatim (purple), semantic (yellow), and combined (blue) ranking. The rightmost part of each snippet represents corresponding similarities.
Figure 10. Top 21 ranked lists for the query “soldiers”: verbatim (purple), semantic (yellow), and combined (blue) ranking. The rightmost part of each snippet represents corresponding similarities.
Table 1. The partition of GW used in our experiments.

Fold No. | Page IDs Across Each Fold
1 | 274, 309, 276, 272, 303
2 | 306, 273, 301, 300, 278
3 | 270, 302, 277, 275, 308
4 | 307, 304, 279, 271, 305
Table 2. Comparison of reproduced and reported mAP scores and their standard deviations for the two baseline models, WordRetrievalNet and KWS-Simplified, on the GW dataset.

Model | mAP@25 | mAP@50
WordRetrievalNet (Reproduced) | 94.31 ± 1.8 | 88.29 ± 4.0
WordRetrievalNet (Reported) | 96.46 | 94.06
KWS-Simplified (Reproduced) | 89.74 ± 0.7 | 72.29 ± 3.0
KWS-Simplified (Reported) | 91.6 | 66.4
Table 3. Mean Average Precision of the WordRetrievalNet backbone with the weighted combination strategy across semantic importance thresholds and embeddings. For each dataset, the first column pair uses the KWS-Simplified decoder and the second the TrOCR decoder.

Semantic LLM: all-MiniLM-L12-v2
Weight | GW KWS-dec mAP@25 | GW KWS-dec mAP@50 | GW TrOCR mAP@25 | GW TrOCR mAP@50 | IAM KWS-dec mAP@25 | IAM KWS-dec mAP@50 | IAM TrOCR mAP@25 | IAM TrOCR mAP@50
0.0 * | 94.31 ± 1.8 | 88.29 ± 4.0 | 94.31 ± 1.8 | 88.29 ± 4.0 | 79.15 | 72.85 | 79.15 | 72.85
0.1 | 95.80 ± 1.5 | 89.51 ± 3.8 | 94.51 ± 1.8 | 88.43 ± 4.0 | 80.60 | 73.98 | 82.04 | 75.40
0.2 | 96.10 ± 1.5 | 89.75 ± 3.8 | 93.96 ± 2.0 | 87.84 ± 4.0 | 79.29 | 72.62 | 80.59 | 73.77
0.3 | 96.30 ± 1.3 | 89.79 ± 3.7 | 93.20 ± 2.1 | 87.07 ± 3.8 | 75.87 | 69.30 | 77.68 | 71.05
0.4 | 96.04 ± 1.5 | 89.64 ± 3.6 | 91.71 ± 2.5 | 85.73 ± 3.9 | 71.36 | 65.13 | 74.04 | 67.66
0.5 | 95.34 ± 1.4 | 89.01 ± 3.5 | 89.05 ± 2.9 | 83.19 ± 3.8 | 65.95 | 60.19 | 70.68 | 64.57
0.6 | 94.39 ± 1.5 | 88.20 ± 3.5 | 84.82 ± 4.2 | 79.41 ± 4.2 | 60.26 | 54.99 | 67.17 | 61.33
0.7 | 93.25 ± 1.6 | 87.37 ± 3.5 | 79.04 ± 4.8 | 74.03 ± 3.7 | 55.95 | 50.99 | 64.47 | 58.88
0.8 | 91.92 ± 1.9 | 86.26 ± 3.5 | 73.12 ± 6.1 | 68.42 ± 4.5 | 53.12 | 48.38 | 62.88 | 57.42
0.9 | 90.93 ± 2.0 | 85.42 ± 3.6 | 69.68 ± 6.6 | 65.20 ± 4.8 | 51.46 | 46.88 | 62.22 | 56.83
1.0 | 89.83 ± 2.2 | 84.28 ± 3.7 | 64.58 ± 7.5 | 60.07 ± 6.0 | 47.96 | 43.71 | 54.34 | 49.48

Semantic LLM: all-mpnet-base-v2 (columns as above)
0.0 * | 94.31 ± 1.8 | 88.29 ± 4.0 | 94.31 ± 1.8 | 88.29 ± 4.0 | 79.15 | 72.85 | 79.15 | 72.85
0.1 | 95.90 ± 1.4 | 89.67 ± 3.8 | 94.68 ± 1.6 | 88.62 ± 3.9 | 80.84 | 74.18 | 82.12 | 75.43
0.2 | 96.27 ± 1.3 | 89.86 ± 3.8 | 94.09 ± 1.7 | 87.98 ± 3.8 | 80.01 | 73.21 | 80.77 | 74.02
0.3 | 96.50 ± 1.4 | 89.97 ± 4.0 | 93.07 ± 2.1 | 87.02 ± 3.7 | 77.55 | 70.82 | 77.87 | 71.29
0.4 | 96.56 ± 1.5 | 90.08 ± 3.9 | 91.78 ± 2.4 | 85.85 ± 3.7 | 73.86 | 67.22 | 74.73 | 68.33
0.5 | 96.00 ± 1.3 | 89.62 ± 3.6 | 89.42 ± 2.9 | 83.85 ± 3.8 | 69.24 | 62.95 | 71.23 | 65.12
0.6 | 95.04 ± 1.2 | 88.82 ± 3.5 | 85.13 ± 3.6 | 79.84 ± 3.8 | 63.70 | 57.97 | 67.84 | 61.99
0.7 | 93.83 ± 1.6 | 87.86 ± 3.4 | 79.10 ± 5.3 | 74.10 ± 4.1 | 58.18 | 52.97 | 65.43 | 59.78
0.8 | 92.57 ± 1.7 | 86.71 ± 3.4 | 74.06 ± 6.2 | 69.30 ± 4.6 | 53.92 | 49.10 | 63.80 | 58.25
0.9 | 91.53 ± 1.8 | 85.84 ± 3.6 | 70.50 ± 6.4 | 65.90 ± 4.6 | 51.43 | 46.89 | 62.78 | 57.36
1.0 | 90.21 ± 2.1 | 84.45 ± 3.7 | 64.91 ± 7.5 | 60.31 ± 6.0 | 49.26 | 44.84 | 54.99 | 50.11

Semantic LLM: stsb-roberta-base (columns as above)
0.0 * | 94.31 ± 1.8 | 88.29 ± 4.0 | 94.31 ± 1.8 | 88.29 ± 4.0 | 79.15 | 72.85 | 79.15 | 72.85
0.1 | 95.83 ± 1.6 | 89.60 ± 3.9 | 94.58 ± 2.0 | 88.55 ± 4.2 | 81.16 | 74.49 | 81.88 | 75.18
0.2 | 96.19 ± 1.3 | 89.82 ± 3.7 | 94.01 ± 2.1 | 87.91 ± 4.2 | 80.39 | 73.55 | 80.97 | 74.22
0.3 | 96.59 ± 1.3 | 90.17 ± 3.7 | 92.87 ± 2.9 | 86.92 ± 4.5 | 77.40 | 70.68 | 77.51 | 70.90
0.4 | 96.44 ± 1.6 | 89.99 ± 3.7 | 90.75 ± 3.2 | 84.96 ± 4.4 | 73.54 | 66.94 | 74.29 | 67.87
0.5 | 96.32 ± 1.6 | 89.90 ± 3.7 | 87.91 ± 3.5 | 82.32 ± 4.3 | 69.45 | 63.25 | 71.28 | 65.11
0.6 | 96.04 ± 1.7 | 89.57 ± 3.8 | 84.15 ± 4.3 | 78.83 ± 4.9 | 64.83 | 59.05 | 68.70 | 62.85
0.7 | 95.27 ± 1.7 | 88.91 ± 4.0 | 80.37 ± 4.4 | 75.25 ± 4.7 | 61.28 | 55.73 | 66.71 | 61.03
0.8 | 94.37 ± 1.8 | 88.08 ± 3.9 | 77.09 ± 4.4 | 72.16 ± 4.3 | 58.23 | 53.05 | 65.16 | 59.66
0.9 | 93.54 ± 2.0 | 87.46 ± 4.0 | 74.05 ± 4.8 | 69.22 ± 4.1 | 56.22 | 51.22 | 64.13 | 58.74
1.0 | 92.32 ± 2.2 | 86.19 ± 3.9 | 67.98 ± 5.4 | 63.34 ± 4.3 | 53.99 | 49.11 | 56.03 | 51.24

* This is essentially the baseline model. It does not use a decoder.
Table 4. Mean Average Precision of the KWS-Simplified backbone with the weighted combination strategy across semantic importance thresholds and embeddings. For each dataset, the first column pair uses the KWS-Simplified decoder and the second the TrOCR decoder.

Semantic LLM: all-MiniLM-L12-v2
Weight | GW KWS-dec mAP@25 | GW KWS-dec mAP@50 | GW TrOCR mAP@25 | GW TrOCR mAP@50 | IAM KWS-dec mAP@25 | IAM KWS-dec mAP@50 | IAM TrOCR mAP@25 | IAM TrOCR mAP@50
0.0 * | 89.74 ± 0.7 | 72.29 ± 3.0 | 89.74 ± 0.7 | 72.29 ± 3.0 | 86.40 | 63.73 | 86.40 | 63.73
0.1 | 90.62 ± 0.5 | 72.77 ± 3.0 | 90.27 ± 0.8 | 72.53 ± 3.1 | 86.71 | 63.86 | 87.49 | 64.22
0.2 | 90.68 ± 0.4 | 72.82 ± 3.0 | 90.38 ± 0.7 | 72.57 ± 3.1 | 86.79 | 63.94 | 87.84 | 64.47
0.3 | 90.70 ± 0.5 | 72.84 ± 3.0 | 90.35 ± 0.7 | 72.55 ± 3.2 | 86.57 | 63.72 | 88.01 | 64.58
0.4 | 90.72 ± 0.5 | 72.84 ± 3.0 | 90.30 ± 0.7 | 72.51 ± 3.2 | 86.11 | 63.36 | 88.06 | 64.60
0.5 | 90.79 ± 0.4 | 72.90 ± 3.0 | 90.32 ± 0.6 | 72.55 ± 3.2 | 85.65 | 63.01 | 87.74 | 64.43
0.6 | 90.85 ± 0.5 | 72.93 ± 3.0 | 90.32 ± 0.6 | 72.54 ± 3.2 | 84.66 | 62.38 | 87.28 | 64.17
0.7 | 90.91 ± 0.5 | 72.97 ± 3.0 | 90.26 ± 0.7 | 72.51 ± 3.2 | 83.32 | 61.47 | 86.30 | 63.56
0.8 | 90.83 ± 0.4 | 72.90 ± 2.9 | 90.12 ± 0.6 | 72.42 ± 3.1 | 81.31 | 60.20 | 84.66 | 62.49
0.9 | 90.82 ± 0.4 | 72.89 ± 2.9 | 89.93 ± 0.6 | 72.29 ± 3.0 | 77.61 | 57.85 | 81.51 | 60.47
1.0 | 90.43 ± 0.7 | 72.73 ± 2.7 | 87.16 ± 0.6 | 70.39 ± 2.1 | 71.97 | 54.39 | 76.74 | 57.24

Semantic LLM: all-mpnet-base-v2 (columns as above)
0.0 * | 89.74 ± 0.7 | 72.29 ± 3.0 | 89.74 ± 0.7 | 72.29 ± 3.0 | 86.40 | 63.73 | 86.40 | 63.73
0.1 | 90.63 ± 0.5 | 72.78 ± 3.0 | 90.22 ± 0.7 | 72.49 ± 3.1 | 86.69 | 63.88 | 87.53 | 64.25
0.2 | 90.69 ± 0.5 | 72.83 ± 3.0 | 90.31 ± 0.7 | 72.51 ± 3.1 | 86.70 | 63.87 | 87.89 | 64.51
0.3 | 90.72 ± 0.5 | 72.84 ± 3.0 | 90.29 ± 0.6 | 72.49 ± 3.1 | 86.56 | 63.75 | 88.16 | 64.80
0.4 | 90.81 ± 0.5 | 72.91 ± 3.1 | 90.32 ± 0.7 | 72.54 ± 3.1 | 86.32 | 63.59 | 88.25 | 64.88
0.5 | 90.86 ± 0.5 | 72.96 ± 3.1 | 90.29 ± 0.7 | 72.53 ± 3.2 | 85.91 | 63.33 | 87.97 | 64.69
0.6 | 90.87 ± 0.4 | 72.94 ± 3.0 | 90.26 ± 0.6 | 72.49 ± 3.1 | 85.03 | 62.71 | 87.52 | 64.37
0.7 | 90.94 ± 0.5 | 72.97 ± 3.1 | 90.10 ± 0.7 | 72.39 ± 3.1 | 84.15 | 62.10 | 86.60 | 63.77
0.8 | 90.91 ± 0.5 | 72.96 ± 3.0 | 90.06 ± 0.6 | 72.35 ± 3.0 | 82.56 | 61.09 | 85.07 | 62.78
0.9 | 90.85 ± 0.4 | 72.92 ± 3.0 | 89.74 ± 0.4 | 72.09 ± 2.9 | 79.47 | 59.19 | 81.97 | 60.86
1.0 | 90.47 ± 0.5 | 72.78 ± 2.8 | 87.06 ± 0.7 | 70.23 ± 2.1 | 74.04 | 55.89 | 77.30 | 57.78

Semantic LLM: stsb-roberta-base (columns as above)
0.0 * | 89.74 ± 0.7 | 72.29 ± 3.0 | 89.74 ± 0.7 | 72.29 ± 3.0 | 86.40 | 63.73 | 86.40 | 63.73
0.1 | 90.62 ± 0.5 | 72.78 ± 3.0 | 90.21 ± 0.6 | 72.46 ± 3.0 | 86.58 | 63.82 | 87.35 | 64.13
0.2 | 90.63 ± 0.5 | 72.79 ± 3.0 | 90.27 ± 0.6 | 72.47 ± 3.0 | 86.78 | 63.85 | 87.78 | 64.52
0.3 | 90.66 ± 0.5 | 72.80 ± 3.0 | 90.27 ± 0.6 | 72.47 ± 3.1 | 86.72 | 63.74 | 88.04 | 64.72
0.4 | 90.76 ± 0.5 | 72.88 ± 3.0 | 90.32 ± 0.6 | 72.52 ± 3.1 | 86.45 | 63.59 | 88.12 | 64.84
0.5 | 90.89 ± 0.5 | 72.97 ± 3.1 | 90.33 ± 0.5 | 72.54 ± 3.1 | 85.87 | 63.29 | 87.62 | 64.60
0.6 | 90.90 ± 0.4 | 72.97 ± 3.1 | 90.30 ± 0.4 | 72.50 ± 3.0 | 85.09 | 62.79 | 86.78 | 64.16
0.7 | 90.92 ± 0.4 | 72.97 ± 3.1 | 90.13 ± 0.4 | 72.41 ± 2.9 | 83.82 | 61.99 | 85.44 | 63.32
0.8 | 90.86 ± 0.4 | 72.92 ± 3.0 | 89.96 ± 0.5 | 72.28 ± 2.9 | 81.96 | 60.79 | 83.06 | 61.79
0.9 | 90.74 ± 0.4 | 72.86 ± 3.0 | 89.75 ± 0.5 | 72.12 ± 2.8 | 79.45 | 59.12 | 80.17 | 59.88
1.0 | 90.33 ± 0.5 | 72.71 ± 2.8 | 87.08 ± 0.9 | 70.24 ± 2.0 | 76.10 | 57.18 | 76.49 | 57.40

* This is essentially the baseline model. It does not use a decoder.
Table 5. mAP comparison of semantic pruning pipelines on the GW dataset. For each backbone (WordRetrievalNet, KWS-Simplified), the first column pair uses the KWS-Simplified decoder and the second the TrOCR decoder.

Semantic LLM: all-MiniLM-L12-v2
Threshold | WRN KWS-dec mAP@25 | WRN KWS-dec mAP@50 | WRN TrOCR mAP@25 | WRN TrOCR mAP@50 | KWS-S KWS-dec mAP@25 | KWS-S KWS-dec mAP@50 | KWS-S TrOCR mAP@25 | KWS-S TrOCR mAP@50
0.1 | 94.25 ± 1.7 | 88.24 ± 3.9 | 94.16 ± 1.7 | 88.20 ± 4.1 | 89.63 ± 0.8 | 72.18 ± 3.1 | 89.50 ± 0.9 | 72.05 ± 3.0
0.3 | 92.02 ± 2.3 | 86.32 ± 3.7 | 77.32 ± 6.3 | 71.77 ± 4.3 | 87.52 ± 1.2 | 70.26 ± 3.7 | 71.46 ± 5.1 | 56.03 ± 3.9
0.5 | 89.87 ± 2.3 | 84.52 ± 3.7 | 67.59 ± 6.7 | 63.15 ± 4.8 | 86.53 ± 1.6 | 69.69 ± 4.1 | 63.25 ± 6.1 | 49.08 ± 4.3
0.7 | 88.58 ± 2.6 | 83.68 ± 3.8 | 63.25 ± 6.3 | 59.34 ± 4.7 | 85.62 ± 2.2 | 69.01 ± 4.7 | 59.35 ± 5.9 | 45.98 ± 4.3
0.9 | 87.81 ± 2.7 | 83.06 ± 3.8 | 60.10 ± 6.6 | 56.45 ± 5.3 | 85.24 ± 2.3 | 68.71 ± 4.7 | 56.53 ± 5.9 | 43.64 ± 4.4

Semantic LLM: all-mpnet-base-v2 (columns as above)
0.1 | 94.29 ± 1.8 | 88.28 ± 4.0 | 93.96 ± 2.1 | 87.97 ± 4.2 | 89.72 ± 0.7 | 72.27 ± 3.0 | 89.34 ± 0.9 | 71.90 ± 2.9
0.3 | 91.74 ± 2.0 | 86.21 ± 3.4 | 76.40 ± 5.8 | 71.46 ± 4.0 | 87.91 ± 1.4 | 70.75 ± 3.8 | 71.68 ± 4.9 | 56.63 ± 3.4
0.5 | 89.51 ± 2.2 | 84.27 ± 3.7 | 66.96 ± 7.0 | 62.76 ± 5.4 | 86.46 ± 2.0 | 69.72 ± 4.5 | 62.64 ± 5.7 | 48.51 ± 4.4
0.7 | 88.50 ± 2.3 | 83.51 ± 3.8 | 63.11 ± 6.2 | 59.20 ± 4.7 | 85.72 ± 1.9 | 69.04 ± 4.5 | 59.04 ± 5.8 | 45.68 ± 4.0
0.9 | 87.50 ± 2.9 | 82.76 ± 4.0 | 59.70 ± 6.8 | 56.05 ± 5.4 | 85.24 ± 2.5 | 68.71 ± 4.8 | 56.15 ± 6.0 | 43.26 ± 4.4

Semantic LLM: stsb-roberta-base (columns as above)
0.1 | 94.33 ± 1.8 | 88.32 ± 4.0 | 93.49 ± 1.9 | 87.45 ± 4.2 | 89.74 ± 0.7 | 72.29 ± 3.0 | 88.96 ± 1.1 | 71.52 ± 3.2
0.3 | 94.32 ± 2.1 | 88.33 ± 4.2 | 85.81 ± 3.4 | 80.42 ± 4.3 | 89.05 ± 1.2 | 71.73 ± 3.4 | 81.33 ± 2.7 | 64.75 ± 3.0
0.5 | 92.95 ± 2.0 | 87.08 ± 3.7 | 73.83 ± 4.6 | 69.09 ± 4.0 | 88.00 ± 1.6 | 70.90 ± 4.1 | 69.24 ± 3.8 | 54.77 ± 3.5
0.7 | 90.72 ± 2.5 | 85.36 ± 3.7 | 65.99 ± 5.3 | 62.04 ± 4.0 | 86.28 ± 2.4 | 69.62 ± 4.7 | 62.33 ± 4.6 | 48.99 ± 3.3
0.9 | 87.82 ± 2.9 | 83.02 ± 3.9 | 60.61 ± 6.2 | 56.93 ± 4.9 | 85.35 ± 2.4 | 68.83 ± 4.8 | 57.18 ± 5.4 | 44.19 ± 3.9
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
