Article

Efficient Object-Related Scene Text Grouping Pipeline for Visual Scene Analysis in Large-Scale Investigative Data

by Enrique Shinohara *, Jorge García, Luis Unzueta and Peter Leškovský
Department of Intelligent Security Video Analytics, Vicomtech, 20009 Donostia-San Sebastián, Spain
* Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 12; https://doi.org/10.3390/electronics15010012
Submission received: 15 October 2025 / Revised: 10 December 2025 / Accepted: 16 December 2025 / Published: 19 December 2025
(This article belongs to the Special Issue Deep Learning-Based Scene Text Detection)

Abstract

Law Enforcement Agencies (LEAs) typically analyse vast collections of media files, extracting visual information that helps them to advance investigations. While recent advancements in deep learning-based computer vision algorithms have revolutionised the ability to detect multi-class objects and text instances (characters, words, numbers) from in-the-wild scenes, their association remains relatively unexplored. Previous studies focus on clustering text given its semantic relationship or layout, rather than its relationship with objects. In this paper, we present an efficient, modular pipeline for contextual scene text grouping with three complementary strategies: 2D planar segmentation, multi-class instance segmentation and promptable segmentation. The strategies address common scenes where related text instances frequently share the same 2D planar surface and object (vehicle, banner, etc.). Evaluated on a custom dataset of 1100 images, the overall grouping performance remained consistently high across all three strategies (B-Cubed F1 92–95%; Pairwise F1 80–82%), with adjusted Rand indices between 0.08 and 0.23. Our results demonstrate clear trade-offs between computational efficiency and contextual generalisation, where geometric methods offer reliability, semantic approaches provide scalability and class-agnostic strategies offer the most robust generalisation. The dataset used for testing will be made available upon request.

1. Introduction

Scene understanding is a key focus in computer vision. In order to interpret visual information from a scene, machines must rely on a set of core concepts. These include object recognition, which involves localising and identifying objects, and spatial-relationship analysis, which determines how objects are positioned relative to one another. A key concept here, often referred to as ‘contextual awareness’, is the ability to gain a better understanding of objects based on their relationships with their environment and with each other.
In this sense, the presence of text in natural scenes is a key element for achieving a richer, more granular level of scene understanding. Direct transcription of text from images through a process known as Optical Character Recognition (OCR) enables a wide range of applications, including the automated analysis of CCTV footage, autonomous navigation and enhanced information retrieval, which is vital for Law Enforcement Agencies (LEAs) when conducting investigations and gathering actionable intelligence. Recent advances in deep learning have dramatically improved the accuracy of algorithms focused on Scene Text Understanding [1]. In this context, modern Scene Text Recognition (STR) models [2,3,4,5] have made remarkable progress in detecting and transcribing text in the wild. This advancement has been driven in large part by community benchmarks and competitions [6,7,8], which continually push the state of the art towards higher accuracy and robustness.
However, STR techniques tend to focus on word-level detection. This approach often fails to consider the ‘bigger picture’, treating detected text instances as individual pieces of information. For example, the company name and phone number on a delivery truck are semantically linked because they belong to the same entity. Similarly, text on a protest sign or storefront banner, as shown in Figure 1, contains crucial contextual information that distinguishes it from other text. The ability to group text and associate text instances that belong to an object is a crucial next step in enriching scene comprehension beyond just detecting and recognising text.
Scene text understanding can be conceptualised as a hierarchical pipeline, as shown in Figure 2. At the low level, individual words or lines are detected and recognised, forming the foundational elements of the analysis. Mid-level contextual analysis interprets spatial and structural relationships between these text instances, with traditional hierarchical layout analysis focusing on reading order. High-level contextual analysis then interprets these clusters semantically, streamlining the agents’ work as they conduct their investigations. The modular pipeline for scene text understanding presented herein operates at the mid-level of physical layout analysis.
While most existing methods deal with contextual awareness to improve word-level accuracy, they do not usually address the challenge of grouping detected text into object-aware clusters. Some approaches use clustering based on surrounding visual cues to handle blurry or occluded words, and recent works such as CUTE [9] and DRFormer [10] group text units into coherent reading sequences using graph structures. Long et al. [11] group text lines into document-style “paragraphs” with affinity matrices. Ye et al. [12] created Hi-SAM, a foundation model that excels in segmentation across four hierarchies: pixel-level text, word, text line, and paragraph. The relevance of grouping words into a readable order with semantic coherence is undeniable, yet these approaches do not consider the importance of associating words by the physical space they share, which has a direct impact on investigative applications. Given the example of keyword search in LEA investigations, this lack of object-level contextual grouping limits the possibility of extracting critical textual information related to the case.
To address this problem, in this study, we develop a pipeline for grouping text in the wild using three complementary strategies. The first strategy uses a deep learning model that estimates planar surfaces in an image and groups text instances that lie on the same geometric plane. The second strategy involves the use of a zero-shot instance segmentation model to group detected texts based on their relation to a specific segmented object. Finally, the third strategy, instead of using a 2D plane or object-oriented association, implements a prompt-based, class-agnostic, zero-shot segmentation model to shift focus to regions of unique semantic understanding.
Our goal is to evaluate how well each of these methods enables contextual grouping of text in natural scenes, offering concrete insights into their relative advantages and disadvantages.
To better support law enforcement and forensic work, accurate object-related grouping of scene text is desirable to reliably link names, plate numbers and other relevant information with the physical entities on which they appear. Accordingly, we make the following main contributions:
  • An efficient, modular pipeline for object-related scene text grouping that establishes a reproducible baseline for detection, segmentation and grouping evaluation.
  • A comparative study of three strategies that support the pipeline’s text grouping module: planar estimation, zero-shot instance segmentation and class-agnostic promptable segmentation.
  • A curated dataset sampled from COCOTextv2 [13] and ArT [14], manually annotated with object-level text groupings. To be provided upon request.

2. Related Work

2.1. Text Grouping Strategies

The task of grouping text elements within a single image, often referred to as scene text clustering or layout analysis, is a critical step for comprehensive image understanding. This process goes beyond simple text detection and recognition by enriching the analysis with contextual information. The key challenge here lies in correctly associating multiple text instances as a cluster belonging to the same object.
Traditional approaches to this problem typically follow a bottom-up, hierarchical clustering methodology. These methods usually begin by detecting atomic text elements, such as characters or words, and then agglomerating them into larger groups, such as lines or paragraphs. The grouping logic is frequently based on a set of hand-crafted rules: alignment consistency, spacing, colour similarity, stroke width and font size [15,16]. While effective for well-structured layouts, these heuristic-based methods can be brittle and struggle with the complex, arbitrary layouts found in many real-world scenes.
To overcome these issues, the development of methods that learn the association rules from data was proposed. A significant advancement in this area came from the application of deep metric learning [17]. Inspired by techniques in person re-identification, this paradigm focuses on training a model to transform detected text blocks into a high-dimensional embedding space. The model is optimised in a way that the distance between the embeddings of text blocks belonging to the same logical group is minimised, while the distance between those from different groups is maximised. Clustering then becomes a simple and efficient nearest-neighbour search in this latent space.
A particularly powerful and flexible paradigm for modelling the structural relationships between text elements is the use of graph-based methods [18,19]. In this approach, an image layout is represented as a graph where detected text blocks are nodes and their spatial relationships, like proximity, alignment and overlap, are encoded as edges. Graph Neural Networks (GNNs) are a natural fit for this representation, as they can reason over the graph structure to learn the underlying layout rules implicitly. By passing messages between nodes, GNNs enrich each text block’s feature representation with contextual information from its neighbours, leading to more robust clustering. For documents with nested structures, such as words within lines and lines within paragraphs, hierarchical GNNs offer a more sophisticated solution. Models like Hi-LANDER [20] learn to perform clustering in a bottom-up fashion, iteratively merging nodes into “super-nodes” that represent higher levels of the layout hierarchy. Crucially, the criteria for merging nodes are not predefined but are learned from a supervised training set, allowing the model to discover the natural hierarchical structure of a document in a data-driven manner.
Today’s state of the art has moved toward fully unified, end-to-end models, often based on Transformer architectures, that jointly perform text detection and layout clustering. A prominent example is the work of Long et al. [11], which extends a panoptic segmentation framework to output both text-instance masks and a learned pairwise affinity matrix. Each entry in this matrix encodes the probability that two text instances belong to the same group. During inference, a simple threshold and union-find algorithm recover the final clusters, eliminating separate post-processing stages and drastically reducing error accumulation. Other recent efforts have continued this trend toward unified modelling. Bi et al. [21] introduced the Text Grouping Adapter, which adapts pre-trained text detectors for joint layout and grouping tasks through lightweight adapter modules, effectively bridging detection and structural analysis. Long et al. [22] proposed a Hierarchical Text Spotter that performs text detection, recognition and layout clustering within a single Transformer backbone using multi-level feature fusion. More recently, Ye et al. [12] presented Hi-SAM, which integrates the Segment Anything Model (SAM) for hierarchical text segmentation, demonstrating how prompt-based vision transformers can generalise to scene-text grouping.
While previous works have focused on grouping scene text using visual layout or learned heuristic affinities [11], these approaches often fail to capture an essential part of natural scenes: the physical surface on which the text appears. In many real-world contexts, such as signs, flags, or vehicles, text is distributed across irregular layouts yet belongs to the same unit defined by a shared physical surface. Our work tries to study and address this by comparing grouping strategies based on planar surface association and object segmentation.

2.2. Planar Surface Estimation

Recognising planar surfaces in a scene is a well-studied problem in computer vision. Early methods such as RANSAC [23], region growing [24] and Hough Transform techniques were widely used. These approaches operate by sampling minimal sets of points to generate plane hypotheses, which are then grown or refined. While these classical plane extraction methods laid important foundations, they generally require dense and accurate 3D data from stereo vision or Light Detection and Ranging (LiDAR) technologies, while also being computationally expensive. Moreover, they operate purely on geometric features, lacking object semantics and remaining unaware of the scene context.
Deep learning methods emerged to overcome these rigidities and scalability issues, bringing better generalisation, semantic reasoning and the ability to work directly with RGB-only inputs. The first end-to-end neural method, PlaneNet [25], predicts a set of plane parameters and segmentation masks from a single image. PlaneNet learns to estimate planar depth maps without relying on explicit 3D data, outperforming prior baselines in planar segmentation. PlaneRCNN [26] built upon Mask R-CNN to detect and reconstruct an arbitrary number of planes from an image. More recently, PlaneRecNet [27] introduced multi-task learning and cross-task consistency to jointly predict planes, depth and segmentation, showing improved performance in complex indoor and outdoor scenes.
These methods demonstrate that deep networks have become reliable tools for identifying surfaces of interest, such as building walls, tabletops and billboards. We leverage this insight by using a planar reconstruction network to discover scene surfaces and group text lying on the same planar surface (like a billboard or vehicle side).

2.3. Object-Related Image Segmentation

Image segmentation refers to the task of assigning every pixel in an image to a visually meaningful area, facilitating tasks such as scene understanding, medical diagnosis and autonomous navigation. Over the past decade, segmentation has progressed from purely pixel-oriented methods to sophisticated, object-aware, instance-level segmentation.
Early semantic segmentation relied on labelling each pixel independently, using sliding windows or handcrafted features [28,29]. The introduction of Fully Convolutional Networks (FCNs) [30] marked a new paradigm, replacing manual pipelines with end-to-end trainable models.
Moving beyond independent pixel labelling, later models added context using deeper backbones (e.g., ResNet) and pyramid pooling to handle a more varied set of images. DeepLab [31] and PSPNet [32], for instance, employed atrous convolutions and pyramid pooling to capture context at multiple scales simultaneously (akin to viewing a scene through binoculars and a wide-angle lens at once), improving semantic segmentation performance. Instance segmentation further advanced the field by separating multiple objects of the same class. Mask R-CNN [33] added a parallel mask head to Faster R-CNN, becoming more object-related and richer in contextual reasoning. Similarly, the YOLO [34] family provides real-time instance segmentation with a strong balance of accuracy and speed. Transformer-based models such as Deformable DETR [35] and RT-DETR [36] further advance instance and panoptic segmentation (unifying semantic and instance objectives within a single framework) by predicting object masks end-to-end with attention mechanisms, offering strong generalisation and robustness in complex scenes.
Lately, the Segment Anything Model (SAM) [37] and SAM2 [38] demonstrated that, when trained on billions of masks, a promptable and zero-shot model is capable of segmenting anything without further fine-tuning, enabling class-agnostic generalisation for diverse real-world objects.

3. Materials and Methods

At the core of this study, we present a pipeline that processes an image as input and outputs text instances grouped in an object-aware manner. The pipeline was designed to be modular, allowing the central grouping component to be instantiated by one of three alternative strategies. We evaluate these three distinct strategies to establish their specific operational trade-offs. This deliberate, human-in-the-loop approach aims to provide Law Enforcement Agents (LEAs) with evidence-based guidelines for selecting the most appropriate strategy, whether prioritising geometric reliability, computational efficiency, or broad generalisation, depending on the specific constraints of their investigative scenario.
The pipeline, as illustrated in Figure 3, begins with a word-level text detection stage that is common across all strategies. This stage is responsible for detecting and extracting individual text instances in the wild. In this case, we employ a state-of-the-art scene text detector, FAST [39], renowned for its faster inference speed and high accuracy compared to its counterparts. The output of this initial stage is a set of K word-level bounding boxes, denoted $B = \{B_{W,1}, B_{W,2}, \dots, B_{W,K}\}$, where each $B_{W,k}$ is defined by its coordinates. This set $B$ serves as the common input for all subsequent user-selected grouping strategies.
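To make the interface between the detection stage and the grouping strategies concrete, the following minimal Python sketch shows one possible representation of the word-box set B and the common grouping interface; the function names detect_words and group_words are hypothetical placeholders and do not correspond to the actual FAST implementation.

# Minimal sketch of the data exchanged between pipeline stages (illustrative only).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class WordBox:
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def centroid(self) -> Tuple[float, float]:
        return ((self.x1 + self.x2) / 2.0, (self.y1 + self.y2) / 2.0)

def detect_words(image) -> List[WordBox]:
    """Placeholder for the FAST text detector: returns the K word-level boxes B."""
    raise NotImplementedError

def group_words(image, boxes: List[WordBox]) -> List[int]:
    """Common interface of all grouping strategies: one group id per word box."""
    raise NotImplementedError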

3.1. Strategy 1: Planar Surface Grouping

This strategy is based on the idea that text instances appearing on the same physical plane are contextually related and should belong to the same group. This is a common assumption in man-made environments, where text is frequently displayed on flat surfaces such as building façades, billboards, vehicle panels and signs. To implement this strategy, we utilise PlaneRecNet [27], a multi-task learning framework designed for piecewise 2D planar estimation from a single RGB image. In our experiments, we use the official pre-trained weights and the default hyperparameter configuration as specified in the original publication. PlaneRecNet predicts planar segmentation masks, depth maps and plane parameters.
The process followed for this strategy is as follows: the input image is first processed by the pre-trained planar estimation model. This step generates a set of N planar instance segmentation masks, denoted as $M_P = \{M_{P,1}, M_{P,2}, \dots, M_{P,N}\}$, where each mask corresponds to a distinct planar surface detected in the scene. While overlaps are rare, we account for the possibility that the estimator may occasionally return partially overlapping masks. For each word bounding box $B_{W,k}$ produced by the initial text detection stage, we compute its spatial overlap with every generated planar mask $M_{P,i}$. To do this, the system first identifies candidate masks $M_{candidates}$ that contain the centroid of $B_{W,k}$. Then, for each candidate mask $M_{P,i}$, we calculate the pixel overlap score $O(B_{W,k}, M_{P,i})$, defined as the total count of intersecting pixels within the bounding box area. Following Algorithm 1, the text instance is associated with the mask that maximises this score:
$A[k] \leftarrow \arg\max_{M_{P,i} \in M_{candidates}} O(B_{W,k}, M_{P,i})$
Finally, all words grouped with the same planar mask $M_{P,i}$ are assigned to a common contextual group, $G_i$. Words whose centroids are not contained within any detected planar mask are placed in a common residual group of unassociated text instances.
This approach offers a purely geometric method for text grouping, independent of the semantic content of the text or the object it appears on, providing a strong baseline for comparison.
Algorithm 1 Clustering method of bounding-box to mask
Require: Segmentation masks $M$, word bounding boxes $B$, original image size $O_{size}$
Ensure: Association vector $A$, where $A[i]$ is the index of the mask linked to $B_i$ (or −1 if none)
 1: Initialise $A \leftarrow -1$ for all boxes $B_i$
 2: Re-scale $B$ coordinates to match the resolution of $M$
 3: for each bounding box $B_i \in B$ do
 4:   Compute centroid $(c_x, c_y)$ and map to mask coordinates $(m_x, m_y)$
 5:   Find candidate masks $M_{idx}$ that contain $(m_x, m_y)$
 6:   if $M_{idx}$ is empty then
 7:     continue
 8:   end if
 9:   for each candidate mask $M_j \in M_{idx}$ do
10:     Compute overlap score $s_j$ (pixel intersection within ROI)
11:   end for
12:   Assign $A[i] \leftarrow \arg\max_j(s_j)$ if $s_j > t$
13: end for
14: Words sharing the same mask index in $A$ are grouped contextually
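For illustration, a minimal Python sketch of Algorithm 1 is given below, assuming the planar masks are boolean NumPy arrays and the word boxes have already been rescaled to the mask resolution; the function name and default threshold are illustrative.

# Minimal sketch of Algorithm 1: associate each word box with the best-overlapping mask.
import numpy as np

def associate_boxes_to_masks(masks, boxes, threshold=0):
    """Return A, where A[i] is the index of the mask linked to box i (or -1 if none)."""
    A = [-1] * len(boxes)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
        # Candidate masks are those containing the box centroid.
        candidates = [j for j, m in enumerate(masks) if m[cy, cx]]
        if not candidates:
            continue
        # Overlap score: number of mask pixels inside the bounding-box ROI.
        scores = [masks[j][int(y1):int(y2), int(x1):int(x2)].sum() for j in candidates]
        best = int(np.argmax(scores))
        if scores[best] > threshold:
            A[i] = candidates[best]
    return A

# Boxes sharing the same mask index in A form one contextual group;
# boxes left at -1 fall into the residual (unassociated) group.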

3.2. Strategy 2: Instance Segmentation Grouping

The second strategy shifts from a geometric to a semantic understanding of the scene. For this, we use a multi-class instance segmentation model, YOLOv11 [40]. We employ the model pre-trained on the COCO dataset and perform inference using the default hyperparameter configuration provided by the Ultralytics framework, without additional fine-tuning.
First, the input image is processed by the pre-trained instance segmentation model. In this case, the YOLOv11 model is trained on the COCO dataset [41], which includes 80 common object categories such as ‘car’, ‘bus’, ‘truck’ and ‘stop sign’. The output is a set of J instance segmentation masks, $M_I = \{M_{I,1}, M_{I,2}, \dots, M_{I,J}\}$, each with a predicted class label and a confidence score.
Following the mask association approach of the first strategy (Algorithm 1), we associate each detected word $B_{W,k}$ with the predicted instance mask $M_{I,j}$ that maximises the pixel overlap score $O(B_{W,k}, M_{I,j})$ among candidate masks. All words associated with the same instance mask $M_{I,j}$ are assigned to the same contextual group, $G_j$.
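As an illustration of how the instance masks for this strategy can be obtained, the following hedged sketch uses the Ultralytics API with a COCO-pretrained YOLOv11 segmentation checkpoint; the exact checkpoint name and confidence value are assumptions, and the returned masks are then matched to word boxes with Algorithm 1.

# Hedged sketch of Strategy 2 mask extraction (checkpoint name is a placeholder).
from ultralytics import YOLO

model = YOLO("yolo11x-seg.pt")  # COCO-pretrained instance segmentation weights

def instance_masks(image_path: str, conf: float = 0.25):
    """Run instance segmentation and return boolean masks with class labels."""
    result = model(image_path, conf=conf)[0]
    if result.masks is None:
        return []
    masks = result.masks.data.cpu().numpy() > 0.5             # (J, H, W) boolean masks
    labels = [model.names[int(c)] for c in result.boxes.cls]  # class name per mask
    return list(zip(masks, labels))

# Each returned mask is matched to the word boxes with the same
# centroid-plus-overlap rule of Algorithm 1 (see the planar strategy sketch).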
The effectiveness of this strategy depends heavily on the variety and use-case relevance of the object classes in the model's training dataset. While this approach proves robust for common objects represented in the training data, its performance on new or unusual object categories is expected to drop.

3.3. Strategy 3: Prompt-Based Segmentation Grouping

Our third strategy uses a foundation model for segmentation, specifically the SAM2 [38]. We use the official pre-trained model checkpoint and the default inference parameters defined in the original implementation. Unlike traditional segmentation models that were trained to recognise a fixed set of classes, SAM2 is a promptable, class-agnostic system designed to generate a segmentation mask for any probable object in the scene. This allows for better generalisation to objects and scenes not encountered during its training.
In this approach, we prompt SAM2 with the four corner points of each word’s bounding box (Figure 4) as positive point prompts; in doing so, each word generates a mask highly related to its underlying surface (Figure 5). The spatial relationship between these masks is then used to predict the groups they belong to (Figure 6). With this method, we try to harness SAM2’s capabilities of understanding “objects” to better extract the physical surfaces on which text resides, regardless of the object class.
The input image $I$ is first processed by a Vision Transformer (ViT-H) encoder to produce a single image embedding $E_{img}$ that encapsulates the visual information of the entire scene. This encoding is performed only once per image, lowering the computational cost of inference, with bounding boxes used later as prompts.
For each word bounding box $B_{W,k}$, its four corner points are extracted as positive point prompts $P_{prompts}$ and passed to the SAM decoder. This process is further detailed in Algorithm 2 and generates a word-specific segmentation mask $M_{W,k}$:
$M_{W,k} \leftarrow \mathrm{MaskDecoder}(E_{img}, P_{prompts}, L_{labels})$
Algorithm 2 SAM2 Prompt Engineering
Require: Input image $I$, set of detected word bounding boxes $B_W$
Ensure: Set of word-specific segmentation masks $M_W = \{M_{W,1}, M_{W,2}, \dots, M_{W,K}\}$
 1: Load SAM2 model with ViT-H backbone
 2: Generate the image embedding $E_{img}$ for the whole image once
 3: for each bounding box $B_{W,k} \in B_W$ do
 4:   Extract corner coordinates from $B_{W,k}$:
 5:     $p_{top\_left} \leftarrow (x_{tl}, y_{tl})$
 6:     $p_{top\_right} \leftarrow (x_{tr}, y_{tr})$
 7:     $p_{bottom\_right} \leftarrow (x_{br}, y_{br})$
 8:     $p_{bottom\_left} \leftarrow (x_{bl}, y_{bl})$
 9:   Construct prompt:
10:     $P_{prompts} \leftarrow [p_{top\_left}, p_{top\_right}, p_{bottom\_right}, p_{bottom\_left}]$
11:     $L_{labels} \leftarrow [1, 1, 1, 1]$
12:   Inference:
13:     $M_{W,k} \leftarrow \mathrm{MaskDecoder}(E_{img}, P_{prompts}, L_{labels})$
14:     Append $M_{W,k}$ to $M_W$
15: end for
16: Return $M_W$
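A hedged Python sketch of Algorithm 2 using the publicly released SAM2 image predictor is shown below; the configuration and checkpoint file names are placeholders, and the sketch assumes word boxes in (x1, y1, x2, y2) image coordinates.

# Hedged sketch of Algorithm 2 with the SAM2 image predictor (file names are placeholders).
import numpy as np
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "sam2_hiera_large.pt"))

def word_masks(image: np.ndarray, word_boxes):
    """Prompt SAM2 with the four corners of each word box; collect one mask per word."""
    predictor.set_image(image)                    # image embedding computed once
    masks = []
    for (x1, y1, x2, y2) in word_boxes:
        points = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]], dtype=np.float32)
        labels = np.ones(4, dtype=np.int32)       # all four corners are positive prompts
        m, _, _ = predictor.predict(point_coords=points,
                                    point_labels=labels,
                                    multimask_output=False)
        masks.append(m[0].astype(bool))           # (H, W) boolean mask for this word
    return masks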
The decoder combines the prompt embedding with the previously computed image embedding to generate a word-specific segmentation mask, resulting in a set of K masks, $M_W = \{M_{W,1}, M_{W,2}, \dots, M_{W,K}\}$. To determine the associations, we use Algorithm 3. Two word-specific masks, $M_{W,i}$ and $M_{W,j}$, are clustered into the same group if their spatial overlap, measured by the Intersection over Union (IoU), exceeds a predefined threshold $t$:
$IoU(M_{W,i}, M_{W,j}) > t$
Based on the sensitivity analysis (Section 4.5), the overlap threshold t is set to 0.015 to best merge neighbouring text instances belonging to the same object.
This strategy stands out for its class-agnostic nature and generalisation capacity, making it particularly promising for the unpredictable and diverse images encountered in investigative work.
Algorithm 3 Clustering method of masks by overlap
Require: Word masks $M_W$, overlap threshold $t$ (e.g., 0.01)
Ensure: Association list $A$, where $A[i]$ is the group index of mask $M_i$
 1: Initialise $A \leftarrow -1$ for all masks
 2: for each mask $M_i \in M_W$ do
 3:   if $M_i$ already assigned then
 4:     continue
 5:   end if
 6:   Assign new group index $I$ to $M_i$
 7:   for each other mask $M_j \in M_W$ do
 8:     if $M_j$ already assigned then
 9:       continue
10:     end if
11:     if $IoU(M_i, M_j) \geq t$ then
12:       Assign $M_j$ to group $I$
13:     end if
14:   end for
15: end for
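A compact Python sketch of Algorithm 3, assuming boolean NumPy masks and a configurable threshold t, is given below; the greedy single-pass grouping mirrors the pseudocode above.

# Minimal sketch of Algorithm 3: greedy grouping of word masks by pairwise IoU.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def cluster_masks(masks, t=0.015):
    """Return A, where A[i] is the group index assigned to mask i."""
    A = [-1] * len(masks)
    group = 0
    for i, m_i in enumerate(masks):
        if A[i] != -1:
            continue
        A[i] = group
        for j in range(i + 1, len(masks)):
            if A[j] == -1 and iou(m_i, masks[j]) >= t:
                A[j] = group
        group += 1
    return A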

4. Results

In this section, we first describe the experimental setup, covering the dataset and evaluation metrics, and then present an analysis of the quantitative and qualitative results.

4.1. Dataset

The evaluations were conducted on a custom dataset of 1164 images; this dataset allows us to evaluate the feasibility of object-related text grouping and establish initial baselines for larger-scale studies. The dataset was drawn partly from the COCO-Text [13] dataset and partly from the ArT [14] dataset, selecting challenging and representative images of scenarios containing text in the wild. The images feature complex scenes with multiple text instances, including vehicles with commercial livery, storefronts with intricate signage, signs, billboards and so on. The selection criteria prioritised images where multiple text instances appear on the same object or planar surface, ensuring a focus on contextual text grouping. Additional curation rules included high subjective scene complexity, text density and diversity of object types, capturing real cases of object-related text instances.
The dataset contains annotations of each text bounding box and the group it belongs to. We annotate groups at the object level: for a set of text instances in an image, text instances share the same ground-truth group if they are printed on, painted on, attached to, or otherwise physically integrated into the same physical object or the same contiguous planar (or smoothly curved) surface of an object, regardless of their textual content, font, size, alignment, or line breaks.
Furthermore, the dataset also contains the transcription of each text instance; while the transcriptions are not directly used for the grouping evaluation in this study, they provide an opportunity to further explore improved grouping methods that also take the semantic meaning of the text into account.

4.2. Experimentation Setup

The experiments were conducted on a server with the following configuration: Ubuntu 22.04.3 LTS, CUDA v11.5 and an Intel Xeon Gold 6230 processor. The system is equipped with 755 GB of RAM, with a peak registered RAM usage of 6.79 GB during our experiments. The server also features NVIDIA Tesla V100 GPUs, each with 32 GB of VRAM, with a peak GPU memory usage of 17.6 GB.

4.3. Evaluation Metrics

The main problem of evaluating a pipeline with multiple stages is the potential for error propagation. Poor performance in the first stage, in this case, text detection, can unfairly penalise an efficient grouping stage. In this regard, the analysis stages were split to allow for initial assessment of the text detection phase, followed by the evaluation of the contextual text grouping methods, resulting in an unbiased analysis of the grouping task.
In the first stage, we quantify the performance of FAST [39], our selected text detection model, by analysing the precision, recall and F1 metrics on our custom dataset. A detected bounding box is considered a True Positive (TP) if its Intersection over Union (IoU) with a ground truth bounding box is greater than or equal to 0.5. A detected box that fails to meet this threshold with any ground truth box is considered a False Positive (FP), and a ground truth box for which no detection meets the threshold is considered a False Negative (FN). The choice of an IoU threshold of 0.5 was determined empirically through systematic experimentation. While lower thresholds (e.g., 0.3) may tolerate imprecise localisation and include loosely aligned detections, higher thresholds (e.g., 0.7) tend to penalise correct detections that are slightly misaligned due to perspective distortion or annotation inconsistencies. Through iterative evaluation, an IoU of 0.5 provided an optimal balance between precision and recall, ensuring that detected text regions meaningfully overlap with the corresponding ground-truth boxes while providing reliable input for the grouping stage. This threshold is also consistent with common practice in object detection research, where IoU values of 0.5 or above are typically considered a correct prediction. We therefore use $IoU \geq 0.5$ as the inclusion criterion to define a correct estimation of a text bounding box.
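The following sketch illustrates this matching procedure with a simple greedy assignment at IoU ≥ 0.5; it is a simplification for clarity, not the exact evaluation script used in the study.

# Illustrative TP/FP/FN counting via greedy IoU matching (simplified).
def box_iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def match_detections(detections, ground_truth, thr=0.5):
    matched_gt, tp = set(), 0
    for det in detections:
        best_j, best_iou = None, thr
        for j, gt in enumerate(ground_truth):
            if j in matched_gt:
                continue
            overlap = box_iou(det, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched_gt.add(best_j)
            tp += 1
    fp = len(detections) - tp       # unmatched detections
    fn = len(ground_truth) - tp     # unmatched ground-truth boxes
    return tp, fp, fn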
The second and more critical stage of the evaluation addresses the performance of the grouping strategies. To ensure a fair comparison in which the results are not tainted by detection errors, the metrics are calculated exclusively on detected words that correctly match a ground-truth instance. In doing so, we focus on measuring the capability of each strategy to correctly associate words. We use three complementary sets of metrics commonly used for clustering evaluation.
Group Ratio: We first use a straightforward metric, the Group Ratio, to provide supplementary insight into the overall variation fidelity of the clustering algorithms. The Group Ratio is defined as the ratio of the number of predicted groups $N_{predicted}$ to the number of ground-truth groups $N_{GT}$ in the matched word set, quantifying the fidelity of the algorithm in estimating the number of groups:
$\mathrm{Group\ Ratio} = \frac{N_{predicted}}{N_{GT}}$
We choose this metric because it directly quantifies the tendency of a grouping strategy towards over-segmentation or under-segmentation. By calculating the Group Ratio, we aim to quickly assess whether a model is accurately determining the number of distinct object-related text clusters. An ideal Group Ratio is 1.0, indicating that the predicted clustering produced the same number of groups as the ground truth. A ratio significantly below 1.0 indicates under-segmentation, meaning the model merges distinct ground-truth groups together, whereas a ratio above 1.0 indicates over-segmentation, meaning the model splits ground-truth groups apart.
Precision, Recall and F1-Score: These metrics treat the clustering problem as a series of pairwise decisions. For any two matched words, the task is to decide whether they belong in the same group. The difficulty is that computing this metric requires counting every possible item pair, and when the ground truth contains one or more large clusters, the number of positive pairs grows quadratically. Any wrong association within such a large cluster produces a huge number of false negatives, which drives the recall very low even though many of the groupings may be correctly predicted. To make this metric less misleading, we also use the B-Cubed algorithm [42]. The algorithm compares a predicted clustering to a ground-truth clustering: for each element, the predicted and ground-truth clusters containing that element are compared, and the mean over all elements is taken:
$\mathrm{Precision} = \frac{1}{|\mathrm{elements}|}\sum_{i=1}^{n}\frac{(\text{count of element}_i)^2}{\text{count of all elements in cluster}}$
$\mathrm{Recall} = \frac{1}{|\mathrm{elements}|}\sum_{i=1}^{n}\frac{(\text{count of element}_i)^2}{\text{count of total elements from this category}}$
$F_{\mathrm{score}} = \frac{1}{k}\sum_{i=1}^{n}\frac{2\,\mathrm{Precision}(C)_k\,\mathrm{Recall}(C)_k}{\mathrm{Precision}(C)_k+\mathrm{Recall}(C)_k}$
Here, $n$ denotes the number of categories in the cluster and $k$ is the number of predicted clusters. $\mathrm{Precision}(C)_k$ and $\mathrm{Recall}(C)_k$ are the ‘partial’ precision and recall values for each cluster. This way, it is possible to provide a fine-grained analysis of grouping errors.
B-Cubed F1 scores measure internal grouping consistency or local purity. A high B-Cubed score suggests that when the model assigns two bounding boxes to the same group, they are usually from the same ground-truth group (high precision) and that many individual boxes are correctly assigned (high B-Cubed recall).
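For reference, a minimal per-element implementation of the B-Cubed scores is sketched below; it follows the standard item-level formulation and may differ in averaging details from the cluster-wise formulas above.

# Hedged sketch of item-level B-Cubed precision/recall/F1.
def bcubed(pred_labels, gt_labels):
    """pred_labels[i], gt_labels[i]: cluster ids of matched word i."""
    n = len(pred_labels)
    precision = recall = 0.0
    for i in range(n):
        same_pred = {j for j in range(n) if pred_labels[j] == pred_labels[i]}
        same_gt = {j for j in range(n) if gt_labels[j] == gt_labels[i]}
        correct = len(same_pred & same_gt)          # items sharing both clusters with i
        precision += correct / len(same_pred)
        recall += correct / len(same_gt)
    precision, recall = precision / n, recall / n
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1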
Adjusted Rand Index (ARI) [43]: To measure the quality of agreement between two clusterings, the Rand Index (RI) [44] was first introduced as an effective metric for evaluating how well a model can estimate clusters. It is calculated using the contingency table of the two classifications and estimates the likelihood that a randomly selected pair of instances is “coherent”, meaning that they were placed in the same cluster in both partitions or in different clusters in both partitions. Nevertheless, this approach introduces the problem that the RI’s raw value heavily depends on the number of clusters, making it, at times, hard to interpret. The Adjusted Rand Index (ARI) fixes this by correcting the RI for chance. The ARI subtracts the expected RI under an assumption of independent (random) clustering and rescales the result. This adjustment makes the ARI a more robust and interpretable measure of clustering similarity, particularly when the number of clusters differs between the two partitions.
$ARI = \frac{RI - \mathrm{Expected\_RI}}{\max(RI) - \mathrm{Expected\_RI}}$
The ARI score ranges from −0.5 to 1.0. A score of 1.0 signifies perfect agreement between the predicted and ground-truth groupings. A score of 0.0 indicates that the clustering performance is no better than random chance. Negative scores indicate that the clustering is worse than random. This score is highly effective for a top-level comparison of the overall clustering predicted by each strategy. It measures the overall structural coherence and highlights variations in the global group assignments.
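In practice, the ARI can be computed with an off-the-shelf implementation; a minimal usage example with scikit-learn, assuming group labels aligned per matched word, is shown below.

# Example of computing ARI with scikit-learn (toy labels for illustration).
from sklearn.metrics import adjusted_rand_score

gt_groups   = [0, 0, 1, 1, 2]   # toy ground-truth group ids
pred_groups = [0, 0, 1, 2, 2]   # toy predicted group ids
print(adjusted_rand_score(gt_groups, pred_groups))  # 1.0 = perfect, ~0.0 = random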

4.4. Quantitative Results

Before evaluating the grouping strategies, it is crucial to quantify the performance of the text detection module. The accuracy of this initial stage directly impacts the maximum achievable performance of the entire pipeline, since undetected words cannot be grouped. Table 1 summarises the performance of the FAST [39] detector on our curated dataset.
The results in Table 1 validate the use of a strong text detection model, FAST. With a recall of 0.71, the detector successfully identifies 71% of the ground-truth text instances, meaning that 29% of words are missed from the outset (False Negatives). The precision of 0.90 indicates that the vast majority of detections are correct, with a smaller number of False Positives. This baseline makes clear that the subsequent grouping evaluation, performed only on correctly matched detections (True Positives), analyses the grouping strategies under near-ideal detection conditions.
The central results of this study are presented in Table 2. This table provides a comparison of the state-of-the-art Hi-SAM text grouping implementation and the grouping strategies presented in this study, i.e., Planar Surface Association, Zero-shot Multi-class Instance Segmentation Association and Prompt-Based Segmentation Association, using the grouping quality metrics defined previously. The metrics were calculated only on the set of words correctly identified by the text detector.
As a baseline for the evaluation, we include Hi-SAM, a recent state-of-the-art method that relates text via hierarchical relations (paragraphs, lines, words) rather than object association. In our experiments, we group words at the paragraph level, i.e., words are associated if they belong to the same predicted paragraph. In Table 2, Hi-SAM shows a B-Cubed Precision of 1.00. This perfect score shows that when the model associates any two bounding boxes, those boxes are almost always correctly linked according to the ground truth. This reflects Hi-SAM’s strength, rooted in its foundation in hierarchical text segmentation and layout analysis, which makes it effective at grouping based on semantic structure. However, this strategy reveals limitations when analysed through the metrics specific to object-related grouping. Hi-SAM has the lowest Group Ratio (60%) and the lowest B-Cubed Recall (65%) among all strategies. Since the study focuses on associating text instances based on the physical object they share, this low ratio indicates severe under-segmentation of the underlying physical objects carrying the text instances. This behaviour confirms that while strategies focused on semantic associations excel at finding cohesive groups, they struggle with the object-related grouping task. Despite the low performance measured through object-related metrics, Hi-SAM still achieves an Adjusted Rand Index (ARI) of 0.19, which is better than both the PlaneRecNet (ARI of 0.15) and the YOLOv11 (ARI of 0.08) strategies. This result suggests that models like Hi-SAM, being strong in learning relational structures such as semantics or layout, offer valuable capabilities that could improve object-related text grouping solutions if integrated with object-related context understanding.
The evaluation of the object-related grouping strategies of this study reveals consistent performance across the three grouping strategies. The B-Cubed metric, which captures both per-text precision and recall, indicates that all models achieve strong internal grouping consistency, with F1 values above 0.90. However, the ARI and pairwise metrics highlight greater variation in the overall clustering structure, suggesting that while local associations are often correct, some models struggle with global group assignments.
This makes sense because pairwise metrics operate on every possible object pair and therefore scale quadratically with cluster sizes: a single large ground-truth group that is split by the grouping algorithm produces a large number of combinations of pairwise false negatives. High B-Cubed performance combined with relatively low pairwise and ARI suggests that predicted groups are often locally pure—i.e., when the model puts two boxes together, they are usually from the same ground-truth group (high precision) and many individual boxes are placed correctly (high B-Cubed recall). However, the model also breaks large ground-truth groups into several smaller predicted clusters or misses some group members, which can heavily penalise pairwise recall and ARI, which account for every missing pair inside a ground-truth group.
A lightweight run-time benchmark was also conducted to assess the computational cost of each strategy on a NVIDIA Tesla V100. Average processing times per image are reported in Table 3.
The performance results in Table 3 confirm the expected trade-offs among the tested approaches. The semantic layout and class-agnostic segmentation methods, Hi-SAM and SAM2, while presenting the best results in terms of overall structural coherence (ARI of 0.19 and 0.23, respectively) and confirming their robustness in diverse and complex scenes, lack the efficiency required for high-speed, high-volume analysis. YOLOv11 maintains solid accuracy in predicting groups while offering the highest inference speed, making it suitable for large-scale data analysis. PlaneRecNet remains comparable in performance to the YOLOv11 object detection approach.

4.5. Parameter Sensitivity Analysis

Ideally, strong clustering performance of the SAM2 strategy is indicated by the Adjusted Rand Index (ARI) and other clustering metrics remaining steady across multiple IoU thresholds related to the generation of object masks. If an IoU change from 0.015 to 0.1 causes a significant drop in the evaluated metrics, it may suggest that the model’s reliance on very minimal mask overlap for associating texts represents a point of instability. Conversely, stable metrics demonstrate the reliability of using word bounding boxes as prompts to consistently generate masks for the underlying physical surface.
The sensitivity analysis in Table 4 shows that the SAM2 strategy maintains stable global grouping consistency (ARI of 0.23) across different IoU thresholds, while local grouping metrics (Pairwise F1 and B-Cubed F1) are more threshold-dependent. Lowering the IoU threshold from 0.1 to 0.015 increases both F1 scores (from 0.774 to 0.823 and from 0.919 to 0.951), primarily due to higher recall. These results support the selection of a low IoU threshold (t = 0.015) for the SAM2 strategy, where minimal IoU overlap best merges neighbouring text instances belonging to the same object. Despite fluctuations in F1 metrics, ARI remains almost unchanged, confirming the robustness and generalisation capacity of the approach.

4.6. Qualitative Results

Results of the qualitative analysis are shown in Figure 7, where the left column depicts a correct association of the text on the military plane. All the strategies consider the whole aeroplane as a single 2D planar surface or object. When dealing with a more complex perspective of highway signs (Figure 7, right), the planar estimation strategy treats both highway signs as belonging to the same 2D plane. While technically true, this leads to wrongly associating text that does not truly share the same object and represents distinct information. In the middle image, the zero-shot multi-class instance segmentation strategy clearly demonstrates the downside of relying on a fixed set of classes: the model correctly segments the cars but fails to recognise the highway sign as an object. Finally, the promptable class-agnostic strategy is the only one that correctly segments the regions corresponding to each text instance according to the object it belongs to.
Figure 8 further illustrates how differently the strategies behave. In the left column (the aeroplane image), the planar estimation model fails to identify the 2D planar surface of the wings. A correct association is obtained by chance: since the detected text is not contained within any detected 2D plane of the image, it is ultimately assigned to the remainder group of unassociated text instances. The instance segmentation model associates the text correctly, as it detects the aeroplane without issues. The class-agnostic promptable segmentation model, however, struggles: although it appears to segment the “object” behind the text correctly, it does not find a unified segmentation for the whole aeroplane and thus ends up associating each text instance with a different wing.
In the second example (right), the promptable segmentation model shows a similar behaviour: it groups text according to the sticker’s contour, but in doing so, it fails to associate text with the larger object, i.e., the bagel. The other two models, by contrast, correctly distinguish the separate food items.
To better understand the limitations of our text grouping strategies, we present a couple of real cases with representative failures of each grouping strategy. In Figure 9 (left), the 2D planar estimation algorithm mistakenly associates text instances that do not belong to the same object, i.e., the shop’s name on the building’s façade and the text on the advertising board. This occurs when multiple text instances lie on geometrically similar surfaces, such as separate signs aligned along the same perspective plane, even though these surfaces represent distinct functional objects. In Figure 9 (right), we observe that the planar estimation may fail to correctly estimate the 2D planar surface below the text instances, as is the case with the hanging billboard.
The multi-class instance segmentation, as depicted in Figure 10, struggles when faced with new or unusual object categories, i.e., the informational hanging signs and billboards, demonstrating its performance being limited by the fixed classes it was trained on.
In Figure 11, the class-agnostic, prompt-based approach demonstrates its limitations when it generates masks that correspond only to the immediate surface surrounding the text rather than the complete, unified object. This is the case for the plane example in Figure 11 (left) and the store name (right). In the plane example, the text instances are grouped in a semantically reasonable way (one word located near the motor and another near the cockpit), but this remains a poor association for object-level grouping: words printed on the same physical object are split into multiple masks, preventing a unified association. The store name is likewise split into two parts due to the occlusion and shadows caused by the traffic sign.

4.7. End-to-End Error Propagation Analysis

To assess how detection-module inaccuracies influence the final grouping quality, we ran an end-to-end simulation, injecting three types of detection errors (drop, merge, split) at varying rates (0.2 and 0.5) into the detected text instances; a minimal sketch of this perturbation follows the list below.
  • Drops: the detector misses some text instances (True Positives → False Negatives). This lowers the number of matched text instances and typically reduces recall-oriented grouping metrics.
  • Merge: the detector merges multiple neighbouring words/lines into one predicted group (over-merging). This hurts precision-related grouping metrics.
  • Split: the detector fragments a GT cluster into many small predicted clusters (fragmentation). This hurts both pairwise and clustering measures in different ways.
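The sketch below shows one way such perturbations can be injected, assuming word boxes in (x1, y1, x2, y2) format; the exact perturbation logic used in the study may differ.

# Hedged sketch of detection-error injection (drop / merge / split) at a given rate.
import random

def perturb_boxes(boxes, error="drop", rate=0.2, rng=None):
    rng = rng or random.Random(0)
    out, i = [], 0
    while i < len(boxes):
        b = boxes[i]
        if rng.random() < rate:
            if error == "drop":
                i += 1                      # miss this word entirely
                continue
            if error == "merge" and i + 1 < len(boxes):
                nb = boxes[i + 1]           # fuse two neighbouring words into one box
                b = (min(b[0], nb[0]), min(b[1], nb[1]),
                     max(b[2], nb[2]), max(b[3], nb[3]))
                i += 1                      # skip the neighbour that was absorbed
            elif error == "split":
                x1, y1, x2, y2 = b
                xm = (x1 + x2) / 2          # fragment one word into two half-boxes
                out.append((x1, y1, xm, y2))
                b = (xm, y1, x2, y2)
        out.append(b)
        i += 1
    return out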
Table 5 summarises how a degraded text detection module affects the downstream grouping metrics for the three evaluated strategies. Overall, drop errors have the mildest impact, maintaining relatively stable B-Cubed and pairwise F1-scores even at a 20% omission rate (PlaneRecNet: 0.93; YOLO: 0.92; SAM2: 0.95). Merge errors primarily degrade the Adjusted Rand Index (by up to −39%), indicating confusion in global cluster assignments despite locally consistent associations. Split errors affect the overall performance the most, reducing pairwise F1-scores by 60–65% at the higher rate, as text instances from the same object become fragmented across multiple groups. The end-to-end behaviour follows realistic degradation patterns and highlights the importance of robust text detection for stable grouping performance. Among the evaluated models, SAM2 shows the highest resilience to detection noise, followed by YOLO and PlaneRecNet.

5. Conclusions

This paper has presented a modular pipeline and a comparative study for the task of contextual scene text grouping, a critical task for improving the visual intelligence tools used by Law Enforcement Agents (LEAs). In contrast to prior grouping work focused on semantics or text layout, this study showcases the association of text instances based on the physical objects they share. The primary contribution is the evaluation of three distinct strategies that take different approaches to this problem: a geometric method using planar surface reconstruction, a zero-shot semantic instance-based method and a prompt-based, class-agnostic method using a vision transformer.
The principal findings show a clear and significant trade-off between computational efficiency, generalisation and grouping accuracy. The prompt-based strategy showed the best performance of the three, taking advantage of its zero-shot generalisation capabilities, which enable it to accurately group text on objects never seen during training. However, this accuracy comes at the cost of computational efficiency. On the other hand, the strategy based on the multi-class instance segmentation model surpasses the prompt-based strategy in terms of speed. Its performance is strong for common, predefined object categories but is fundamentally limited by the classes it was trained on, failing to generalise to unseen objects. The geometric approach served as the baseline, effective in structured environments and man-made contexts (storefronts, vehicles).
With this study, we further delve into the area of object-related text grouping by providing a practical framework for its application, showcasing that the optimal approach highly depends on the specific operational requirements for the visual scene analysis task. Our results advocate for a tiered workflow in investigative applications, where efficient instance segmentation models like YOLOv11 [40] are used for fast, large-scale data and powerful foundation models like SAM2 [38] can be used for detailed, accurate analysis of high-priority evidence. This work, along with the specialised dataset we have curated, aims to further push the research in contextual scene text understanding, paving the way for more potent and insightful visual intelligence systems that can effectively support law enforcement in an increasingly data-rich world.
Future work will focus on integrating object-related associations with semantic integrity. Once words have been clustered according to the object they appear on, semantic embeddings (e.g., from CLIP or BERT) can provide an additional layer of association that captures language rules. For example, two word-level text instances such as “pizza” and “delivery” located on the same vehicle surface are not only spatially related but also semantically coherent. By adding semantic similarity alongside geometric relations, the system could improve the robustness of text grouping. Additionally, we are curating a larger dataset with richer contextual information to ensure the benchmarking process remains as relevant and challenging as possible. We will also refine the third strategy by assessing whether the four input corner points are sufficient prompts or whether additional heuristics are needed to better identify the actual object on which the text is displayed, ultimately making the text extracted from visual scenes richer and more informative and providing LEAs with actionable insights in a fast and efficient manner. Finally, while the proposed modular pipeline offers flexibility with its three grouping strategies, the current implementation requires manual selection of the most appropriate strategy based on each case’s requirements. This human-in-the-loop approach ensures that those requirements are explicitly considered; however, future work will explore a hybrid solution that dynamically selects or combines strategies based on scene characteristics and task requirements.
The datasets employed in this study, such as COCO-Text and ArT19 (ICDAR Robust Reading Competition), are publicly available research benchmarks that primarily focus on the recognition of text instances in the wild and do not contain personally identifiable information. They are used solely for the scientific evaluation of novel methods in a laboratory environment. When increasing the technology readiness level of such systems and integrating them into real-world use cases, additional measures must be applied to be fully compliant with the regulations related to the collection and processing of private data, such as the General Data Protection Regulation (GDPR) or the EU’s Law Enforcement Directive (LED), which are more specific to police investigations. Also, depending on the final use case, e.g., LEA or non-LEA related, a deeper analysis of the bias of the AI models used, having a direct impact on the composition of the training and testing datasets, must be considered in relation to the EU AI Act. By complying with these regulations, future deployments of text-grouping systems can support LEAs while ensuring transparency and accountability in the use of automated visual analysis systems.

Author Contributions

Conceptualisation, J.G., L.U. and E.S.; methodology, E.S. and J.G.; software, E.S.; validation, E.S. and J.G.; formal analysis, E.S. and L.U.; investigation, E.S.; resources, J.G., L.U. and P.L.; data curation, E.S.; writing—original draft preparation, E.S. and L.U.; writing—review and editing, E.S., L.U., J.G. and P.L.; visualisation, E.S. and L.U.; supervision, J.G. and L.U.; project administration, J.G., L.U. and P.L.; funding acquisition, J.G. and P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 101021797. The work described in this paper is performed in the H2020 project STARLIGHT (“Sustainable Autonomy and Resilience for LEAs using AI against High Priority Threats”).

Data Availability Statement

The datasets used in this study include two publicly available research benchmarks. The COCO-TextV2.0 dataset, publicly available at https://bgshih.github.io/cocotext/ (accessed on 15 December 2025), and the ArT19 dataset available at https://rrc.cvc.uab.es/?ch=14 (accessed on 15 December 2025). The dataset generated and evaluated as part of this article is not readily available because of time limitations regarding its internal publication. Requests to access the dataset should be directed to eyshinohara@vicomtech.org.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results. Authors Enrique Shinohara, Jorge García, Luis Unzueta and Peter Leškovský were employed by the company VICOMTECH. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Cao, D.; Zhong, Y.; Wang, L.; He, Y.; Dang, J. Scene Text Detection in Natural Images: A Review. Symmetry 2020, 12, 1956. [Google Scholar] [CrossRef]
  2. Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character Region Awareness for Text Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; Computer Vision Foundation: New York, NY, USA; IEEE: Piscataway, NJ, USA, 2019; pp. 9365–9374. [Google Scholar] [CrossRef]
  3. Liao, M.; Wan, Z.; Yao, C.; Chen, K.; Bai, X. Real-Time Scene Text Detection with Differentiable Binarization. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, 7–12 February 2020; AAAI Press: Washington, DC, USA, 2020; pp. 11474–11481. [Google Scholar] [CrossRef]
  4. Zhao, S.; Quan, R.; Zhu, L.; Yang, Y. CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-Trained Vision-Language Model. IEEE Trans. Image Process. 2024, 33, 6893–6904. [Google Scholar] [CrossRef] [PubMed]
  5. Du, Y.; Chen, Z.; Jia, C.; Yin, X.; Zheng, T.; Li, C.; Du, Y.; Jiang, Y. SVTR: Scene Text Recognition with a Single Visual Model. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23–29 July 2022; Raedt, L.D., Ed.; International Joint Conferences on Artificial Intelligence Organization: California, CA, USA, 2022; pp. 884–890. [Google Scholar] [CrossRef]
  6. Nayef, N.; Liu, C.; Ogier, J.; Patel, Y.; Busta, M.; Chowdhury, P.N.; Karatzas, D.; Khlif, W.; Matas, J.; Pal, U.; et al. ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition—RRC-MLT-2019. In Proceedings of the 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, 20–25 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1582–1587. [Google Scholar] [CrossRef]
  7. Long, S.; Qin, S.; Panteleev, D.; Bissacco, A.; Fujii, Y.; Raptis, M. ICDAR 2023 Competition on Hierarchical Text Detection and Recognition. In Proceedings of the Document Analysis and Recognition-ICDAR 2023-17th International Conference, San José, CA, USA, 21–26 August 2023; Proceedings, Part II. Fink, G.A., Jain, R., Kise, K., Zanibbi, R., Eds.; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2023; Volume 14188, pp. 483–497. [Google Scholar] [CrossRef]
  8. Cheng, Z.; Lu, J.; Zou, B.; Zhou, S.; Wu, F. ICDAR 2021 Competition on Scene Video Text Spotting. In Proceedings of the 16th International Conference on Document Analysis and Recognition, ICDAR 2021, Lausanne, Switzerland, 5–10 September 2021; Proceedings, Part IV. Lladós, J., Lopresti, D., Uchida, S., Eds.; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2021; Volume 12824, pp. 650–662. [Google Scholar] [CrossRef]
  9. Xue, C.; Huang, J.; Zhang, W.; Lu, S.; Wang, C.; Bai, S. Contextual Text Block Detection Towards Scene Text Understanding. In Proceedings of the Computer Vision-ECCV 2022-17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXVIII. Avidan, S., Brostow, G.J., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2022; Volume 13688, pp. 374–391. [Google Scholar] [CrossRef]
  10. Wang, J.; Zhang, S.; Hu, K.; Ma, C.; Zhong, Z.; Sun, L.; Huo, Q. Dynamic Relation Transformer for Contextual Text Block Detection. In Proceedings of the Document Analysis and Recognition-ICDAR 2024-18th International Conference, Athens, Greece, 30 August–4 September 2024; Proceedings, Part I. Smith, E.H.B., Liwicki, M., Peng, L., Eds.; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2024; Volume 14804, pp. 313–330. [Google Scholar] [CrossRef]
  11. Long, S.; Qin, S.; Panteleev, D.; Bissacco, A.; Fujii, Y.; Raptis, M. Towards End-to-End Unified Scene Text Detection and Layout Analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1039–1049. [Google Scholar] [CrossRef]
  12. Ye, M.; Zhang, J.; Liu, J.; Liu, C.; Yin, B.; Liu, C.; Du, B.; Tao, D. Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 1431–1447. [Google Scholar] [CrossRef] [PubMed]
  13. Veit, A.; Matera, T.; Neumann, L.; Matas, J.; Belongie, S.J. COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images. arXiv 2016, arXiv:1601.07140. [Google Scholar] [CrossRef]
  14. Chng, C.K.; Ding, E.; Liu, J.; Karatzas, D.; Chan, C.S.; Jin, L.; Liu, Y.; Sun, Y.; Ng, C.C.; Luo, C.; et al. ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text—RRC-ArT. In Proceedings of the 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, 20–25 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1571–1576. [Google Scholar] [CrossRef]
  15. O’Gorman, L. The document spectrum for page layout analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1993, 15, 1162–1173. [Google Scholar] [CrossRef]
  16. Epshtein, B.; Ofek, E.; Wexler, Y. Detecting text in natural scenes with stroke width transform. In Proceedings of the Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13–18 June 2010; IEEE Computer Society: Washington, DC, USA, 2010; pp. 2963–2970. [Google Scholar] [CrossRef]
  17. Mohan, D.D.; Jawade, B.; Setlur, S.; Govindaraju, V. Deep Metric Learning for Computer Vision: A Brief Overview. arXiv 2023, arXiv:2312.10046. [Google Scholar] [CrossRef]
  18. Liu, S.; Wang, R.; Raptis, M.; Fujii, Y. Unified Line and Paragraph Detection by Graph Convolutional Networks. In Proceedings of the Document Analysis Systems-15th IAPR International Workshop, DAS 2022, La Rochelle, France, 22–25 May 2022; Proceedings. Uchida, S., Smith, E.H.B., Eglin, V., Eds.; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2022; Volume 13237, pp. 33–47. [Google Scholar] [CrossRef]
  19. Wei, S.; Xu, N. PARAGRAPH2GRAPH: A GNN-based framework for layout paragraph analysis. arXiv 2023, arXiv:2304.11810. [Google Scholar] [CrossRef]
  20. Xing, Y.; He, T.; Xiao, T.; Wang, Y.; Xiong, Y.; Xia, W.; Wipf, D.; Zhang, Z.; Soatto, S. Learning Hierarchical Graph Neural Networks for Image Clustering. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3447–3457. [Google Scholar] [CrossRef]
  21. Bi, T.; Zhang, X.; Zhang, Z.; Xie, W.; Lan, C.; Lu, Y.; Zheng, N. Text Grouping Adapter: Adapting Pre-Trained Text Detector for Layout Analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 28150–28159. [Google Scholar] [CrossRef]
  22. Long, S.; Qin, S.; Fujii, Y.; Bissacco, A.; Raptis, M. Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2024, Waikoloa, HI, USA, 3–8 January 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 892–902. [Google Scholar] [CrossRef]
  23. Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  24. Adams, R.; Bischof, L. Seeded Region Growing. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 641–647. [Google Scholar] [CrossRef]
  25. Liu, C.; Yang, J.; Ceylan, D.; Yumer, E.; Furukawa, Y. PlaneNet: Piece-Wise Planar Reconstruction From a Single RGB Image. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; Computer Vision Foundation: New York, NY, USA; IEEE Computer Society: Washington, DC, USA, 2018; pp. 2579–2588. [Google Scholar] [CrossRef]
  26. Liu, C.; Kim, K.; Gu, J.; Furukawa, Y.; Kautz, J. PlaneRCNN: 3D Plane Detection and Reconstruction From a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; Computer Vision Foundation: New York, NY, USA; IEEE: Piscataway, NJ, USA, 2019; pp. 4450–4459. [Google Scholar] [CrossRef]
  27. Xie, Y.; Shu, F.; Rambach, J.R.; Pagani, A.; Stricker, D. PlaneRecNet: Multi-Task Learning with Cross-Task Consistency for Piece-Wise Plane Detection and Reconstruction from a Single RGB Image. In Proceedings of the 32nd British Machine Vision Conference 2021, BMVC 2021, Online, 22–25 November 2021; BMVA Press: Durham, UK, 2021; p. 239. [Google Scholar]
  28. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  29. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), San Diego, CA, USA, 20–26 June 2005; IEEE Computer Society: Washington, DC, USA, 2005; pp. 886–893. [Google Scholar] [CrossRef]
  30. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, 7–12 June 2015; IEEE Computer Society: Washington, DC, USA, 2015; pp. 3431–3440. [Google Scholar] [CrossRef]
  31. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  32. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 6230–6239. [Google Scholar] [CrossRef]
  33. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R.B. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  34. Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; IEEE Computer Society: Washington, DC, USA, 2016; pp. 779–788. [Google Scholar] [CrossRef]
  35. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021; OpenReview.net: Alameda, CA, USA, 2021. [Google Scholar]
  36. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 16965–16974. [Google Scholar] [CrossRef]
  37. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, 1–6 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 3992–4003. [Google Scholar] [CrossRef]
  38. Ravi, N.; Gabeur, V.; Hu, Y.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment Anything in Images and Videos. In Proceedings of the Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, 24–28 April 2025; OpenReview.net: Alameda, CA, USA, 2025. [Google Scholar]
  39. Chen, Z.; Wang, W.; Xie, E.; Yang, Z.; Lu, T.; Luo, P. FAST: Searching for a Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation. arXiv 2021, arXiv:2111.02394. [Google Scholar]
  40. Jocher, G.; Qiu, J. Ultralytics YOLO11; Ultralytics: Frederick, MD, USA, 2024; Available online: https://docs.ultralytics.com/ (accessed on 15 December 2025).
  41. Lin, T.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision-ECCV 2014-13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V. Fleet, D.J., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2014; Volume 8693, pp. 740–755. [Google Scholar] [CrossRef]
  42. Bagga, A.; Baldwin, B. Entity-Based Cross-Document Coreferencing Using the Vector Space Model. In Proceedings of the 17th International Conference on Computational Linguistics (COLING 1998), Montreal, QC, Canada, 10–14 August 1998; Association for Computational Linguistics: Stroudsburg, PA, USA, 1998; pp. 79–85. [Google Scholar] [CrossRef]
  43. Morey, L.C.; Agresti, A. The measurement of classification agreement: An adjustment to the Rand statistic for chance agreement. Educ. Psychol. Meas. 1984, 44, 33–37. [Google Scholar] [CrossRef]
  44. Rand, W.M. Objective Criteria for the Evaluation of Clustering Methods. J. Am. Stat. Assoc. 1971, 66, 846–850. [Google Scholar] [CrossRef]
Figure 1. Visualisation of the proposed text-grouping pipeline. The top and bottom rows show two example scenes. For each row, (left)—Optical Character Recognition (OCR) detection; (right)—grouped text instances. First, the pipeline performs word-level detection using OCR, where each text instance is detected individually without context; green boxes in the left images indicate word-level OCR bounding boxes. Then, an object-related grouping step is applied, which clusters the detected text instances by their shared segmented parent object, using the spatial overlap of bounding boxes and masks. Different coloured masks highlight the grouped instances, emphasising which text units are linked together.
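To make the grouping rule described in the caption above concrete, the following minimal sketch assigns each word-level OCR box to the segmented parent object whose mask best contains it. The function names, the containment-ratio criterion and the 0.5 threshold are illustrative assumptions rather than the exact implementation evaluated in this paper.

```python
import numpy as np

def box_mask_overlap(box, mask):
    """Fraction of a text box's area covered by one object mask.

    box  -- (x0, y0, x1, y1) in pixel coordinates
    mask -- boolean array of shape (H, W) for a segmented parent object
    """
    x0, y0, x1, y1 = (int(round(v)) for v in box)
    covered = mask[y0:y1, x0:x1].sum()
    box_area = max((x1 - x0) * (y1 - y0), 1)
    return covered / box_area

def group_text_by_objects(text_boxes, object_masks, min_overlap=0.5):
    """Return one group id per text box; -1 marks boxes left ungrouped."""
    group_ids = []
    for box in text_boxes:
        overlaps = [box_mask_overlap(box, m) for m in object_masks]
        best = int(np.argmax(overlaps)) if overlaps else -1
        keep = best >= 0 and overlaps[best] >= min_overlap
        group_ids.append(best if keep else -1)
    return group_ids
```

Text boxes sharing the same group id are then reported together, mirroring the coloured masks shown on the right of Figure 1.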
Figure 2. Taxonomy of scene text understanding tasks. Typically approached in three stages: low-level (e.g., word or line detection), mid-level (e.g., layout analysis or text block formation) and high-level contextual interpretation (e.g., reading order or semantic reasoning and layout analysis). The physical layout, which can be explained by the relationships between text instances sharing the same physical space, is the focus of this study.
Figure 3. Overview of the proposed modular pipeline for visual scene analysis in investigative data. The background colours distinguish the pipeline stages: input (orange), keyword recognition (blue), grouping strategies (green), and output (red). The central ‘Context-Aware Grouping Strategies’ module (green box) is the primary focus of this study. Law Enforcement Agents (LEAs) should evaluate the trade-offs of latency vs accuracy, e.g., via a Pareto frontier, to select the strategy that best meets task requirements.
Figure 4. Outputs of bounding boxes (green boxes) detected with the Optical Character Recognition (OCR) module.
Figure 5. SAM masks predicted using the Optical Character Recognition (OCR) bounding boxes as input prompts.
Figure 6. Final combination of masks and bounding boxes (green boxes), in which the truck is detected as a whole object.
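Figures 4–6 illustrate the promptable strategy, in which OCR bounding boxes are fed to SAM as box prompts and the resulting masks are combined into a parent-object mask. A minimal sketch of the prompting step is given below, assuming the SAM2ImagePredictor interface from the official sam2 package; the checkpoint identifier and the downstream merging of per-box masks are assumptions, not the exact configuration used here.

```python
import numpy as np
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Checkpoint id is an assumption; any SAM2 image checkpoint should behave similarly.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

def masks_from_ocr_boxes(image_rgb, ocr_boxes):
    """Prompt SAM2 once per OCR box (x0, y0, x1, y1) and return boolean masks."""
    predictor.set_image(image_rgb)  # RGB numpy array of shape (H, W, 3)
    masks = []
    for box in ocr_boxes:
        m, _, _ = predictor.predict(
            box=np.asarray(box, dtype=np.float32),
            multimask_output=False,  # keep only the top-scoring mask per prompt
        )
        masks.append(m[0].astype(bool))
    return masks
```

Per-box masks that expand onto the same surface (e.g., the side of the truck in Figure 6) can then be fused, for instance by merging masks whose pairwise IoU exceeds the association threshold discussed with Table 4.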
Figure 7. Comparison of contextual text grouping results for the three association strategies. Green boxes indicate detected text instances. (Top)—geometric planar reconstruction, (Middle)—zero-shot multi-class instance segmentation, (Bottom)—class-agnostic promptable segmentation.
Figure 8. Qualitative comparison of the contextual text grouping results for the three association strategies. Green boxes indicate detected text instances. (Top)—geometric planar reconstruction, (Middle)—multi-class instance segmentation, (Bottom)—class-agnostic promptable segmentation.
Figure 9. 2D planar segmentation (PlaneRecNet) qualitative error analysis. Green boxes indicate detected text instances.
Figure 10. Multi-class instance segmentation (YOLOv11) qualitative error analysis. Green boxes indicate detected text instances.
Figure 11. Promptable class-agnostic segmentation (SAM2) qualitative error analysis. Green boxes indicate detected text instances.
Table 1. Prior evaluation of the text detector on the custom dataset.

Metric | Score
Precision | 0.90
Recall | 0.71
F1 | 0.80
Table 2. Quantitative analysis of grouping strategies. Higher values indicate better performance across all metrics.

Grouping Strategy | Pairwise Precision | Pairwise Recall | Pairwise F1 | Group Ratio
Hi-SAM [12] | 0.83 | 0.45 | 0.53 | 0.60
PlaneRecNet [27] | 0.81 | 0.83 | 0.80 | 0.85
YOLOv11 [40] | 0.77 | 0.87 | 0.80 | 0.85
SAM2 [38] | 0.85 | 0.83 | 0.82 | 0.89

Grouping Strategy | B-Cubed Precision | B-Cubed Recall | B-Cubed F1 | Adjusted Rand Index
Hi-SAM [12] | 1.00 | 0.65 | 0.74 | 0.19
PlaneRecNet [27] | 0.93 | 0.95 | 0.93 | 0.15
YOLOv11 [40] | 0.90 | 0.99 | 0.92 | 0.08
SAM2 [38] | 0.97 | 0.95 | 0.95 | 0.23
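For completeness, the clustering metrics reported in Table 2 can be reproduced from per-instance group labels with a few lines of code. The sketch below hand-rolls B-Cubed precision/recall/F1 and relies on scikit-learn for the Adjusted Rand Index; it assumes one integer label per detected text instance and is not the evaluation script used for the paper.

```python
from sklearn.metrics import adjusted_rand_score

def bcubed(pred, gold):
    """B-Cubed precision, recall and F1 for flat clusterings.

    pred, gold -- equal-length lists of group labels, one per text instance.
    """
    n = len(pred)
    precision = recall = 0.0
    for i in range(n):
        same_pred = {j for j in range(n) if pred[j] == pred[i]}
        same_gold = {j for j in range(n) if gold[j] == gold[i]}
        correct = len(same_pred & same_gold)
        precision += correct / len(same_pred)
        recall += correct / len(same_gold)
    precision, recall = precision / n, recall / n
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy example: four words, two ground-truth groups, one word wrongly merged.
gold = [0, 0, 1, 1]
pred = [0, 0, 0, 1]
print(bcubed(pred, gold))               # B-Cubed precision, recall, F1
print(adjusted_rand_score(gold, pred))  # chance-corrected Rand index
```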
Table 3. Performance of the text detection + segmentation + association pipeline.

Grouping Strategy | FPS
Hi-SAM [12] | 0.2
PlaneRecNet [27] | 6.1
YOLOv11 [40] | 7.1
SAM2 [38] | 1.4
Table 4. Evaluation of grouping performance under different IoU thresholds (t) of the SAM2 strategy.

IoU Threshold (t) | Pairwise F1 | B-Cubed F1 | Adjusted Rand Index (ARI)
0.1 | 0.774 | 0.919 | 0.229
0.05 | 0.800 | 0.937 | 0.231
0.015 | 0.823 | 0.951 | 0.225
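The threshold t in Table 4 gates whether a text-level mask is attached to a candidate parent mask. A minimal sketch of that decision is shown below, assuming boolean masks and treating t as the minimum IoU required for association; the exact matching criterion in the evaluated pipeline may differ.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean masks of equal shape."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def associate(text_mask, parent_masks, t=0.015):
    """Index of the best-matching parent mask, or -1 if the best IoU is below t."""
    ious = [mask_iou(text_mask, m) for m in parent_masks]
    if not ious:
        return -1
    best = int(np.argmax(ious))
    return best if ious[best] >= t else -1
```

Lower thresholds accept looser matches, which in Table 4 trades a small ARI decrease for higher Pairwise and B-Cubed F1.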
Table 5. B-Cubed F1 and ARI under detection error propagation.

Strategy | Error Type | B-Cubed F1 (0.2/0.5) | ARI (0.2/0.5)
PlaneRecNet | Drop | 0.932 (+0.7%) / 0.949 (+2.6%) | 0.159 (+3.2%) / 0.128 (−16.7%)
PlaneRecNet | Merge | 0.917 (−0.9%) / 0.919 (−0.7%) | 0.117 (−24.0%) / 0.099 (−35.6%)
PlaneRecNet | Split | 0.798 (−13.7%) / 0.627 (−32.2%) | 0.115 (−25.4%) / 0.063 (−59.0%)
YOLOv11 | Drop | 0.927 (+0.7%) / 0.943 (+2.5%) | 0.084 (+5.1%) / 0.061 (−23.3%)
YOLOv11 | Merge | 0.916 (−0.5%) / 0.913 (−0.8%) | 0.066 (−17.2%) / 0.057 (−29.1%)
YOLOv11 | Split | 0.804 (−12.6%) / 0.628 (−31.8%) | 0.061 (−23.2%) / 0.049 (−39.0%)
SAM2 | Drop | 0.954 (+0.3%) / 0.968 (+1.7%) | 0.217 (−3.6%) / 0.167 (−26.1%)
SAM2 | Merge | 0.940 (−1.2%) / 0.933 (−1.9%) | 0.174 (−22.7%) / 0.136 (−39.6%)
SAM2 | Split | 0.817 (−14.1%) / 0.639 (−32.8%) | 0.161 (−28.7%) / 0.093 (−58.9%)
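Table 5 measures how grouping scores degrade when 20% or 50% of detections are perturbed. Purely as an illustration of how such perturbations can be injected at the group-label level before re-scoring (the paper's exact protocol may differ), a small sketch follows; the function name and label conventions are assumptions.

```python
import random

def perturb_labels(pred, error_type, rate=0.2, seed=0):
    """Randomly drop, merge or split a fraction of predicted group labels.

    pred -- list of integer group labels, one per detected text instance.
    Drop:  mark instances as missed detections (label None).
    Merge: reassign instances to a different existing group.
    Split: move instances into fresh singleton groups.
    """
    rng = random.Random(seed)
    out = list(pred)
    chosen = rng.sample(range(len(out)), int(rate * len(out)))
    next_label = max(out) + 1
    for i in chosen:
        if error_type == "drop":
            out[i] = None
        elif error_type == "merge":
            others = sorted({p for p in out if p is not None and p != out[i]})
            if others:
                out[i] = rng.choice(others)
        elif error_type == "split":
            out[i] = next_label
            next_label += 1
    return out
```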
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
