Article

Document Image Verification Based on Paragraph Alignment and Subtle Change Detection

School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266520, China
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2025, 15(23), 12430; https://doi.org/10.3390/app152312430
Submission received: 1 October 2025 / Revised: 14 November 2025 / Accepted: 20 November 2025 / Published: 23 November 2025

Abstract

The digitization of paper documents enables rapid sharing and long-term preservation of information, making it a widely adopted approach for efficient document storage and management across various domains. However, recent advances in image editing software pose an increasing threat to the integrity of document images. Comparing the input with the corresponding reference document image is a direct and effective approach to verification. Nevertheless, this task is challenging due to two key factors, namely, the need for efficient retrieval of the reference document images and the difficulty of detecting subtle content changes under print–scan (PS) distortions. To address these challenges, this work proposes a document image verification scheme based on paragraph alignment and subtle change detection. It first extracts paragraph structural features from both input and reference document images to achieve efficient image retrieval and accurate paragraph alignment. Based on the alignment results, the proposed scheme employs contrastive learning to reduce the effect of PS distortions when extracting features from the input and reference document images. Finally, an additional verification step is introduced that significantly reduces false positive detections by addressing the feature misalignment within the extracted paragraphs. To evaluate the proposed scheme, extensive experiments were conducted on databases constructed from public datasets, and various benchmark methods were compared. Experimental results show that the proposed scheme outperforms the benchmark methods, achieving an accuracy score of 0.963.

1. Introduction

In recent years, the rapid advancement of information technology and the widespread adoption of online platforms have greatly accelerated the digitization of paper documents, making it a widely used approach for distributing textual information, chart data, and other structured or unstructured content across various domains. For example, scanned document images stored on servers, such as contracts, medical records, and certificates, not only facilitate long-term preservation but also enhance the efficiency of information sharing. However, with the prevalence of image editing and text manipulation software, the integrity of document images faces increasing challenges [1,2]. In particular, the use of tampered document images in illicit activities has emerged as a serious social concern. In fields such as finance, healthcare, and law, where information integrity is critical, tampered document images might cause substantial economic losses. For example, an attacker may alter contract terms to obtain unlawful benefits. As a result, verifying the content of document images is particularly important.
In real-life scenarios, document content is generally diverse and unstructured. Therefore, subtle changes within it are typically difficult to discern [3]. Although such tampering may appear to be subtle, such as altering a single character, digit, or keyword, it can still result in serious consequences. As shown in Figure 1, an attacker tampered with the first line of a paragraph in a rental contract to illicitly change the monetary amount. This type of localized tampering is particularly difficult to detect through manual inspection due to the complexity of the diverse document structures and unknown contents. Consequently, the detection of subtle content change in document images has been garnering increasing attention from researchers in recent years [3,4].
In this work, we consider two common tampering attack scenarios involving document images. As illustrated in Figure 2, an attacker may either alter the content of a digital document before printing and scanning it to produce a tampered document image or tamper with the content of an already scanned document image using image editing software to generate a forged version for illicit purposes. In both scenarios, the genuine and tampered document images are generated through the print–scan (PS) process. A direct and effective approach to verify the content integrity of a document image is to compare it with its corresponding reference document image stored on a secure server, which preserves the original content prior to printing. In such a verification system, the corresponding reference image is first retrieved from a document database, followed by content change detection to determine whether the content has been altered.
In many real-world scenarios, direct access to the original document image may not be possible due to privacy or security constraints, and the recipient may only obtain a printed copy. For example, in lawsuits, evidence collection often involves printed copies of original documents, making verification against the corresponding digital reference a practical and necessary task. In such scenarios, a document may be printed multiple times, and the goal of this work is not to authenticate a specific physical copy but to verify the integrity of its content. As a result, although there is overlap, the problem addressed in this work differs from prior studies that focus on tampering detection without reference images. For example, detecting attacks such as recapturing is useful for revealing potential tampering [5], but the content of a recaptured document may still be genuine. In contrast, this work focuses on the content integrity verification, ensuring that the information in a document image remains consistent with a trusted reference source.
Existing methods for detecting content changes in document images, such as those illustrated in Figure 2, can be broadly classified into two categories: Optical Character Recognition (OCR)-based methods and image feature analysis-based methods. Methods in the first category employ OCR [6] to recognize and extract textual content from document images, which is then compared with the text stored in a database to determine whether a tampering attack has occurred [7]. However, the performance of these methods is heavily dependent on OCR’s language support. In real-world scenarios, a document image may include multiple languages or non-textual elements, such as images and tables, which can easily introduce errors during character recognition. Additionally, the PS process introduces distortions such as noise, blurring, and shape deformation, further reducing the accuracy and robustness of OCR-based methods [8]. As a result, the methods based on OCR are inherently constrained by the limitations of OCR technology, making them unsuitable for detecting tampered document images in complex, real-world scenarios.
Methods in the second category exploit image features to detect the content change in document images. In these methods, image features are first extracted from the input document images and then compared with those of the reference document images stored in a database [7]. Unlike OCR-based methods, methods in this category do not impose restrictions on document content, making it applicable to a wider range of document types. However, to detect subtle changes, the input document image and the corresponding reference document image must be precisely aligned for accurate feature comparison. The geometric distortions introduced during the PS process often result in varying degrees of misalignment between the input and the reference document images [9]. Such distortions lead to spatial mismatches in the extracted features, contributing to a high rate of false detection. Additionally, noise introduced by the PS channel, as well as the unknown response functions of printers and scanners, creates differences in pixel distributions between the input and reference document images [10,11]. These factors further exacerbate the discrepancies between the feature distributions of input and reference document images, thereby degrading the overall performance of content change detection.
To address the challenges faced by existing methods, we propose a document image verification scheme that includes two stages, namely, the document image retrieval stage and the content change detection stage. In the document image retrieval stage, the proposed scheme first extracts paragraph structure features from the reference document images and stores them in a database, where the extracted features serve as the index. Paragraph features are then extracted from the input document images and matched against those in the database to retrieve the corresponding reference document image. Once retrieved, correspondences between the paragraphs of the input and reference document images are established for subsequent content change detection. By relying on paragraph structure rather than textual content, this process avoids the dependence on document content or language, enabling fast retrieval and accurate alignment of paragraphs across diverse document types. In the second stage, deep features are extracted from pairs of matched paragraphs between the input and reference document images, and a contrastive learning framework is employed to address the distortions introduced by the PS channel. Furthermore, an additional verification step is incorporated to address feature mismatches within paragraph pairs, thereby ensuring more accurate feature alignment. By effectively addressing the aforementioned challenges, the proposed scheme demonstrates superior performance compared to benchmark methods, particularly by achieving a high detection rate for content changes while maintaining a low false detection rate.
The main contributions of this work can be summarized as follows:
  • We propose a document image retrieval method that leverages paragraph structure features from both input and reference document images. Compared with existing methods, it significantly improves retrieval efficiency. In addition, it enables precise paragraph alignment, which substantially facilitates content change detection in document images.
  • We propose a content change detection method based on contrastive learning. In the proposed method, a tailored loss function is designed to enable the model to distinguish between unchanged and changed content, and a second verification step is incorporated to address false detections. Together, these two mechanisms significantly improve the detection accuracy.
  • We construct two document image databases comprising the genuine and tampered document images, and conduct extensive evaluations for the proposed document image verification scheme. The results demonstrate that the proposed scheme accurately detects the content change in practical scenarios for documents of general content and outperforms the benchmark methods.
The rest of the paper is organized as follows: Section 2 reviews related works on document image retrieval and content change detection. Section 3 introduces the proposed document image verification scheme. Section 4 evaluates the proposed scheme and compares its performance with existing methods. Finally, Section 5 concludes.

2. Related Works

As discussed in Section 1 and illustrated in Figure 2, the document image verification problem addressed in this work consists of two key components: document image retrieval and content change detection. Existing retrieval methods can be broadly classified into two categories: OCR-based approaches that exploit textual content for indexing and image feature-based approaches. Similarly, existing content change detection methods fall into two categories: those that use OCR and compare the textual content, and those relying on image feature matching.

2.1. Document Image Retrieval

Existing document image retrieval methods can be grouped into two categories: those that use document contextual information as the index for retrieval and those that use document image features as the index for retrieval. The methods in the first category rely on OCR [6,12]. During the construction of the reference document image database, OCR is applied to extract textual information from the reference document images. A portion of the extracted text, such as titles or the initial paragraph, is used as the index in the database [13,14,15]. To retrieve the corresponding reference document image, users apply the OCR again to extract text from the input document, the result of which is then matched against the index database. While these approaches reduce index computation by utilizing only a portion of the textual information, their performance is limited by OCR accuracy, which is heavily influenced by both the document content and image quality. In particular, the OCR-based methods are not applicable for processing non-textual content in input document images. Additionally, distortions introduced by the PS channel, such as noise or blur, often result in errors during text extraction [16].
The methods in the second category utilize image features for document image retrieval [17]. During the document image retrieval, the user provides an input document image, and the system extracts its feature vector, matches it against the feature vectors stored in the database, and returns the corresponding reference document image [18,19,20]. Since these methods exploit image features during the retrieval process, their applicability is not limited by document content. However, feature vectors often consist of hundreds or even thousands of dimensions, leading to excessive storage demands and reduced retrieval speed when indexing large databases [21,22].
In addition to the works on document image retrieval, there are also studies on document image structure analysis, which is related to the document image retrieval problem. Early research explored the decomposition of document pages into hierarchical and layout-based components [23]. Later studies showed that global structural patterns provide strong cues for page characterization [24]. More recent work introduced deep learning techniques for separating text and non-text regions, and advanced multi-modal models emphasized the importance of jointly modeling visual, textual, and layout information [25,26]. However, these works do not target the problem of document image retrieval. Inspired by these studies, we propose a structure-driven document image retrieval method in this work.

2.2. Document Content Change Detection

After retrieving the document image, the content of the input document image is compared with that of the corresponding reference document image. Existing approaches for document content change detection can be categorized into two types: OCR-based methods and methods based on image features. The methods of the first category employ OCR to extract text from both the input and reference document images. The extracted text is then compared to detect any changes in the content [27,28]. OCR-based methods offer the advantage of high recognition accuracy for supported languages. Furthermore, converting document images into text facilitates more efficient content comparison. Despite these advantages, OCR-based methods have several limitations. Since OCR technology is specifically designed for text-based content, it is less effective in recognizing or processing non-textual elements within documents [29,30]. Additionally, its performance is affected by distortions introduced in the PS channel, which can easily lead to errors in character recognition. Consequently, detecting subtle changes, such as the alteration of a single character, presents significant challenges for OCR-based methods [9].
The methods in the second category utilize image features to detect changes in document content, including local features such as Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), Oriented FAST and Rotated BRIEF (ORB) [31,32,33], and deep features [3]. In these approaches, feature vectors are extracted from both the input and reference document images. Subsequently, similarity measures, such as Euclidean distance or cosine similarity, are computed to evaluate the correspondence between the extracted features. Changes in content are detected based on the matching of the extracted features. Notably, image feature-based methods are content-agnostic, making them effective for detecting changes in general document images [34,35]. However, these methods process the document image as a whole and extract a single feature vector or a set of feature vectors to represent the entire image. As a result, for subtle changes such as the alteration of a single character, the extracted feature vector may show negligible differences compared with the reference document image, making such changes difficult to detect [36,37]. Although the method in [3] employs a hierarchical detection strategy to progressively narrow the detection range, it still begins by processing the document image as a whole. Moreover, the reduction of the detection range is achieved by cropping the document image into fixed-size regions, which inevitably include background areas. These approaches are also highly sensitive to misalignment between the input and reference document images. Such misalignment leads to mismatched features, where content from different parts of the documents is incorrectly compared, thereby degrading the performance of content change detection. 
In addition, distortions introduced by the PS channel create discrepancies in pixel distributions between the input and reference document images, further limiting the ability of image feature-based methods to reliably detect subtle content changes [38,39].

3. The Proposed Document Image Verification Scheme

In this section, the proposed document verification scheme based on paragraph alignment and content change detection is presented. The block diagram of the proposed scheme is shown in Figure 3. The proposed scheme consists of two stages, namely, a document image retrieval stage and a content change detection stage. In the document image retrieval stage, the candidate reference document images corresponding to the input document image are retrieved from a database. In the subsequent content change detection stage, paragraphs segmented from the input and reference document images are first aligned. Features are then extracted from each pair of aligned paragraph images and compared using a contrastive learning-based network. Finally, based on the comparison result, the proposed scheme verifies the content of the input document image.

3.1. Document Image Retrieval

In the document image retrieval stage, the paragraphs of the input and reference document images are first segmented. Based on the segmentation results, descriptors are constructed for the input and reference document images. A 4-step paragraph matching procedure is then performed to match the segmented paragraphs of the input document image to those of the reference documents in the database. The output of the document image retrieval stage is a set of candidate reference document images that share a similar structure with the input document image. From this output, paragraph-level correspondences are also established for each candidate reference document image corresponding to the input document image.

3.1.1. Paragraph Segmentation

In this step, element segmentation is performed on both the input and reference document images. To perform paragraph segmentation, the Faster Region-based Convolutional Neural Network (R-CNN) object detection model [40] is employed. Examples of the element segmentation results are shown in Figure 4. While we employ the Faster R-CNN in this work, the proposed scheme is not limited to this specific segmentation method. Existing segmentation methods suitable for document images [41,42,43] can be employed to segment the paragraphs, and the segmentation results can be used in the same way as those of Faster R-CNN, as discussed in Section 4. From the segmentation results, we obtain a bounding box for each paragraph within the document image, shown as red bounding boxes in Figure 4.
For each reference document image in a database, a descriptor containing multiple fields is built based on the segmentation results. For the i-th reference document image, its descriptor is defined as
$$I_i \triangleq \left\{\, i,\ c_i,\ \{ b_{i,j} \}_{j=1}^{c_i},\ \{ a_{i,j} \}_{j=1}^{c_i},\ \{ r_{i,j} \}_{j=1}^{c_i},\ \{ n_{i,j} \}_{j=1}^{c_i} \,\right\}, \tag{1}$$
where $i$ denotes the document index, $j$ denotes the paragraph index, and $c_i$ denotes the number of segmented paragraphs of the $i$-th reference document image. $b_{i,j}$, $a_{i,j}$, $r_{i,j}$, and $n_{i,j}$ denote the bounding box coordinates, normalized area, aspect ratio, and the indices of the surrounding paragraphs for the $j$-th segmented paragraph in the $i$-th reference document image, respectively. To ensure consistency, the paragraph indices $j$ are assigned in descending order of area, such that $a_{i,1} \ge a_{i,2} \ge \dots \ge a_{i,c_i}$.
Let $\mathcal{D} = \{ I_i \}_{i=1}^{N}$ denote the set of all reference document images in the database, where $N$ is the total number of reference document images, and $I_i$ represents the descriptor of the $i$-th reference document image as defined above. The definitions and example values of the fields of the descriptor $I_i$ are listed in Table 1.
From the descriptor $I_i$, the field $i$ serves as a unique identifier of the corresponding reference document image. The bounding box coordinates $b_{i,j}$ are defined as
$$b_{i,j} \triangleq \left[\, x_l^{(i,j)},\ y_t^{(i,j)},\ x_r^{(i,j)},\ y_b^{(i,j)} \,\right], \tag{2}$$
where $x_l^{(i,j)}$, $y_t^{(i,j)}$, $x_r^{(i,j)}$, and $y_b^{(i,j)}$ denote the left, top, right, and bottom border coordinates of the $j$-th segmented paragraph, respectively. The area $a_{i,j}$, normalized relative to the largest segmented paragraph, is defined as
$$a_{i,j} \triangleq \frac{s_{i,j}}{\max_j \left\{ s_{i,1}, s_{i,2}, \dots, s_{i,c_i} \right\}}, \tag{3}$$
where $s_{i,j}$ denotes the area of the $j$-th segmented paragraph, defined as
$$s_{i,j} \triangleq \left( x_r^{(i,j)} - x_l^{(i,j)} \right) \times \left( y_b^{(i,j)} - y_t^{(i,j)} \right). \tag{4}$$
The aspect ratio $r_{i,j}$ of the $j$-th segment is defined as
$$r_{i,j} \triangleq \frac{x_r^{(i,j)} - x_l^{(i,j)}}{y_b^{(i,j)} - y_t^{(i,j)}}. \tag{5}$$
Finally, $n_{i,j} \in \mathbb{N}^4$ contains the indices of the neighboring segmented paragraphs in the top, bottom, left, and right directions of the $j$-th paragraph in the $i$-th reference document image. A value of 0 in any element of $n_{i,j}$ indicates that no neighboring segmented paragraph exists in the corresponding direction.
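As a concrete illustration, the descriptor fields of Equations (1)-(5) can be computed directly from the paragraph bounding boxes. The following is a minimal Python sketch; the function name and dictionary layout are our own, and the neighbor-index field $n_{i,j}$ is omitted here for brevity.

```python
def build_descriptor(doc_id, boxes):
    """Build the descriptor fields for one document.

    boxes: list of (x_l, y_t, x_r, y_b) paragraph bounding boxes.
    Returns the boxes sorted by descending area, together with the
    normalized areas a_{i,j} and aspect ratios r_{i,j} of Section 3.1.1.
    The neighbor-index field n_{i,j} is omitted in this sketch.
    """
    # Paragraph area s_{i,j} = (x_r - x_l) * (y_b - y_t), Eq. (4)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    # Indices are assigned in descending order of area: a_{i,1} >= a_{i,2} >= ...
    boxes = sorted(boxes, key=area, reverse=True)
    s_max = area(boxes[0])
    a = [area(b) / s_max for b in boxes]                # normalized areas, Eq. (3)
    r = [(b[2] - b[0]) / (b[3] - b[1]) for b in boxes]  # aspect ratios, Eq. (5)
    return {"id": doc_id, "c": len(boxes), "b": boxes, "a": a, "r": r}
```

Because the largest paragraph always normalizes to 1.0, the normalized areas are invariant to the global scaling introduced when a document is printed and rescanned at a different resolution.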

3.1.2. Paragraph Matching

In this step, the descriptor of the input document image is first extracted, defined as
$$\tilde{I} \triangleq \left\{\, \tilde{c},\ \{ \tilde{b}_k \}_{k=1}^{\tilde{c}},\ \{ \tilde{a}_k \}_{k=1}^{\tilde{c}},\ \{ \tilde{r}_k \}_{k=1}^{\tilde{c}},\ \{ \tilde{n}_k \}_{k=1}^{\tilde{c}} \,\right\}, \tag{6}$$
where $k$ denotes the index of its paragraphs, $\tilde{c}$ denotes the number of segmented paragraphs, and $\tilde{b}_k$, $\tilde{a}_k$, $\tilde{r}_k$, and $\tilde{n}_k$ denote the bounding box coordinates, normalized area, aspect ratio, and the indices of the surrounding paragraphs for the $k$-th segmented paragraph in the input document image, respectively.
The bounding box coordinates of the $k$-th segmented paragraph $\tilde{b}_k$ are defined as
$$\tilde{b}_k \triangleq \left[\, \tilde{x}_l^{(k)},\ \tilde{y}_t^{(k)},\ \tilde{x}_r^{(k)},\ \tilde{y}_b^{(k)} \,\right], \tag{7}$$
where $\tilde{x}_l^{(k)}$, $\tilde{y}_t^{(k)}$, $\tilde{x}_r^{(k)}$, and $\tilde{y}_b^{(k)}$ denote the left, top, right, and bottom border coordinates of the $k$-th segmented paragraph, respectively. The normalized area $\tilde{a}_k$ is defined as
$$\tilde{a}_k \triangleq \frac{\tilde{s}_k}{\max_k \left\{ \tilde{s}_1, \tilde{s}_2, \dots, \tilde{s}_{\tilde{c}} \right\}}, \tag{8}$$
where $\tilde{s}_k$ denotes the area of the $k$-th segmented paragraph in the input document image, defined as
$$\tilde{s}_k \triangleq \left( \tilde{x}_r^{(k)} - \tilde{x}_l^{(k)} \right) \times \left( \tilde{y}_b^{(k)} - \tilde{y}_t^{(k)} \right). \tag{9}$$
The aspect ratio $\tilde{r}_k$ is defined as
$$\tilde{r}_k \triangleq \frac{\tilde{x}_r^{(k)} - \tilde{x}_l^{(k)}}{\tilde{y}_b^{(k)} - \tilde{y}_t^{(k)}}. \tag{10}$$
Finally, the indices of the neighboring paragraphs in the top, bottom, left, and right directions of the $k$-th paragraph in the input document image are collected in the vector $\tilde{n}_k \in \mathbb{N}^4$, where a value of 0 in any element of $\tilde{n}_k$ indicates that no neighboring segmented paragraph exists in the corresponding direction.
After the extraction of the descriptor $\tilde{I}$ for the input document image, a 4-step matching procedure is applied to match the paragraphs of the input and the reference document images. The matching procedure is illustrated in Figure 5.
The first step in the matching procedure is count matching, as illustrated in Figure 5. In this step, the number of segmented paragraphs $c_i$ in each reference document image $I_i$ is compared with $\tilde{c}$ of the input document image $\tilde{I}$. Specifically, the set of reference document images in $\mathcal{D}$ matched to $\tilde{I}$ is defined as
$$\mathcal{D}_0 \triangleq \left\{ I_i \in \mathcal{D} \mid c_i = \tilde{c} \right\}, \tag{11}$$
where $\mathcal{D}$ is the set of all reference document descriptors, as stated in Section 3.1.1. An example of this matching step is shown in Figure 6.
The second step in the matching procedure refines the candidate reference document images in $\mathcal{D}_0$ by filtering out those whose paragraph areas differ significantly from those of the input document image. Specifically, the set of reference document images in $\mathcal{D}_0$ matched to $\tilde{I}$ is defined as
$$\mathcal{D}_1 \triangleq \left\{ I_i \in \mathcal{D}_0 \;\middle|\; \left| \left\{ k \in \{1,\dots,\tilde{c}\} \;\middle|\; \exists\, j \in \{1,\dots,c_i\},\ \frac{\left| \tilde{a}_k - a_{i,j} \right|}{\tilde{a}_k} \le 0.1 \right\} \right| = \tilde{c} \right\}. \tag{12}$$
Definition (12) indicates that each $\tilde{a}_k$ in the input document image $\tilde{I}$ has at least one corresponding paragraph in the reference document image $I_i$ such that the area difference is within 10% of $\tilde{a}_k$. An example of this matching step is shown in Figure 7.
Similar to the second step, the third step in the matching procedure refines the candidate reference documents in $\mathcal{D}_1$ by filtering out those whose paragraph shapes differ significantly from those of the input document image. Specifically, the set of reference document images in $\mathcal{D}_1$ matched to $\tilde{I}$ is defined as
$$\mathcal{D}_2 \triangleq \left\{ I_i \in \mathcal{D}_1 \;\middle|\; \left| \left\{ k \in \{1,\dots,\tilde{c}\} \;\middle|\; \exists\, j \in \{1,\dots,c_i\},\ \left| \ln \frac{\tilde{r}_k}{r_{i,j}} \right| < 0.223 \right\} \right| = \tilde{c} \right\}, \tag{13}$$
where $\tilde{r}_k$ and $r_{i,j}$ denote the aspect ratios of the $k$-th input paragraph and the $j$-th paragraph in the $i$-th reference document image, respectively. Since $\ln 1.25 \approx 0.223$, the threshold corresponds to an allowable aspect ratio between approximately 80% and 125% of that of the corresponding reference paragraph. An example of this matching step is shown in Figure 8.
As shown in the figure, the image to be retrieved contains 2 paragraphs whose width exceeds their height and 6 paragraphs whose width is less than their height, so all reference images that satisfy the shape-matching criterion are added to the candidate list.
Finally, the fourth step in the matching procedure refines the candidate reference documents in $\mathcal{D}_2$ by filtering out those whose surrounding paragraph indices $\{ n_{i,1}, n_{i,2}, \dots, n_{i,c_i} \}$ differ from those of the input document image. Specifically, the set of reference document images in $\mathcal{D}_2$ matched to $\tilde{I}$ is defined as
$$\mathcal{D}_3 \triangleq \left\{ I_i \in \mathcal{D}_2 \;\middle|\; \left| \left\{ k \in \{1,\dots,\tilde{c}\} \;\middle|\; \tilde{n}_k = n_{i,k} \right\} \right| = \tilde{c} \right\}. \tag{14}$$
Definition (14) indicates that the neighboring paragraphs of each paragraph in the input document image have a corresponding match in the $i$-th candidate reference document image. The set of candidate reference document images $\mathcal{D}_3$ is the output of the paragraph matching step. An example of this matching step is shown in Figure 9.
The proposed document image retrieval method ensures computational efficiency by progressively narrowing the candidate set through the 4-step matching procedure. It is important to note that, in the final step, at most one match exists between paragraphs of the input document and those of the $i$-th reference document image, as the surrounding paragraph indices are unique for a given paragraph. As a result, paragraph-level correspondences are established for the candidate reference document images in $\mathcal{D}_3$ corresponding to the input document image. These paragraph-level correspondences are essential for the subsequent paragraph-level content change detection stage, as discussed in Section 3.2.
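The progressive narrowing described above can be sketched as a filtering cascade over the descriptors. The following Python sketch works on the same illustrative dictionary layout as before; the 10% area tolerance and the 0.223 log-ratio threshold follow Equations (12) and (13), while the fourth step is simplified here to exact equality of the neighbor-index vectors, assuming paragraph indices already correspond.

```python
import math

def match(input_desc, db, area_tol=0.1, ar_tol=0.223):
    """Progressively filter candidate reference descriptors, in the
    spirit of the 4-step procedure of Eqs. (11)-(14)."""
    # Step 1 (Eq. 11): paragraph count matching
    cands = [d for d in db if d["c"] == input_desc["c"]]
    # Step 2 (Eq. 12): each input area within 10% of some reference area
    cands = [d for d in cands
             if all(any(abs(ak - aj) <= area_tol * ak for aj in d["a"])
                    for ak in input_desc["a"])]
    # Step 3 (Eq. 13): |ln(r_k / r_j)| below ln(1.25) ~= 0.223
    cands = [d for d in cands
             if all(any(abs(math.log(rk / rj)) < ar_tol for rj in d["r"])
                    for rk in input_desc["r"])]
    # Step 4 (Eq. 14, simplified): neighbor-index vectors must coincide
    cands = [d for d in cands if d.get("n") == input_desc.get("n")]
    return cands
```

Because each step only discards candidates, the cheap count check in step 1 removes most of the database before the more expensive pairwise comparisons run.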

3.2. Content Change Detection

In the second stage, the proposed scheme detects content changes in the input document image by comparing it with each candidate reference document image in $\mathcal{D}_3$ obtained from the paragraph matching step. Firstly, paragraph alignment is performed to align the paragraphs segmented from the input document image with those from the candidate reference document images. The aligned paragraph images from the input and the candidate reference document images are then cropped into image blocks. A feature extraction network based on contrastive learning is applied to these image blocks to extract image features. Finally, the features extracted from the input and the candidate reference document images are compared to verify the input document image.

3.2.1. Paragraph Alignment

As described in Section 3.1, the document image retrieval stage establishes correspondences between paragraphs in the input document and those in the candidate reference document images. However, errors may exist in the paragraph segmentation of the input document image due to the PS distortion, as mentioned in Section 1. To address these segmentation errors, the bounding boxes in the input image are aligned based on the corresponding paragraph positions in the candidate reference document images. Specifically, the relative positions of the input paragraphs are adjusted to match the layout of their counterparts in the candidate reference documents.
Among the paragraphs segmented from the $i$-th candidate reference document image, the best-matched paragraph in the input document image can be written as
$$\bar{k} = \arg\min_{k \in \{1,\dots,\tilde{c}\}} \left\| \tilde{b}_k - b_{i,k} \right\|_2^2, \quad \forall\, I_i \in \mathcal{D}_3, \tag{15}$$
where $\tilde{b}_k$ and $b_{i,k}$ denote the bounding box coordinates of the $k$-th paragraph segmented from the input and the $i$-th candidate reference document image, respectively. The bounding boxes $\tilde{b}_k$, excluding $\tilde{b}_{\bar{k}}$, are adjusted based on the relative positions between $b_{i,\bar{k}}$ and the corresponding matched bounding box $b_{i,k}$ in the $i$-th candidate reference document image. By adjusting $\tilde{b}_k$ in this way, the segmented paragraphs in the input document are aligned with those in the candidate reference document image, as illustrated in Figure 10.

3.2.2. Image Feature Extraction

In this step, image features are extracted from the input and candidate reference document images. The feature extraction step starts by cropping the segmented paragraph images from the input document image and the candidate reference document images in $\mathcal{D}_3$. To enable the detection of subtle content changes, the cropped paragraph images are further divided into smaller image blocks, as illustrated in Figure 11.
The use of image blocks in feature extraction increases the detection granularity, thereby enabling the detection of subtle changes within a paragraph. However, features extracted from large image blocks may lack the sensitivity to reflect the subtle differences between the input and reference document images. In contrast, features extracted from small image blocks might only cover the paper background, and thus cannot be used to detect the content changes in input document images. In this work, we empirically set the block size to 64 × 64 pixels based on experimental results, as discussed in Section 4.
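The block cropping described above can be sketched as follows. The zero-padding policy at the right and bottom edges is our assumption for illustration; the paper does not specify how partial blocks are handled.

```python
import numpy as np

def crop_blocks(paragraph, block=64):
    """Crop a grayscale paragraph image into non-overlapping
    block x block tiles (64 x 64 by default, per Section 3.2.2).

    The paragraph is zero-padded on the right/bottom so that every
    pixel falls into exactly one tile (padding policy is assumed).
    """
    h, w = paragraph.shape[:2]
    ph = (block - h % block) % block   # rows of padding needed
    pw = (block - w % block) % block   # columns of padding needed
    padded = np.pad(paragraph, ((0, ph), (0, pw)), mode="constant")
    return [padded[y:y + block, x:x + block]
            for y in range(0, padded.shape[0], block)
            for x in range(0, padded.shape[1], block)]
```

With 64-pixel tiles, a paragraph scanned at typical office resolutions yields blocks spanning a few characters each, which is the granularity at which a single altered character remains visible in the block's feature vector.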
To extract features from the cropped image blocks, the proposed scheme employs a contrastive learning-based network and introduces a loss function to distinguish the features extracted from the input and candidate reference document images. This enables the model to extract features that are relevant to document content, while minimizing the influence of distortions introduced by the PS channel. Specifically, image blocks cropped from the paragraphs of the reference document image, genuine input document image, and tampered input document image are used as anchor, positive, and negative samples, respectively, in the training of the contrastive learning network. The architecture of the contrastive learning framework employed in the proposed scheme is illustrated in Figure 12.
To enhance the network’s feature learning capabilities, a loss function is proposed as
$$
\mathcal{L}\!\left(X_a^i, X_g^i, X_t^i\right) = \max\!\left(0,\; m_g - S\!\left(f(X_a^i), f(X_g^i)\right)\right) + \max\!\left(0,\; S\!\left(f(X_a^i), f(X_t^i)\right) - m_t\right), \tag{16}
$$

where $f(X_a^i)$, $f(X_g^i)$, and $f(X_t^i)$ denote the feature vectors corresponding to the $i$-th anchor, genuine, and tampered image blocks, respectively; $m_g$ and $m_t$ denote the similarity thresholds for the anchor–genuine and anchor–tampered sample pairs, respectively; and $S(\cdot)$ denotes the cosine similarity.
In the contrastive learning framework, the ResNet50 network [44] is employed as the feature extraction backbone. During training, the thresholds m g and m t are set to 0.8 and 0.2 , respectively. The use of margin-based thresholds in the proposed loss function (16) ensures that the loss becomes 0 if the similarity between anchor-genuine pairs exceeds m g , or the similarity between anchor-tampered pairs falls below m t . This allows the model to concentrate on relatively harder or more ambiguous training samples, while reducing the influence of samples that are already well-separated in the feature space. Such a mechanism helps mitigate overfitting by limiting unnecessary optimization on trivial examples, and potentially improves training efficiency by reducing redundant gradient updates.
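For concreteness, the margin-based loss in (16) can be written directly from its definition. This NumPy sketch (function names are ours) returns zero when the anchor–genuine similarity already exceeds m_g and the anchor–tampered similarity is already below m_t, so well-separated triplets contribute no gradient:

```python
import numpy as np

def cosine_sim(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity S(u, v) between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def contrastive_margin_loss(f_a, f_g, f_t, m_g: float = 0.8, m_t: float = 0.2) -> float:
    """Loss of Eq. (16): penalize anchor-genuine pairs whose similarity falls
    below m_g and anchor-tampered pairs whose similarity exceeds m_t."""
    return (max(0.0, m_g - cosine_sim(f_a, f_g))
            + max(0.0, cosine_sim(f_a, f_t) - m_t))
```

A perfectly separated triplet (anchor identical to the genuine feature, orthogonal to the tampered one) yields zero loss, matching the margin behavior described above.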

3.2.3. Content Change Detection

In this step, the trained contrastive learning framework is employed to detect content changes at the paragraph level from the input document image. The block diagram of the content change detection method in the proposed scheme is illustrated in Figure 13.
The verification of the i-th image block cropped from the paragraph segments of the input document images is conducted as follows:
$$
D\!\left(X_t^i, X_r^i\right) = \begin{cases} 0, & S\!\left(f(X_t^i), f(X_r^i)\right) \geq t, \\ 1, & S\!\left(f(X_t^i), f(X_r^i)\right) < t, \end{cases} \tag{17}
$$

where $f(X_t^i)$ and $f(X_r^i)$ denote the feature vectors extracted from the $i$-th cropped image blocks of the input document paragraph image and the corresponding candidate reference paragraph image, respectively, and $t$ denotes the detection decision threshold.
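The per-block decision in (17) is a simple threshold on cosine similarity; a minimal sketch (the helper name is our own):

```python
import numpy as np

def detect_block(f_in: np.ndarray, f_ref: np.ndarray, t: float = 0.9) -> int:
    """Decision rule of Eq. (17): label a block pair 0 (genuine) when the
    cosine similarity of their features reaches threshold t, else 1 (changed).
    The default t = 0.9 mirrors the high-threshold setting used in the
    experiments; any value in (0, 1) may be substituted."""
    sim = float(np.dot(f_in, f_ref) / (np.linalg.norm(f_in) * np.linalg.norm(f_ref)))
    return 0 if sim >= t else 1
```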
Although the proposed scheme performs the paragraph alignment step described in Section 3.2.1, the geometric distortions introduced by the PS channel may still leave residual misalignment among the cropped image blocks within the aligned paragraphs of the input and reference document images. Such misalignment between the input and reference documents can degrade the accuracy of content change detection. Since precise feature alignment is essential for reliable detection, and a relatively high threshold value t is required to capture as much changed content as possible, even slight misalignment can lead to false detections for cropped image blocks. To address this issue, we introduce an additional verification step that refines the initially detected tampered image blocks, significantly improving overall detection accuracy. Specifically, for the set of cropped image blocks initially classified as containing content changes, denoted as
$$
\bar{X} = \left\{\, i \;\middle|\; D\!\left(X_t^i, X_r^i\right) = 1 \,\right\}, \tag{18}
$$

the proposed method applies a sliding window of size 64 × 64 pixels within a 96 × 96-pixel region centered on each image block in $\bar{X}$. This region includes the corresponding image block and its neighboring areas, as illustrated in Figure 14.
Based on empirical results, the step size of the sliding window is set to 8 pixels, resulting in 25 cropped regions per image block.
The final detection result for each cropped image block in X ¯ is determined by
$$
\hat{D}\!\left(X_t^i, X_r^i\right) = \begin{cases} 0, & \text{if } S\!\left(f(X_t^{i,j}), f(X_r^i)\right) \geq t \text{ for any } j, \\ 1, & \text{otherwise}, \end{cases} \quad i \in \bar{X},\; j = 1, \ldots, 25, \tag{19}
$$

where $X_t^{i,j}$ denotes the $j$-th image block obtained using the sliding window within the 96 × 96-pixel image region.
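This verification step can be sketched as follows: enumerate the 25 window offsets (stride 8 within the 96 × 96 region) and accept a block as genuine if any window position reaches the threshold. Function names are our own:

```python
def window_offsets(region: int = 96, window: int = 64, step: int = 8):
    """Top-left offsets of a sliding window inside the search region:
    (96 - 64) / 8 + 1 = 5 positions per axis, 25 in total."""
    positions = range(0, region - window + 1, step)
    return [(dy, dx) for dy in positions for dx in positions]

def verify_block(window_sims, t: float = 0.9) -> int:
    """Final decision of Eq. (19): a block flagged as changed is reclassified
    as genuine (0) if any of the 25 window similarities reaches t; otherwise
    it remains flagged (1)."""
    return 0 if any(s >= t for s in window_sims) else 1
```

A block whose mismatch was caused only by a small shift will match the reference at one of the offset positions and therefore be cleared, which is exactly how the step suppresses misalignment-induced false positives.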
Finally, for each candidate reference document image in the set D 3 , the input document image is compared at the paragraph level. If all detection results for image block pairs within a matched paragraph between the input and a given reference document indicate no change, the paragraph is classified as genuine with respect to that reference document image. Conversely, if any block pair within the paragraph is detected as changed, the paragraph is classified as different relative to that reference document. This procedure is repeated for all candidate reference document images in D 3 .
It is important to note that the presence of detected differences does not necessarily imply that the input document has been tampered with, as comparisons may be made against reference documents with similar layouts but different content, as illustrated in Figure 16. Therefore, the input document image is classified as tampered only if for every candidate reference document image I i D 3 , at least one paragraph is detected as different.
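The paragraph- and document-level decision logic described above reduces to two aggregations; a sketch under our own naming, where the nested lists stand in for the block decisions per paragraph per candidate reference:

```python
def paragraph_changed(block_decisions) -> bool:
    """A paragraph differs from a reference if any block pair is flagged (1)."""
    return any(d == 1 for d in block_decisions)

def document_tampered(per_reference_paragraphs) -> bool:
    """The input is declared tampered only if EVERY candidate reference in D3
    shows at least one differing paragraph; a single fully matching reference
    suffices to classify the input as genuine."""
    return all(
        any(paragraph_changed(p) for p in paragraphs)
        for paragraphs in per_reference_paragraphs
    )
```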

4. Experimental Results

The performance of the proposed document image verification scheme was evaluated through extensive experiments, and the results are presented in this section. To facilitate the experiments, two document image databases were constructed to reflect practical scenarios: the first is used to evaluate the performance of the proposed document image retrieval method, while the second is used to assess the performance of the proposed scheme in an end-to-end manner. Based on the constructed databases, two experiments were conducted to evaluate the document image retrieval method and the proposed document verification scheme.

4.1. Database Construction

To evaluate the proposed document image verification scheme, two document image databases were first constructed. The first database, referred to as Database A, was derived from the publicly available PubLayNet dataset [41], which was originally developed for document layout analysis. A total of 40,000 samples were randomly selected from the training set of PubLayNet and used as training data to fine-tune the Faster R-CNN network [40] for the paragraph segmentation step. The test set of Database A was built by first randomly selecting 20,000 document images from the PubLayNet dataset. These samples serve as the reference document images for testing the document image retrieval method, with paragraph segmentation labels provided by the PubLayNet dataset. The input document images in the test set consist of three parts. The first part includes 100 samples selected from the constructed reference document image database. The second part is constructed by adding noise, brightness, and contrast changes to the same 100 samples in order to simulate the print and scan process. The third part is constructed by printing and scanning the same 100 samples to introduce the effects of the PS channel.
The second database, named Database B, consists of three parts, with each part including 50 document images. The number of paragraphs in each document image ranges from 6 to 15, resulting in a total of 400 paragraph images. The first part includes digital document paragraph images in both English and Chinese, collected from PubMed Central and CNKI [45], respectively. Among the 50 samples in the first part, 40 are collected from PubMed Central for English documents, and 10 are collected from CNKI for Chinese documents. The paragraphs segmented from the collected samples are manually annotated. The second part includes document paragraph images produced by printing and scanning the documents from the first part. The third part includes document paragraph images with subtle content changes. These are generated using Adobe Photoshop to modify and delete parts of the content of the document images from the first and second parts, including modifications to text, images, tables, and other types of content. In addition, the document images modified from the first part are then printed and scanned. Examples of the tampered document images are shown in Figure 15.
In both databases, the image samples that undergo the PS process were generated using a Canon imageRUNNER ADVANCE C5560i printer at 300 DPI. The printed documents were scanned using two scanners, the KONICA MINOLTA bizhub 554e and the Canon imageRUNNER ADVANCE C5560i, both operating at a 300 PPI scanning resolution.

4.2. Evaluation of the Proposed Document Image Retrieval Method

In this section, the performance of the document image retrieval method in the proposed scheme is evaluated. The performance metrics include Hit Rate, Precision, Average Retrieval Time, and Retrieval Data Size. The Hit Rate here measures the success rate of including the correct reference document image in the retrieval result. The Precision here measures the proportion of correctly retrieved document images among the retrieval results. The Average Retrieval Time measures the average running time in milliseconds (ms). The Retrieval Data Size measures the size of the retrieval result data. These metrics assess not only the accuracy of the retrieved results but also the method's efficiency.
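As a reference for how the two accuracy metrics can be computed, a small sketch follows; the dictionary-based interface, in which each query maps to its returned candidate set and its single ground-truth reference, is our own assumption:

```python
def hit_rate(retrievals: dict, ground_truth: dict) -> float:
    """Fraction of queries whose correct reference appears in the result set."""
    hits = sum(1 for q, results in retrievals.items() if ground_truth[q] in results)
    return hits / len(retrievals)

def retrieval_precision(retrievals: dict, ground_truth: dict) -> float:
    """Proportion of correctly retrieved images among the results,
    averaged over queries (one relevant reference per query assumed)."""
    per_query = [
        (1 if ground_truth[q] in results else 0) / max(len(results), 1)
        for q, results in retrievals.items()
    ]
    return sum(per_query) / len(per_query)
```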
Database A is used in this part of the experiments. Its training set, consisting of 40,000 document images, is used to fine-tune the pretrained Faster R-CNN network [20] for paragraph segmentation, while its test set, including 300 document images, is used to evaluate the proposed method in practical scenarios. The proposed document image retrieval method is compared with benchmark approaches, including the SIFT-Based Bag of Visual Words (BoVW) [46] and the ResNet-50+Facebook AI Similarity Search (FAISS) [47]. Since the image samples in the PubLayNet dataset are collected with low resolution, which makes OCR recognition unreliable, we did not include OCR-based methods in comparison. All methods were run on a Dell G3 laptop with Intel(R) Core(TM) i5-9300H CPU. For fairness, the average retrieval time reported here measures only the feature vector comparison for the benchmark methods and the paragraph matching process for the proposed method. It does not include the time used for feature extraction and paragraph segmentation. The experimental results are listed in Table 2.
The experimental results show a Hit Rate of 96 % among the test samples. The high Hit Rate indicates the effectiveness of the proposed document image retrieval method in correctly identifying the reference document images. The proposed method achieves an average retrieval time of 24.29 ms and a data size of 1.33 MB, both superior to the benchmark methods, indicating its high efficiency. Although the Hit Rate is slightly lower than that of the benchmark methods, most retrieval failures are caused by inaccurate paragraph segmentation in the input document images. Such segmentation errors lead to significant deviations in the computed descriptors compared to those of the candidate reference images, resulting in failed retrieval. An example of this type of error is shown in the left part of Figure 16. Retrieval errors mainly occur when reference document images have similar paragraph structures, which leads to the retrieval of visually similar but incorrect candidate reference document images, as illustrated in the right part of Figure 16. However, as long as the correct reference document image is included among the retrieval results, the proposed scheme can still identify it during the content change detection stage, at the cost of efficiency.
Moreover, a comparative experiment is conducted to evaluate the use of different paragraph segmentation methods on the proposed document image retrieval method. The results are listed in Table 3.
The results in Table 3 show that the impact of different document image segmentation methods on the performance of the proposed document image retrieval approach is minor. Therefore, the use of Faster R-CNN in our method can be regarded as a general choice.
Finally, an ablation study was conducted to evaluate the impact of each step in the 4-step matching procedure on the overall retrieval performance. The experimental results are listed in Table 4.
The results of the ablation study show that removing the first three matching steps (count, area, and shape matching) only affects the retrieval time, whereas the fourth step, position matching, has a significant effect on Precision. This is because the position matching step establishes the paragraph correspondences, without which the layout of the input document image cannot be matched to the layout of the candidate reference document images, as shown in Figure 9.

4.3. Evaluation of the Proposed Document Verification Scheme

The second experiment was conducted to evaluate the proposed document image verification scheme in an end-to-end manner, where the input document image is first passed to the reference document image retrieval method, and the final output is the content change detection result. A total of 1600 anchor blocks, 1600 genuine blocks, and 1600 tampered blocks were selected from the segmented paragraph images in Database B to fine-tune the contrastive learning network shown in Figure 12, where the backbone network is a pretrained ResNet50 network [44]. Specifically, the anchor block, genuine block, and tampered block in Figure 12 are image blocks cropped from the paragraph images in the first, second, and third parts of Database B, respectively. The training was conducted with a batch size of 16. The network was trained for 50 epochs, with an initial learning rate l r set to 0.001 , which was reduced by 50 % every 10 epochs to improve convergence. Moreover, the fully connected layers of the ResNet50 backbone were removed to use it purely for feature extraction, ensuring the network focuses on learning meaningful representations for content change detection.
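The step learning-rate schedule described above (initial rate 0.001, halved every 10 epochs) admits a closed form; a sketch with our own function name:

```python
def learning_rate(epoch: int, lr0: float = 1e-3, drop: float = 0.5, every: int = 10) -> float:
    """Step schedule from the training setup: starting at lr0, the rate is
    multiplied by `drop` after each block of `every` epochs."""
    return lr0 * drop ** (epoch // every)
```

In a framework such as PyTorch, the same behavior corresponds to a step scheduler with step size 10 and decay factor 0.5.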
To evaluate the proposed document verification scheme, we used 400 genuine paragraph images that had undergone the PS process, along with 400 paragraph images containing subtle content changes, selected from the second and third parts of Database B. The tampered images were labeled as True Positives (TP), while the genuine images were labeled as True Negatives (TN). We first evaluated the proposed document verification scheme without the additional verification step under various values of the threshold t in (17). The results are listed in Table 5.
The results in Table 5 show that the proposed method is capable of detecting subtle content changes in document images by achieving high recall when the threshold t is large. Image blocks cropped from the input image are classified as genuine if their similarity to the corresponding block in the reference paragraph exceeds the threshold t, as in Equation (17), and as tampered if it falls below. As a result, increasing the threshold t makes it more difficult for image blocks to be classified as genuine, leading to a higher number of image blocks being classified as tampered. Since an input paragraph image is classified as tampered if at least one image block is classified as tampered, the number of TP increases, and recall improves as t increases. At the same time, the number of FP also increases due to the stricter threshold, which leads to a gradual decrease in precision.
According to our observations, the performance of the proposed scheme without the additional verification step is influenced by the size of the paragraph. Specifically, for larger paragraphs, the cropped image blocks suffer from more alignment errors within the paragraph image due to distortions introduced by the PS channel. To address this issue, we introduce an additional verification step, as shown in Figure 13, and the corresponding results are presented in Table 6.
In the additional verification step, image blocks initially classified as tampered are verified again using a sliding window, as described in Section 3.2.3. This step effectively reduces classification errors caused by alignment issues within paragraphs by examining the neighboring regions of these blocks, where discrepancies with the reference paragraph image may result from misalignment rather than actual content changes. As a result, it significantly reduces the number of FP under the high threshold setting, while having little impact on the number of TP. Consequently, precision improves greatly while recall remains almost unchanged, as listed in Table 6. Finally, comparative experiments were conducted using benchmark methods, including those based on OCR [6], local image features [31,32,33], and alternative backbone networks [44,48,49,50,51], and ablation experiments were performed on the paragraph alignment.
All experiments were conducted on Database B, and the results are presented in Table 7.
In the first part of the results, the entire segmented paragraph images are used as input to the benchmark methods. Since cropping is performed at a fixed size, characters in the paragraph image are often split across image blocks. As a result, OCR-based methods cannot be directly applied to these cropped blocks, and the entire segmented paragraph images are used as input for the OCR-based detection method [6]. Additionally, OCR-based methods cannot process paragraphs containing non-text content. Consequently, only text paragraphs are evaluated for this method. As shown in Table 7, the OCR-based method [6] achieves high accuracy in detecting content changes in text paragraphs. However, it suffers from a high false detection rate due to character recognition errors caused by distortions in the PS channel. Content change detection methods based on local image features perform poorly when applied to entire paragraph images. These methods exhibit limited sensitivity to subtle content changes, as such changes are often too minor relative to the overall paragraph content to be reflected in the extracted feature vectors. Moreover, the extracted local features are also degraded by distortions introduced from the PS channel, leading to further performance degradation.
In the second part of the results, methods that use image blocks cropped from segmented paragraph images for content change detection are evaluated. The benchmark methods include those based on local image features, as well as the contrastive learning method illustrated in Figure 12, which utilizes deep features extracted from various backbone networks. The paragraph alignment step described in Section 3.2.1 is applied to all methods in this part, and no additional detection verification step is used. For the contrastive learning methods, the threshold t is set to 0.5 , as this value provides a reasonable balance between precision and recall. The results show that the performance of these methods is significantly improved compared to those in the first part. This improvement is attributed to the smaller size of the cropped image blocks, which allows content changes to be more prominently reflected in the feature vectors.
The third part of Table 7 lists the result of the proposed scheme with the additional verification step included but the paragraph alignment step removed, with the threshold t set to 0.9 . Without paragraph alignment, the Precision of content change detection decreased significantly due to the misalignment of image features: paragraphs without content changes are incorrectly classified as modified. This demonstrates the importance of the paragraph alignment step in the proposed scheme for improving Precision in document image content change detection.
In the fourth part of the results, both the paragraph alignment and the additional verification step are applied, with the threshold t set to 0.9 . As shown in the results, this leads to a significant improvement in precision compared to the results in the second and third parts. Although the Precision is slightly lower than that of the contrastive learning method using the VGG19 backbone [48], with t = 0.5 , the Recall is substantially higher due to the higher threshold value of t = 0.9 .

5. Conclusions

In this work, we proposed a two-stage document image verification scheme that detects content changes in input document images by comparing their content with corresponding reference document images stored in a database. The proposed scheme achieves more efficient retrieval in both time and data size compared to benchmark methods, by exploiting structural features of the input and reference document images through a 4-step paragraph matching procedure. For content change detection, the scheme performs paragraph alignment and employs a contrastive learning network with a tailored loss function to better extract features from input images affected by distortions introduced during the print–scan process. An additional verification step is introduced to further account for residual geometric distortions, significantly reducing the false positive rate. Moreover, the scheme is language-agnostic and content-independent, relying solely on paragraph-level layout features rather than textual content, enabling it to handle documents in arbitrary languages and with diverse content. However, the effectiveness of the proposed scheme relies on accurate paragraph segmentation, which may be challenged by documents with highly variable or uncommon content types. Additionally, since the proposed scheme is sensitive to feature mismatch, rotation of the input document image relative to the reference image can easily impact the detection performance. Improving segmentation robustness for diverse document types and integrating accurate recovery of document-level geometric distortions will be important directions for future work.

Author Contributions

Conceptualization, Z.H.; Methodology, W.J. and Z.H.; Software, W.J.; Validation, D.L., Q.Y. and Z.H.; Formal analysis, W.J.; Investigation, W.J.; Resources, W.J. and Z.H.; Data curation, W.J.; Writing—original draft preparation, W.J. and Z.H.; Writing—review and editing, Z.H.; Supervision, D.L. and Z.H.; Project administration, D.L. and Z.H.; Funding acquisition, D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
PS: Print–Scan
OCR: Optical Character Recognition
TP: True Positives
FP: False Positives
TN: True Negatives
R-CNN: Region-based Convolutional Neural Network
SIFT: Scale-Invariant Feature Transform
SURF: Speeded-Up Robust Features
ORB: Oriented FAST and Rotated BRIEF
DPI: Dots per Inch
PPI: Pixels per Inch

References

  1. Beusekom, J.V.; Shafait, F.; Breuel, T.M. Text-Line Examination for Document Forgery Detection. Int. J. Doc. Anal. Recognit. (IJDAR) 2013, 16, 189–207. [Google Scholar] [CrossRef]
  2. Jain, R.; Doermann, D.S. Localized Document Image Change Detection. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR’15), Tunis, Tunisia, 23–26 August 2015; pp. 786–790. [Google Scholar] [CrossRef]
  3. Park, D.; Kim, S.; Kim, M.; Yarram, N.R.; Joe, S.; Gwon, Y.; Choi, J. Document Change Detection With Hierarchical Patch Comparison. In Proceedings of the IEEE International Conference on Image Processing (ICIP’23), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 665–669. [Google Scholar] [CrossRef]
  4. Park, D.; Yarram, N.R.; Kim, S.; Kim, M.; Cho, S.; Lee, T. Text Change Detection in Multilingual Documents Using Image Comparison. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV’25), Tucson, AZ, USA, 26 February–6 March 2025; pp. 5218–5227. [Google Scholar] [CrossRef]
  5. Hu, Z.; Chen, C.; Mow, W.H.; Huang, J. Document Recapture Detection Based on a Unified Distortion Model of Halftone Cells. IEEE Trans. Inf. Forensics Secur. 2022, 17, 2800–2815. [Google Scholar] [CrossRef]
  6. Mahdi, M.G.; Sleem, A.; Elhenawy, I. Deep Learning Algorithms for Arabic Optical Character Recognition: A Survey. Multicriteria Algorithms Appl. 2024, 2, 65–79. [Google Scholar] [CrossRef]
  7. Jain, R.; Doermann, D.S. VisualDiff: Document Image Verification and Change Detection. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR’13), Washington, DC, USA, 25–28 August 2013; pp. 40–44. [Google Scholar] [CrossRef]
  8. Nguyen, T.T.H.; Jatowt, A.; Coustaty, M.; Doucet, A. Survey of Post-OCR Processing Approaches. ACM Comput. Surv. (CSUR) 2021, 54, 1–37. [Google Scholar] [CrossRef]
  9. Neudecker, C.; Baierer, K.; Gerber, M.; Clausner, C.; Antonacopoulos, A.; Pletschacher, S. A Survey of OCR Evaluation Tools and Metrics. In Proceedings of the International Workshop on Historical Document Imaging and Processing (HIP’21), Lausanne, Switzerland, 5–6 September 2021; pp. 13–18. [Google Scholar] [CrossRef]
  10. Dutta, A.; Garai, A.; Biswas, S.; Das, A.K. Segmentation of Text Lines Using Multi-Scale CNN from Warped Printed and Handwritten Document Images. Int. J. Doc. Anal. Recognit. (IJDAR) 2021, 24, 299–313. [Google Scholar] [CrossRef]
  11. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  12. Biswas, S.; Riba, P.; Lladós, J.; Pal, U. Beyond Document Object Detection: Instance-Level Segmentation of Complex Layouts. Int. J. Doc. Anal. Recognit. (IJDAR) 2021, 24, 269–281. [Google Scholar] [CrossRef]
  13. Wei, H.; Gao, G. A Keyword Retrieval System for Historical Mongolian Document Images. Int. J. Doc. Anal. Recognit. (IJDAR) 2014, 17, 33–45. [Google Scholar] [CrossRef]
  14. Yan, H.; Watanabe, T. Document page retrieval based on geometric layout features. In Proceedings of the International Conference on Ubiquitous Information Management and Communication (ICUIMC ’13), Kota Kinabalu, Malaysia, 17–19 January 2013; pp. 1–8. [Google Scholar] [CrossRef]
  15. Barbuzzi, D.; Massaro, A.; Galiano, A.M.; Pellicani, L.; Pirlo, G.; Saggese, M. Multi-Domain Intelligent System for Document Image Retrieval. Int. J. Adapt. Innov. Syst. 2019, 2, 282–297. [Google Scholar] [CrossRef]
  16. Hu, S.; Wang, Q.; Huang, K.; Wen, M.; Coenen, F. Retrieval-Based Language Model Adaptation for Handwritten Chinese Text Recognition. Int. J. Doc. Anal. Recognit. (IJDAR) 2022, 26, 109–119. [Google Scholar] [CrossRef]
  17. Doermann, D.S. The Indexing and Retrieval of Document Images: A Survey. Comput. Vis. Image Underst. 1998, 70, 287–298. [Google Scholar] [CrossRef]
  18. Mitra, M.; Chaudhuri, B.B. Information Retrieval from Documents: A Survey. Inf. Retr. 2000, 2, 141–163. [Google Scholar] [CrossRef]
  19. Dutta, A.; Biswas, S.; Das, A.K. CNN-Based Segmentation of Speech Balloons and Narrative Text Boxes from Comic Book Page Images. Int. J. Doc. Anal. Recognit. (IJDAR) 2021, 24, 49–62. [Google Scholar] [CrossRef]
  20. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  21. Azmi, R.; Akbari, M.; Akbari, H.; Shirazi, H.R.G. LGL-DIR: Layout Graph for Layout Based Document Image Retrieval. In Proceedings of the International Conference on Education Technology and Computer (ICETC’10), Shanghai, China, 22–24 June 2010; pp. 262–266. [Google Scholar] [CrossRef]
  22. Hoai, D.P.V.; Duong, H.T.; Hoang, V.T. Text Recognition for Vietnamese Identity Card Based on Deep Features Network. Int. J. Doc. Anal. Recognit. (IJDAR) 2021, 24, 123–131. [Google Scholar] [CrossRef]
  23. Namboodiri, A.M.; Jain, A.K. Document Structure and Layout Analysis. In Digital Document Processing: Major Directions and Recent Advances; Springer: London, UK, 2007. [Google Scholar]
  24. Shin, C.K.; Doermann, D.S.; Rosenfeld, A. Classification of document pages using structure-based features. Int. J. Doc. Anal. Recognit. 2001, 3, 232–247. [Google Scholar] [CrossRef]
  25. Tang, Z.; Yang, Z.; Wang, G.; Fang, Y.; Liu, Y.; Zhu, C.; Zeng, M.; Zhang, C.Y.; Bansal, M. Unifying Vision, Text, and Layout for Universal Document Processing. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 19254–19264. [Google Scholar]
  26. Umer, S.; Mondal, R.; Pandey, H.M.; Rout, R.K. Deep features based convolutional neural network model for text and non-text region segmentation from document images. Appl. Soft Comput. 2021, 113, 107917. [Google Scholar] [CrossRef]
  27. Dixit, U.D.; Shirdhonkar, M.S. An Improved Fingerprint-Based Document Image Retrieval Using Multi-Resolution Histogram of Oriented Gradient Features. Int. J. Eng. 2022, 35, 750–759. [Google Scholar] [CrossRef]
  28. Lombardi, F.; Marinai, S. Deep Learning for Historical Document Analysis and Recognition—A Survey. J. Imaging 2020, 6, 110. [Google Scholar] [CrossRef]
  29. van Strien, D.A.; Beelen, K.; Ardanuy, M.C.; Hosseini, K.; McGillivray, B.; Colavizza, G. Assessing the Impact of OCR Quality on Downstream NLP Tasks. In Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART’20), Valletta, Malta, 22–24 February 2020; Volume 1, pp. 484–496. [Google Scholar] [CrossRef]
  30. Cheng, H.; Zhang, P.; Wu, S.; Zhang, J.; Zhu, Q.; Xie, Z.; Li, J.; Ding, K.; Jin, L. M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’23), Vancouver, BC, Canada, 17–24 June 2023; pp. 15138–15147. [Google Scholar] [CrossRef]
  31. Bay, H.; Tuytelaars, T.; Gool, L.V. SURF: Speeded Up Robust Features. In Proceedings of the European Conference on Computer Vision (ECCV’06), Graz, Austria, 7–13 May 2006; pp. 404–417. [Google Scholar] [CrossRef]
  32. Ke, Y.; Sukthankar, R. PCA-SIFT: A More Distinctive Representation for Local Image Descriptors. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04), Washington, DC, USA, 27 June 2004; Volume 2. [Google Scholar] [CrossRef]
  33. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G.R. ORB: An Efficient Alternative to SIFT or SURF. In Proceedings of the International Conference on Computer Vision (ICCV’11), Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar] [CrossRef]
  34. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  35. de Araujo, P.H.L.; de Almeida, A.P.G.S.; Braz, F.A.; da Silva, N.C.; de Barros Vidal, F.; de Campos, T. Sequence-Aware Multimodal Page Classification of Brazilian Legal Documents. Int. J. Doc. Anal. Recognit. (IJDAR) 2022, 26, 33–49. [Google Scholar] [CrossRef]
  36. Shirdhonkar, M.S.; Kokare, M. Document Image Retrieval Using Signature as Query. In Proceedings of the International Conference on Computer and Communication Technology (ICCCT’11), Allahabad, India, 15–17 September 2011; pp. 66–70. [Google Scholar] [CrossRef]
  37. Bay, H.; Ess, A.; Tuytelaars, T.; Gool, L.V. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar] [CrossRef]
  38. Da, C.; Luo, C.; Zheng, Q.; Yao, C. Vision Grid Transformer for Document Layout Analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’23), Paris, France, 1–6 October 2023; pp. 19462–19472. [Google Scholar] [CrossRef]
  39. Markewich, L.; Zhang, H.; Xing, Y.; Lambert-Shirzad, N.; Jiang, Z.; Lee, R.K.W.; Li, Z.; Ko, S.B. Segmentation for Document Layout Analysis: Not Dead Yet. Int. J. Doc. Anal. Recognit. (IJDAR) 2022, 25, 67–77. [Google Scholar] [CrossRef]
  40. Elanwar, R.I.; Qin, W.; Betke, M.; Wijaya, D.T. Extracting Text from Scanned Arabic Books: A Large-Scale Benchmark Dataset and A Fine-Tuned Faster-R-CNN Model. Int. J. Doc. Anal. Recognit. (IJDAR) 2021, 24, 349–362. [Google Scholar] [CrossRef]
  41. Zhong, X.; Tang, J.; Jimeno-Yepes, A. PubLayNet: Largest Dataset Ever for Document Layout Analysis. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR’19), Sydney, NSW, Australia, 20–25 September 2019; pp. 1015–1022. [Google Scholar] [CrossRef]
  42. Biswas, S.; Banerjee, A.; Lladós, J.; Pal, U. DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer. arXiv 2022, arXiv:2201.11438. [Google Scholar]
  43. Maity, S.; Biswas, S.; Manna, S.; Banerjee, A.; Lladós, J.; Bhattacharya, S.; Pal, U. SelfDocSeg: A Self-Supervised Vision-Based Approach towards Document Segmentation. arXiv 2023, arXiv:2305.00795. [Google Scholar]
  44. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  45. Liu, X.; Chau, K.Y.; Liu, X.; Wan, Y. The Progress of Smart Elderly Care Research: A Scientometric Analysis Based on CNKI and WOS. Int. J. Environ. Res. Public Health 2023, 20, 1086. [Google Scholar] [CrossRef] [PubMed]
  46. Shekhar, R. Document Image Retrieval Using Bag of Visual Words Model. Master’s Thesis, International Institute of Information Technology Hyderabad, Hyderabad, India, 2013. [Google Scholar]
  47. Rahman, M.S.; Rabbi, S.M.E.; Rashid, M.M. Optimizing Domain-Specific Image Retrieval: A Benchmark of FAISS and Annoy with Fine-Tuned Features. arXiv 2024, arXiv:2412.01555. [Google Scholar]
  48. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations (ICLR’15), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  49. Sandler, M.; Howard, A.G.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  50. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-Level Accuracy with 50× Fewer Parameters and <0.5 MB Model Size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  51. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
Figure 1. An example of a tampering attack on document images. The left image shows the genuine document image, while the right image shows the tampered version created by altering the content of the left image.
Figure 2. The document image tampering attacks and detection framework considered in this work. The attacker either alters the content of the original digital document before conducting a print–scan process, or tampers with the scanned document images directly. Document image verification is performed by comparing the input document image to its corresponding reference image stored in the document database and conducting change detection.
Figure 3. Block diagram of the proposed document image verification scheme. The proposed scheme consists of two stages: the document image retrieval stage and the content change detection stage.
Figure 4. Examples of paragraph segmentation results produced by the Faster R-CNN model [40], trained on the PubLayNet dataset [41]. The example document images contain various paragraphs, including text paragraphs, titles, and figures. Detected paragraph regions are highlighted with red bounding boxes, demonstrating the effectiveness of the Faster R-CNN approach in capturing diverse structural components across different document types.
Figure 5. The block diagram of the proposed 4-step paragraph matching procedure.
Figure 6. An example of the count matching results, where the detected paragraph regions are marked with red bounding boxes. The resulting candidate reference document images contain 8 paragraphs, matching those in the input document image.
Figure 7. An example of the area matching results, where the detected paragraph regions are marked with red bounding boxes. The resulting candidate reference document images contain 2 large-area paragraphs and 6 small-area paragraphs, matching those in the input document image.
Figure 8. An example of the shape matching results, where the detected paragraph regions are marked with red bounding boxes. The resulting candidate reference document images contain 2 paragraphs with a width greater than height and 6 paragraphs with a width less than height, matching those in the input document image.
Figure 9. An example of the position matching results, where the detected paragraph regions are marked with red bounding boxes. The resulting candidate reference document images contain a similar layout to the input document image.
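The four filtering stages illustrated in Figures 5–9 can be sketched as a cascade over the stored descriptors, cheapest test first. The following Python sketch is illustrative only: the descriptor layout, the tolerances `area_thr` and `pos_thr`, and the use of normalized [0, 1] coordinates are assumptions, not details taken from the paper.

```python
# Illustrative sketch of the 4-step matching cascade (count -> area -> shape
# -> position) from Figures 5-9. Descriptor fields and thresholds are assumed.

def match_candidates(query, references, area_thr=0.05, pos_thr=0.1):
    """Filter reference descriptors stage by stage, cheapest test first."""
    # Step 1: count matching -- keep references with the same paragraph count.
    cands = [r for r in references if r["count"] == query["count"]]
    # Step 2: area matching -- sorted normalized areas agree within a tolerance.
    cands = [r for r in cands
             if all(abs(a - b) <= area_thr
                    for a, b in zip(sorted(query["areas"]), sorted(r["areas"])))]
    # Step 3: shape matching -- same number of wide (aspect ratio > 1) paragraphs.
    wide = sum(x > 1 for x in query["ratios"])
    cands = [r for r in cands if sum(x > 1 for x in r["ratios"]) == wide]
    # Step 4: position matching -- paragraph centers must lie close together.
    def centers(d):
        return sorted(((x0 + x1) / 2, (y0 + y1) / 2)
                      for x0, y0, x1, y1 in d["boxes"])
    qc = centers(query)
    cands = [r for r in cands
             if all(abs(cx - dx) <= pos_thr and abs(cy - dy) <= pos_thr
                    for (cx, cy), (dx, dy) in zip(qc, centers(r)))]
    return cands
```

Ordering the stages from cheapest (an integer comparison) to most expensive (coordinate comparison) keeps retrieval fast, which is consistent with the ablation timings in Table 4.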
Figure 10. An example of paragraph alignment. The bounding box coordinates b̃_k of the segmented paragraphs in the input document image are adjusted based on the relative position between b̃_k and its matched paragraph b_{i,k̄} in the reference document image. The bounding box marked in blue indicates the anchor paragraph indexed by k̄ in the reference document image and its corresponding matched paragraph in the input document image. The dashed bounding boxes marked in red indicate the other detected paragraphs.
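A minimal sketch of the alignment in Figure 10: every bounding box detected in the input document image is translated by the offset that maps the matched anchor paragraph onto its counterpart in the reference image. The (x0, y0, x1, y1) box format is an assumption.

```python
# Hedged sketch of the paragraph alignment in Figure 10. Boxes are assumed to
# be (x0, y0, x1, y1) tuples in pixel coordinates.

def align_boxes(input_boxes, anchor_idx, ref_anchor_box):
    ax0, ay0, _, _ = input_boxes[anchor_idx]   # anchor box in the input image
    rx0, ry0, _, _ = ref_anchor_box            # matched anchor in the reference
    dx, dy = rx0 - ax0, ry0 - ay0              # translation aligning the anchors
    return [(x0 + dx, y0 + dy, x1 + dx, y1 + dy)
            for x0, y0, x1, y1 in input_boxes]
```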
Figure 11. An example of paragraph image cropping, where the dashed red lines indicate the cropping boundaries, and each cropped image block has a size of 64 × 64 pixels. The segmented image is from the input document image and contains realistic distortions introduced by the print–scan (PS) process.
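The block cropping in Figure 11 can be sketched as follows for a grayscale paragraph image; padding the image with white (255) up to the next multiple of 64 is an assumption, not a detail stated in the caption.

```python
import numpy as np

# Sketch of the 64x64 block cropping in Figure 11 for a grayscale image.
# White padding (255) to the next multiple of the block size is assumed.

def crop_blocks(img, size=64):
    h, w = img.shape
    ph, pw = (-h) % size, (-w) % size          # padding to the next multiple
    padded = np.pad(img, ((0, ph), (0, pw)), constant_values=255)
    return [padded[y:y + size, x:x + size]     # non-overlapping 64x64 blocks
            for y in range(0, padded.shape[0], size)
            for x in range(0, padded.shape[1], size)]
```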
Figure 12. The architecture of the contrastive learning framework in the proposed scheme, in which the model takes an anchor image along with a genuine and a tampered image, encodes them using a shared feature extractor, and computes similarity scores to minimize the contrastive loss.
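One plausible realization of the objective in Figure 12 is a triplet-style hinge loss on cosine similarities: the anchor block should be more similar to its genuine (print–scanned) counterpart than to the tampered one. The margin value and the exact loss form are assumptions; the paper's formulation may differ.

```python
import numpy as np

# Triplet-style hinge loss on cosine similarities, a hedged sketch of the
# contrastive objective in Figure 12. The margin value is an assumption.

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def contrastive_loss(anchor, genuine, tampered, margin=0.2):
    s_pos = cosine(anchor, genuine)            # similarity to the genuine image
    s_neg = cosine(anchor, tampered)           # similarity to the tampered image
    return max(0.0, margin + s_neg - s_pos)    # hinge: push s_pos above s_neg
```

Minimizing this loss with a shared feature extractor makes print–scan distortions (which affect genuine pairs) less influential than true content changes.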
Figure 13. Block diagram of the proposed content change detection method. The deep features are extracted from the segmented paragraph images of the input and reference document images.
Figure 14. An example of the sliding window applied to each cropped image block of the paragraph image segmented from the input document image. The red solid rectangle and the blue dashed rectangle mark the sliding window and its containing region, respectively.
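The verification step suggested by Figure 14 can be sketched as a local search: before a block pair is declared changed, nearby offsets inside the containing region of the input image are tried, so that residual feature misalignment is not mistaken for tampering. The search radius and the mean-squared-error distance below are assumptions.

```python
import numpy as np

# Hedged sketch of the sliding-window verification in Figure 14: slide a 64x64
# window over the containing region (blue dashed rectangle) and keep the best
# (smallest) distance to the reference block. Radius and MSE are assumed.

def best_match_distance(ref_block, input_region, size=64, radius=4):
    h, w = input_region.shape
    best = np.inf
    for dy in range(0, min(2 * radius, h - size) + 1):
        for dx in range(0, min(2 * radius, w - size) + 1):
            cand = input_region[dy:dy + size, dx:dx + size]
            best = min(best, float(np.mean((cand - ref_block) ** 2)))
    return best
```

A low best-match distance anywhere in the window indicates the block is genuine but slightly shifted, which is how the step suppresses false positives.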
Figure 15. Document tampering examples. The red dashed rectangles mark the tampered regions. The left column shows the original digital documents without the print–scan process, the middle column shows the document images tampered using Adobe Photoshop, and the right column shows the tampered document images after the print–scan process.
Figure 16. Examples of the two types of errors in document image retrieval. The left figure shows the error caused by incorrect paragraph segmentation, while the right figure shows the error resulting from similar structural patterns between two document images.
Table 1. The constructed reference document image descriptor I i based on the segmentation results.
Field | Meaning | Example Value
i | Document index | i = 2013
c_i | Paragraph count | c_i = 9
{b_{i,1}, b_{i,2}, …, b_{i,c_i}} | Bounding box coordinates | b_{i,1} = (313.08, 448.95, 554.07, 730.32)
{a_{i,1}, a_{i,2}, …, a_{i,c_i}} | Normalized area of paragraphs | a_{i,1} = 0.5
{r_{i,1}, r_{i,2}, …, r_{i,c_i}} | Aspect ratio of paragraphs | r_{i,1} = 0.8565
{n_{i,1}, n_{i,2}, …, n_{i,c_i}} | Surrounding paragraph indices | n_{i,1} = (7, 0, 4, 0)
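The descriptor in Table 1 maps naturally onto a small record type. The following dataclass is a hypothetical sketch with illustrative field names; the paper does not prescribe an implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical record type for the per-document descriptor of Table 1.
@dataclass
class DocumentDescriptor:
    index: int                                      # document index i
    count: int                                      # paragraph count c_i
    boxes: List[Tuple[float, float, float, float]]  # bounding boxes b_{i,k}
    areas: List[float]                              # normalized areas a_{i,k}
    ratios: List[float]                             # aspect ratios r_{i,k}
    neighbors: List[Tuple[int, ...]]                # surrounding indices n_{i,k}

# Example populated with the values from Table 1 (one paragraph shown).
desc = DocumentDescriptor(
    index=2013, count=9,
    boxes=[(313.08, 448.95, 554.07, 730.32)],
    areas=[0.5], ratios=[0.8565], neighbors=[(7, 0, 4, 0)],
)
```

Storing only these scalar fields per document explains the small database footprint (1.33 MB in Table 2) compared with feature-vector indexes.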
Table 2. Experimental results of document image retrieval using the proposed and benchmark methods. The best results are highlighted in bold.
Methods | Precision | Hit Rate | Average Retrieval Time (ms) | Data Size (MB)
SIFT-BoVW [46] | 0.955 | 0.98 | 91.78 | 24.3
ResNet-50+FAISS [47] | 0.952 | 1.00 | 40.91 | 11.6
Proposed | 0.976 | 0.96 | 24.29 | 1.33
Table 3. Experimental results of document image retrieval using the proposed method with different paragraph segmentation methods.
Methods | Precision | Hit Rate
Faster R-CNN [20] | 0.976 | 0.96
Mask R-CNN [41] | 0.976 | 0.96
DocSegTr [42] | 0.979 | 0.97
SelfDocSeg [43] | 0.979 | 0.97
Table 4. Experimental results of the ablation study for the 4-step matching process in the proposed document image retrieval method.
Ablation Step | Precision | Hit Rate | Average Retrieval Time (ms)
All Steps | 0.976 | 0.96 | 24.29
Remove Count Matching | 0.976 | 0.96 | 28.24
Remove Area Matching | 0.976 | 0.96 | 26.25
Remove Shape Matching | 0.976 | 0.96 | 25.64
Remove Position Matching | 0.886 | 0.96 | 22.38
Table 5. Experimental results of the proposed content change detection method under different t without the additional verification step. The highest value in each column is highlighted in bold.
t | Precision | Recall | Accuracy | F1 Score
0.3 | 0.951 | 0.437 | 0.707 | 0.599
0.4 | 0.963 | 0.587 | 0.782 | 0.729
0.5 | 0.957 | 0.907 | 0.938 | 0.931
0.6 | 0.825 | 0.932 | 0.867 | 0.876
0.7 | 0.781 | 0.947 | 0.841 | 0.855
0.8 | 0.769 | 0.953 | 0.833 | 0.851
0.9 | 0.767 | 0.978 | 0.840 | 0.860
Table 6. Experimental results of the proposed content change detection method under different t with the additional verification step. The highest value in each column is highlighted in bold.
t | Precision | Recall | Accuracy | F1 Score
0.3 | 0.989 | 0.438 | 0.716 | 0.606
0.4 | 0.987 | 0.588 | 0.790 | 0.734
0.5 | 0.992 | 0.908 | 0.950 | 0.948
0.6 | 0.987 | 0.933 | 0.960 | 0.959
0.7 | 0.962 | 0.948 | 0.955 | 0.954
0.8 | 0.955 | 0.953 | 0.954 | 0.954
0.9 | 0.949 | 0.978 | 0.963 | 0.963
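The Precision, Recall, Accuracy, and F1 scores reported in Tables 5–7 follow the standard confusion-matrix definitions; a minimal sketch (with 1 denoting tampered and 0 genuine):

```python
# Standard confusion-matrix metrics, as used in Tables 5-7 (1 = tampered).

def metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp)                 # fewer false alarms -> higher
    recall = tp / (tp + fn)                    # fewer missed tamperings -> higher
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1
```

Under these definitions, the precision gains in Table 6 over Table 5 reflect the false positives removed by the additional verification step.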
Table 7. Results of the comparative experiments: (First section) Methods applied to entire segmented paragraph images; (Second section) Methods applied to image blocks cropped from segmented paragraphs; (Third section) Proposed method without the paragraph alignment; (Fourth section) Proposed method with paragraph alignment. The highest value in each column is highlighted in bold.
Method | Precision | Recall | Accuracy | F1 Score
SIFT [32] | 0.695 | 0.142 | 0.540 | 0.236
ORB [33] | 0.797 | 0.187 | 0.570 | 0.303
SURF [31] | 0.735 | 0.145 | 0.536 | 0.240
OCR [6] | 0.831 | 0.935 | 0.872 | 0.879

SIFT [32] | 0.891 | 0.825 | 0.862 | 0.857
ORB [33] | 0.904 | 0.830 | 0.871 | 0.865
SURF [31] | 0.879 | 0.820 | 0.854 | 0.848
ResNet50 [44] | 0.957 | 0.907 | 0.938 | 0.931
ResNet18 [44] | 0.896 | 0.888 | 0.893 | 0.892
VGG19 [48] | 0.976 | 0.910 | 0.930 | 0.942
VGG16 [48] | 0.950 | 0.900 | 0.914 | 0.925
MobileNetV2 [49] | 0.896 | 0.900 | 0.898 | 0.898
SqueezeNet [50] | 0.887 | 0.883 | 0.885 | 0.885
InceptionV3 [51] | 0.945 | 0.908 | 0.928 | 0.926

Proposed (No Paragraph Alignment) | 0.725 | 0.975 | 0.788 | 0.814

Proposed (With Paragraph Alignment) | 0.949 | 0.978 | 0.963 | 0.963

Share and Cite

MDPI and ACS Style

Li, D.; Jia, W.; Yu, Q.; Hu, Z. Document Image Verification Based on Paragraph Alignment and Subtle Change Detection. Appl. Sci. 2025, 15, 12430. https://doi.org/10.3390/app152312430
