A Method of Image Quality Assessment for Text Recognition on Camera-Captured and Projectively Distorted Documents

Abstract: In this paper, we consider the problem of identity document recognition in images captured with a mobile device camera. A high level of projective distortion leads to poor quality of the restored text images and, hence, to unreliable recognition results. We propose a novel, theoretically based method for estimating the projective distortion level at a restored image point. On this basis, we suggest a new method of binary quality estimation of projectively restored field images. The method analyzes the projective homography only and does not depend on the image size. The text font and height of an evaluated field are assumed to be predefined in the document template. This information is used to estimate the maximum level of distortion acceptable for recognition. The method was tested on a dataset of synthetically distorted field images. Synthetic images were created based on document template images from the publicly available dataset MIDV-2019. In the experiments, the method shows stable predictive values for different strings of one font and height. When used as a pre-recognition rejection method, it demonstrates a positive predictive value of 86.7% and a negative predictive value of 64.1% on the synthetic dataset. A comparison with other geometric quality assessment methods shows the superiority of our approach.


Introduction
The object recognition problem, which has been extensively studied in past decades, has a wide range of real applications. The accuracy of recognition systems is always the first priority. However, the error cost largely depends on the problem's specifics or on the particular application of the developed system. In many areas, such as identity verification [1][2][3], self-driving vehicles [4,5], and industrial diagnostics [6,7], incorrect recognition can cause financial loss or even harmful health outcomes. For this reason, the prediction of recognition reliability is vital for such systems. Obtaining an uncertain result should lead to the rejection of the image processing output or the transfer of control back to the user to prevent unfortunate situations. Therefore, modern recognition systems include different types of reliability assessment modules. There are three main approaches to reliability evaluation: recognition confidence analysis, pixel-based image quality assessment, and geometric image quality assessment. These approaches work at different recognition stages and, of course, can be applied together.
The first approach involves the estimation of the recognition confidence provided by the recognition module. Such systems aggregate the confidences of all recognized objects, such as text lines, and decide to accept or reject the recognition result depending on the error cost [8]. However, these methods have several problems. Geometric quality assessment methods, in turn, often lack a clear connection to the level of projective distortion and recognition accuracy: for example, when the relative area of a recognized object is considered, images restored from regions of the same relative area may have significantly different quality (Figure 2). In this paper, we propose a novel, no-reference method for the quality assessment of images restored from projectively distorted sources. The image quality is considered in terms of the probability of correct text recognition. The proposed method was tested experimentally on synthetic data created from the publicly available dataset MIDV-2019 [27].

Document Image Quality Assessment Problem Statement
We consider the problem of document recognition in images obtained with a mobile camera. We use the pinhole camera model (Figure 3), so the camera is assumed to have no optical aberrations. Given that the document is a flat rectangular object, the document image is affected by projective distortion [28], and the document boundary is a quadrangle. Document recognition systems commonly consist of several submodules: document localization in a source image, segmentation of required zones such as text and photo fields, and field image restoration and recognition ( Figure 4). Considering the field segmentation step, the majority of the systems utilize document models. There are three general classes of models: templates, flexible forms, and end-to-end models. Templates define the strictest constraints on the location of each zone and are most commonly used for identity documents. In [25,29,30], document templates are used for the localization and classification of document images, but they are also helpful in field segmentation [31]. Flexible form models are based on text segmentation and recognition result analysis and describe documents with soft restrictions on their structure. This model may contain text feature points [32] or attributed relational graphs [33] as a structural representation of a document. End-to-end models involve the simultaneous segmentation and recognition of text field regions [34] and may not require any document structure.  We consider identity document recognition systems such as [2] based on the template description of documents. For many identity documents, the regions of text and photo fields are fixed, and text fonts and font properties (size, boldness, etc.) for each of them are known. This information may be inserted into the document template description and used to further assess field image quality.
We assume that the results of the field segmentation are provided as a quadrangle in the source image. According to its coordinates, the field image should be restored and recognized. The main goal of this paper is to assess the quality of the restored image in terms of the reliability of recognition before the restoration itself (see Figure 5). If the quality is insufficient, then the system ceases further processing to prevent false recognition results. Moreover, the possibility of early rejection decreases the runtime of the system, as the restoration and recognition are not performed on images of low quality. Based only on the source field quadrangle and a priori information on its size and font, the restored field image can be assessed by relying on known properties of further submodules. We briefly discuss the submodules below.
The restoration submodule resamples the source image according to the projective transform, which maps the quadrangle of the field in the source image to the rectangle of a predefined size in the document model. The resampling process is usually characterized by interpolation and anti-aliasing methods. Given that, under a high level of projectivity, the field image may have an arbitrarily small area and consequently may not be recognizable, in this work, we focus only on the magnification problem, when the mapping magnifies a source region. In this particular case, anti-aliasing methods can be excluded, as after magnification, the restored image cannot contain high frequencies. The most well-known interpolation methods [35] are the nearest neighbor, bilinear, bicubic, and cubic B-spline methods. For all of the mentioned interpolation methods, except the nearest-neighbor algorithm, a small source area causes blur in the restored area, as shown in Figure 2. The images obtained by nearest-neighbor interpolation have comparatively low quality (see Figure 6), so we exclude this method from consideration.
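The restoration step described above can be illustrated with a minimal NumPy sketch (function names are ours, not from the paper): each pixel (x, y) of the restored rectangle is pulled from the source point (u, v) = H(x, y) with bilinear interpolation.

```python
import numpy as np

def bilinear_sample(img, u, v):
    """Sample a grayscale image at real-valued coordinates (u, v)
    using bilinear interpolation (reconstruction window radius R = 1)."""
    h, w = img.shape
    u0 = int(np.clip(np.floor(u), 0, w - 2))
    v0 = int(np.clip(np.floor(v), 0, h - 2))
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * img[v0, u0]
            + du * (1 - dv) * img[v0, u0 + 1]
            + (1 - du) * dv * img[v0 + 1, u0]
            + du * dv * img[v0 + 1, u0 + 1])

def restore(img_src, H, out_h, out_w):
    """Restore a field image: each output pixel (x, y) is pulled from the
    source point (u, v) = H(x, y), where the 3x3 matrix H maps restored
    coordinates to source coordinates."""
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            denom = H[2, 0] * x + H[2, 1] * y + H[2, 2]
            u = (H[0, 0] * x + H[0, 1] * y + H[0, 2]) / denom
            v = (H[1, 0] * x + H[1, 1] * y + H[1, 2]) / denom
            out[y, x] = bilinear_sample(img_src, u, v)
    return out
```

With the identity homography, the sketch reproduces the source image exactly; with a strongly contracting H, neighboring output pixels read nearly the same source samples, which is precisely the blur effect discussed in the text.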
The results presented in [21,36] show that the presence of blur in text field images decreases the quality of recognition. It should be noted that the recognition submodule may contain a pre-processing step that refines the image using a deblurring method, for example, [37]. However, its scope is limited, and a level of blurring exists under which the text field cannot be reliably recognized.  Assuming that the image restoration and the text recognition submodules can be predefined, we can use the given font and size of the field to estimate the maximum local distortion level that provides stable recognition of any text. We assume that the level can be evaluated as a rational value and denote the threshold distortion level as θ ∈ R. In this work, however, it is more convenient for us to use the inverse value l ∈ R, l = 1/θ, which we call the minimum scaling coefficient threshold. It should be noted that this threshold value is presumed to be evaluated once while developing the recognition system. Let us denote the source image as I_src, the segmented field quadrangle, i.e., four points of its corners in the source image, as F, and the rectangle of the restored image borders, defined in the document model, as R. We need to estimate whether the quality of the restored field image I_rst is sufficient in terms of reliability for further text recognition. For this purpose, let us denote the quality assessment function as Q. Q analyzes the source field quadrangle F, the restored field rectangle R, and the a priori minimum scaling coefficient threshold l and returns 1 if the image quality allows for reliable recognition and 0 otherwise: Q(F, R, l) ∈ {0, 1}, F ∈ F, R ∈ R, where F is the set of all quadrangles lying inside the source image and R is the set of all possible rectangles. The function Q does not take the restored image itself as an argument. The evaluation process here is assumed to involve the analysis of a geometric transform rather than pixel intensities.
Therefore, the quality assessment can be conducted before the use of the restoration submodule ( Figure 3).

The Models of Distorted Field Image Acquisition and Restoration
First, let us briefly describe the model of projectively distorted text field image acquisition [35]. For simplification, we consider the one-dimensional case. Let us define the undistorted field image signal as a continuous bounded function I(x): 0 ≤ I(x) ≤ B, where B is the upper bound of I(x). While being captured with a camera, the signal is distorted with a projective transform u = H(x): I_src^c(u) = I(H^(-1)(u)), where I_src^c(u) is the continuous projectively distorted signal. Then, the signal I_src^c(u) is sampled by a function s(u) with a known sampling pitch ∆u_s to obtain a discrete image I_src(k), k ∈ Z: I_src^d(u) = I_src^c(u) s(u), I_src(k) = I_src^d(k∆u_s), where I_src^d(u) is the sampled distorted signal defined on R. We consider ideal sampling with the following: s(u) = Σ_{k ∈ Z} δ(u − k∆u_s), where δ(u) is the Dirac delta function. The image I_src(k) is the input of the recognition system. Before the final text recognition, the image should be restored to compensate for the projective distortion. In the image restoration process, the image is resampled with the inverse of the original projective transform x = H^(-1)(u). This transformation can be evaluated based on the source field quadrangle F obtained in the field segmentation step and the rectangle of the restored field R defined by the template description: R = H^(-1)(F). The resampling model is as follows. The discrete image I_src(k) is reconstructed to obtain a continuous signal I_src^c(u) through convolution with a reconstruction filter r(u): I_src^c(u) = Σ_{k ∈ Z} I_src(k) r(u − k∆u_s). After that, the domain of the continuous signal I_src^c(u) is warped with the projective transform x = H^(-1)(u): I_rst^c(x) = I_src^c(H(x)), where I_rst^c(x) is the restored continuous signal. Depending on the mapping function H^(-1)(x), I_rst^c(x) may have arbitrarily high frequencies. To conform to the Nyquist rate, the signal should be bandlimited by a prefilter function h(x) that prevents aliasing: Î_rst^c(x) = (I_rst^c ∗ h)(x), where Î_rst^c(x) is the bandlimited restored signal and ∗ denotes convolution.
Then, the obtained signal is sampled with the same sampling pitch ∆x_s = ∆u_s: I_rst^d(x) = Î_rst^c(x) s(x), I_rst(j) = I_rst^d(j∆x_s), where I_rst^d(x) is the sampled restored signal on R and I_rst(j) is the discrete restored signal. In this paper, we consider only the magnification case, when the source region is stretched: |H(x + ∆x_s) − H(x)| ≤ ∆u_s ∀x. In this scenario, the signal mapping cannot produce high frequencies. Therefore, the prefilter has little impact on the restored image signal and can be ignored. Then, the restored image I_rst(j) is as follows: I_rst(j) = I_src^c(H(j∆x_s)) = Σ_{k ∈ Z} I_src(k) r(H(j∆x_s) − k∆u_s) (11). Let us define the sample pitches as equal to 1: ∆x_s = ∆u_s = 1. For simplicity, we refer to discrete images I_src(k) and I_rst(j) as I_src(u) and I_rst(x) and specify u, x ∈ Z. Then, Formula (11) can be rewritten as follows: I_rst(x) = Σ_{k ∈ Z} I_src(k) r(H(x) − k). The ideal reconstruction filter r(u), u ∈ R, is an ideal low-pass filter sinc(x) = sin(πx)/(πx) according to the cardinal theorem of interpolation [39]. However, in practice, one uses its approximations with a finite window radius R: r(u) = 0 for |u| > R. The bilinear reconstruction function has a finite window of radius R = 1, and the bicubic B-spline and bicubic reconstruction functions have finite windows of radius R = 2.
We also assume that the reconstruction function is Lipschitz continuous with constant M: |r(x) − r(y)| ≤ M|x − y| ∀x, y ∈ R (14). Hypothesis 1. The bilinear, bicubic B-spline, and bicubic reconstruction functions (see Figure 7) are Lipschitz continuous.
Let us consider the bilinear reconstruction function, which satisfies (14), where r_l(u) is defined as follows: r_l(u) = 1 − |u| for |u| ≤ 1, and r_l(u) = 0 otherwise.

Lemma 1. The bilinear interpolation function r_l(u) is Lipschitz continuous.
Proof. Let us consider a pair of arbitrary points x, y. Due to the piecewise nature of r_l(u), we have three cases.
Case 1: ∀x, y ∉ (−1, 1). Both values are zero, so |r_l(x) − r_l(y)| = 0 ≤ |x − y|.
Case 2: ∀x, y ∈ (−1, 1). By the reverse triangle inequality, we can obtain |r_l(x) − r_l(y)| = ||y| − |x|| ≤ |x − y|.
Case 3: x ∈ (−1, 1), y ∉ (−1, 1) (or vice versa). Let z be the point of the boundary {−1, 1} lying between x and y. Then, |r_l(x) − r_l(y)| = |r_l(x) − r_l(z)| ≤ |x − z| ≤ |x − y|.
Hence, the bilinear reconstruction function r_l(u) is Lipschitz continuous with the constant M = 1.
The bicubic B-spline and bicubic reconstruction functions are shown in Figure 7b,c. We can see that they are continuous and have a bounded value increment and are thus Lipschitz continuous. The direct proof of Hypothesis 1 falls outside of the scope of this work.
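Hypothesis 1 can also be sanity-checked numerically: sampling a kernel on a dense grid and taking the maximum finite-difference slope gives an estimate of its Lipschitz constant. The sketch below (a numerical check, not a proof; function names are ours) does this for the bilinear kernel.

```python
import numpy as np

def r_bilinear(u):
    """Bilinear (triangle) reconstruction kernel: 1 - |u| on [-1, 1], 0 outside."""
    return np.maximum(0.0, 1.0 - np.abs(u))

def lipschitz_estimate(kernel, lo=-2.0, hi=2.0, n=100001):
    """Estimate the Lipschitz constant of a 1-D kernel as the maximum
    absolute finite-difference slope over a dense uniform grid on [lo, hi]."""
    u = np.linspace(lo, hi, n)
    r = kernel(u)
    return float(np.max(np.abs(np.diff(r) / np.diff(u))))
```

For r_bilinear, the estimate approaches M = 1, matching Lemma 1; the same check can be run against bicubic and bicubic B-spline kernels.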

The Minimum Scaling Coefficient Assessment at a Restored Image Point
In [21,36], the authors incorporated an estimation of image blur into algorithms for combining text recognition results in a video stream. Since an unblurred text image has high contrast in regions corresponding to strokes, they assume that the level of image blur is inversely related to the sharpness (called focus in the cited papers), which represents the directional minimum of the highest local contrasts of the image. In these papers, the blur is caused by defocusing or motion blur and is constant for the whole image. The sharpness is calculated based on the intensities in the source image. For this purpose, gradient images are calculated in different directions; for each of them, a 0.95 quantile of the gradient image is obtained, and their minimum represents the sharpness estimation.
In our case, the blurring distortion of the restored image is caused by projective mapping and, hence, is uneven over different points of the image. Let us consider the original undistorted image I(x) and denote its local contrast in a region between neighboring sampling points [x̄, x̄ + ∆x_s] as L(x̄, ∆x_s), 0 < L(x̄, ∆x_s) ≤ B/∆x_s: L(x̄, ∆x_s) = |I(x̄ + ∆x_s) − I(x̄)|/∆x_s. One can verify whether the restored image is able to provide the expected contrast in this region. Let us denote the local contrast of the restored image as L_rst(x̄, ∆x_s). It can be calculated as follows: L_rst(x̄, ∆x_s) = |I_rst(x̄ + ∆x_s) − I_rst(x̄)|/∆x_s = |Σ_{k ∈ K_1} I_src(k) r(H(x̄) − k) − Σ_{k ∈ K_2} I_src(k) r(H(x̄ + ∆x_s) − k)|/∆x_s (20), where I_rst(x) is the discrete restored signal; I_src(k) is the discrete distorted signal; and K_1, K_2 ⊂ Z are the sets of samples in the source image that are used for the reconstruction of samples x̄ and x̄ + ∆x_s, respectively. According to (10), the distance between the points H(x̄) and H(x̄ + ∆x_s) in the source image is less than its sampling pitch: |H(x̄ + ∆x_s) − H(x̄)| ≤ ∆u_s. Then, in the worst case, the points H(x̄) and H(x̄ + ∆x_s) have the same set of samples used for reconstruction, i.e., K_1 = K_2. In that case, the contrast (20) provided by the restored image can be estimated as follows: L_rst(x̄, ∆x_s) ≤ Σ_{k ∈ K_1} I_src(k) |r(H(x̄) − k) − r(H(x̄ + ∆x_s) − k)|/∆x_s ≤ |K_1| B M |H(x̄ + ∆x_s) − H(x̄)|/∆x_s, where |K_1| is the size of set K_1.
If the restored local contrast is much lower than the contrast in the original undistorted image L(x̄, ∆x_s), then the restored image edges are highly blurred or even undetectable. As we can see, the upper bound of the restored image local contrast L_rst(x̄, ∆x_s) depends on the distance between the corresponding points in the source image |H(x̄) − H(x̄ + ∆x_s)|. Thus, the smaller the distance, the higher the level of blur distortion in the considered region. Then, the ratio of the distance between source points to the sampling pitch can be used to estimate the maximum achievable sharpness of the restored region. Let us denote this function as the scaling coefficient s(x̄, ∆x_s): s(x̄, ∆x_s) = |H(x̄ + ∆x_s) − H(x̄)|/∆x_s (23). Above, we considered the one-dimensional case; however, the image is a two-dimensional function. The projective transform of the plane (u, v) = H(x, y) is determined as follows: u = (h_{0,0}x + h_{0,1}y + h_{0,2})/(h_{2,0}x + h_{2,1}y + h_{2,2}), v = (h_{1,0}x + h_{1,1}y + h_{1,2})/(h_{2,0}x + h_{2,1}y + h_{2,2}), where h_{q,w}, q, w ∈ {0, 1, 2} are the coefficients of the projective transform H.
The projective transform maps points unevenly. For a fixed point (x, y) and several shifts (∆x_m, ∆y_m) of one length, |(∆x_m, ∆y_m)| = const ∀m, the distance |H(x, y) − H(x + ∆x_m, y + ∆y_m)| can vary significantly. Since the directions of the text strokes causing high local contrast in the image are arbitrary, the sharpness should be estimated for all possible shifts. Image sampling is conducted with a grid, so the sampling pitch in different directions also varies. However, the function s(x̄, ∆x_s) is a length ratio, so a useful simplification is to consider ∆x_s equal for all directions. It should be noted that, here, we implicitly change the domain of the function s(x̄, ∆x_s) from Z² to R². This can be done because the image function is no longer used, and the projective transform H is defined on the set of real numbers.
Then, the scaling coefficient function s(p̄, ∆x_s) defined in (23) should be rewritten for the two-dimensional case as follows: s(p̄, ∆x_s) = min_{|∆p̄| = ∆x_s} |H(p̄ + ∆p̄) − H(p̄)|/∆x_s. Let us denote this function as the minimum scaling coefficient. The region under consideration is a circle with the center at the point p̄ = (x̄, ȳ) and the radius equal to ∆x_s. The projective transform (u, v) = H(x, y) maps the points of the infinity line l_∞: h_{2,0}x + h_{2,1}y + h_{2,2} = 0 to infinity. If the line crosses or touches the circle, then some points of its inner region become infinite, which is not possible in image restoration. Consequently, we can assume that the circle is not crossed by the l_∞ line and is mapped onto an ellipse. Then, the length a_min of the ellipse semi-minor axis is the minimum distance between pairs of projected points: a_min = min_{|∆p̄| = ∆x_s} |H(p̄ + ∆p̄) − H(p̄)|. Since ∆p̄ is assumed to be small, one can locally approximate the projective transform H with an affine transform. In this approach, it can be shown that, for a unit circle, the lengths of the ellipse semi-axes are equal to the roots of the eigenvalues λ_min and λ_max of the matrix J̄ᵀJ̄, where J̄ is the Jacobian matrix of the transform H at the point p̄ [40]. Then, for the circle with the radius ∆x_s, the lengths of the semi-minor and semi-major axes for the restored point p̄, a_min and a_max, respectively, are calculated as follows: a_min = ∆x_s √λ_min, a_max = ∆x_s √λ_max. It should be noted that the points on the infinity line l_∞ become infinite under the transformation, so the eigenvalues are not defined on this line. Then, the length function domain is R² \ l_∞.
It is a well-known fact that the eigenvalues are the roots of the characteristic equation. Then, the lengths of the semi-minor and semi-major axes can be calculated as follows: a_min = ∆x_s √((tr(J̄ᵀJ̄) − √(tr²(J̄ᵀJ̄) − 4 det(J̄ᵀJ̄)))/2), a_max = ∆x_s √((tr(J̄ᵀJ̄) + √(tr²(J̄ᵀJ̄) − 4 det(J̄ᵀJ̄)))/2). One can derive the values of the trace and the determinant of the matrix J̄ᵀJ̄ expressed in terms of the coefficients of the homography H. In this work, we use only the value of the semi-minor axis length. However, the other lengths may be helpful in the problem of image decimation estimation. To illustrate the behavior of the semi-minor and semi-major axis length functions, we constructed heatmaps for a synthetic example. An arbitrary source quadrangle F (Figure 8a) and a restored rectangle R (Figure 8b,c) were used to estimate the semi-minor (Figure 8b) and semi-major (Figure 8c) axis lengths at grid points on the restored plane. As we can see, the values increase as we approach the infinity line l_∞, shown as a blue line in the figure. The region inside rectangle R with semi-minor axis lengths less than the threshold appears to be connected.
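The semi-axis computation above follows directly from the Jacobian of the homography. A short NumPy sketch (our own helper names, ∆x_s = dx) that evaluates the Jacobian at a point and returns the semi-minor and semi-major axis lengths:

```python
import numpy as np

def homography_jacobian(H, x, y):
    """Jacobian of the projective map (u, v) = H(x, y) at a point (x, y),
    obtained by differentiating the two rational coordinate functions."""
    num_u = H[0, 0] * x + H[0, 1] * y + H[0, 2]
    num_v = H[1, 0] * x + H[1, 1] * y + H[1, 2]
    w = H[2, 0] * x + H[2, 1] * y + H[2, 2]
    return np.array([
        [(H[0, 0] * w - num_u * H[2, 0]) / w**2, (H[0, 1] * w - num_u * H[2, 1]) / w**2],
        [(H[1, 0] * w - num_v * H[2, 0]) / w**2, (H[1, 1] * w - num_v * H[2, 1]) / w**2],
    ])

def axis_lengths(H, x, y, dx=1.0):
    """Semi-minor and semi-major axis lengths of the image of a circle of
    radius dx around (x, y): dx times the square roots of the eigenvalues
    of J^T J (eigvalsh returns them in ascending order)."""
    J = homography_jacobian(H, x, y)
    lam = np.linalg.eigvalsh(J.T @ J)
    return dx * np.sqrt(lam[0]), dx * np.sqrt(lam[1])
```

For an affine H, the Jacobian (and hence both lengths) is the same at every point, matching the remark below about the affine case.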
This function can be used to estimate the local sharpness at each point of the restored image and is directly related to the local image quality. It should be noted that, if the transformation H is affine, i.e., h_{2,0}² + h_{2,1}² = 0, then the Jacobian matrix and the minimum scaling coefficient are constant for the whole plane. Thus, only one value at an arbitrary point needs to be calculated.

The Proposed Method of Projectively Distorted Image Quality Assessment
Next, we define the quality assessment method Q, which provides a binary estimation of the whole restored image in terms of recognition reliability. Considering that incorrect recognition of any character leads to incorrectness of the whole recognized field text, the image quality can be estimated according to the region with the lowest quality.
For this purpose, we can estimate the maximum level of local distortion θ that enables stable recognition of the restored image. The threshold depends on the recognition subsystem and on the chosen interpolation algorithm. Since the function s(p̄) is inversely proportional to the local distortion level, for simplification, we use the minimum scaling coefficient threshold l, which is the inverse of the level of distortion θ: l = 1/θ. Then, we can construct the level curve of the minimum scaling coefficient function as follows: {p̄ : s(p̄, ∆x_s) = l} (33). If the level curve intersects the restored rectangle R, then one of the two corresponding parts of the restored field image is not recognized reliably. Otherwise, if there is no intersection, we can calculate the value for one arbitrary point inside the rectangle to check whether the whole restored image has sufficient quality.
According to (31) and (30), the level curve (33) can be rewritten in terms of the coefficients of the homography H (Equation (34)). This equation holds for both the minimum scaling coefficient function s(p̄) and the maximum scaling coefficient function s_max(p̄), which is defined as the ratio of the semi-major axis length to the sampling pitch ∆x_s: s_max(p̄) = a_max/∆x_s. If the s_max(p̄) branch intersects the rectangle, then both of its parts have low quality. For simplification, Equation (34) is translated to the new coordinate system by the transformation T: (X, Y) = T(x, y) = (h_{2,1}x − h_{2,0}y, h_{2,0}x + h_{2,1}y + h_{2,2}) (36).
Under this transform, the infinity line l_∞ is mapped to the line Y = 0. After the substitution of (36) into the level curve in Equation (34), we obtain Equation (37), where γ = h_{2,0}c_1 + h_{2,1}c_2, δ = h_{2,0}c_3 + h_{2,1}c_4, and α, β, c_1, c_2, c_3, c_4 are defined in (30). As we can see, Equation (37) is quadratic in terms of X and, hence, symmetric. Then, we can approximate it by a piecewise linear curve. For this purpose, the minimum and maximum Y values of the rectangle R are calculated. After that, we choose several values Y_i, i ∈ {0, ..., n − 1}, Y_i ≠ 0, between them and, for each Y_i, calculate the two corresponding X coordinates of the curve according to Equation (37). We should also take into account that, for the semi-minor and semi-major axis lengths, both branches of the curve may intersect the rectangle simultaneously. In order to construct the curve approximation correctly, we need to separate the points related to different branches. Then, moving along the Y-axis, for each Y_i value, we compare the corresponding discriminant D_i with zero. If it is positive, then the obtained points lie on one branch. If the discriminant for a Y_i value is equal to zero, then there is a turning point in the current branch, and the following values Y_{i+k}, k ∈ {1, ..., n − i − 1}, relate to another branch of the curve. Similarly, a negative discriminant D_i implies a gap between branches, and the points calculated for further values Y_{i+k} lie on another branch.
As soon as the curve is obtained, we should decide whether the considered field quality is sufficient. There are several possible approaches. For example, we can calculate the ratio of sufficient and insufficient quality areas inside the restored rectangle. However, in this work, we mark the quality of the whole image as insufficient if there is a low-quality region of any area. The whole procedure for evaluating the restored image quality has O(1) complexity because it is not dependent on the input image size but only on the number of points in the curve approximation, which we assume to be predefined. The procedure is summarized in Algorithm 1.
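As a cross-check of the decision rule, one can replace the O(1) level-curve analysis with a brute-force variant that samples the minimum scaling coefficient on a grid inside R and rejects if any sampled value falls below the threshold (a simplified sketch with our own names; unlike Algorithm 1, its cost grows with the grid size):

```python
import numpy as np

def min_scaling_coeff(H, x, y):
    """Minimum scaling coefficient at a restored point (x, y): the square
    root of the smallest eigenvalue of J^T J, where J is the Jacobian of
    the homography H (mapping restored to source coordinates)."""
    num_u = H[0, 0] * x + H[0, 1] * y + H[0, 2]
    num_v = H[1, 0] * x + H[1, 1] * y + H[1, 2]
    w = H[2, 0] * x + H[2, 1] * y + H[2, 2]
    J = np.array([
        [(H[0, 0] * w - num_u * H[2, 0]) / w**2, (H[0, 1] * w - num_u * H[2, 1]) / w**2],
        [(H[1, 0] * w - num_v * H[2, 0]) / w**2, (H[1, 1] * w - num_v * H[2, 1]) / w**2],
    ])
    return np.sqrt(np.linalg.eigvalsh(J.T @ J)[0])

def quality_ok(H, rect, l, grid=20):
    """Binary quality decision: 1 if the minimum scaling coefficient over
    the restored rectangle rect = (x0, y0, x1, y1) stays at or above the
    threshold l, else 0."""
    x0, y0, x1, y1 = rect
    s_min = min(min_scaling_coeff(H, x, y)
                for x in np.linspace(x0, x1, grid)
                for y in np.linspace(y0, y1, grid))
    return int(s_min >= l)
```

For an affine H, one grid point would suffice, as noted earlier; the grid only matters for genuinely projective transforms, where the coefficient varies across R.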
Algorithm 1 Quality assessment of a projectively distorted field quadrangle.
Input: F: the field quadrangle in the source image; R: the rectangle of the restored field; l: the minimum scaling coefficient threshold; n: the number of vertices of the curve approximation.

Experimental Results
In this section, the experimental results obtained using the proposed algorithm for the quality assessment of projectively distorted field images are presented and compared with the performance of the algorithm described in [25]. In the recognition system workflow, in order to obtain a quadrangle of the field to be restored and recognized, we need to perform document localization and field segmentation. When evaluating quality assessment methods, we had to eliminate the errors that occur in these stages. For this purpose, datasets that provide ground truth for field quadrangles are commonly used. To the best of our knowledge, the only publicly available dataset with at least mild projective distortions is MIDV-2019 [27]. However, a preliminary experiment showed that it does not contain images with enough projectivity to produce insufficient restored image quality. For this reason, we created a dataset with synthetically distorted images of text fields.

Data Generation
In order to generate the data, we used the MIDV-2019 dataset. This dataset contains 50 different types of annotated identity documents (ID cards, passports, driving licenses, etc.). It consists of 50 template images (original high-quality document images used for creating physical document copies, one per document type) and video clips of these documents acquired in different conditions. An example of a template document is shown in Figure 9.
All of the images were annotated manually. The video frames have a ground truth for their type and document quadrangle. The template images have a ground truth description consisting of field rectangles and their text content. We considered template images only and scaled them to 300 dpi to obtain comparable pixel sizes for all documents. We used ground truth field rectangles to extract undistorted images of text fields with an additional 10% margin of their size. We only considered numeric fields and fields written with the Latin alphabet: dates, document numbers, machine-readable zone (MRZ) lines, document holder name, and surname. We recognized text in the obtained field images with Tesseract Open Source OCR Engine 4.1.1, which employs an LSTM neural network [41]. Incorrectly recognized fields were eliminated from further processing. In our experiments, we used 184 fields collected from all document templates. Since the text in the fields may have different fonts, font sizes, and other properties, we considered them separately in our experiments. Here, we describe synthetic data generation for one field.
We denote an original image of a field f as D_f and a rectangle bounding the field as R_f.
To test our algorithm, we generated a set of N projectively distorted field images {I^i_src,f}, i = 1..N, and corresponding projective transforms {H^i_f}, i = 1..N. In order to generate a distorted quadrangle F^i_f, we added random shifts to the corners of R_f. Then, the quadrangle F^i_f was downscaled to approximately the same size as the original field image for a more representative dataset. We also ensured that the obtained distorted quadrangle F^i_f and the corresponding quadrangle of the whole distorted document were convex. Then, the homography H^i_f was calculated, and the original field image was transformed to obtain the distorted field image: I^i_src,f = H^i_f(D_f). Algorithm 2 shows the procedure of distorted image generation.
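The two core steps of this generation procedure, shifting rectangle corners and solving for the homography from the four point correspondences, can be sketched as follows (a direct linear transform with h_{2,2} fixed to 1; the rescaling and convexity checks of Algorithm 2 are omitted, and the function names are ours):

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve for the 3x3 homography H mapping each src point (x, y) to the
    corresponding dst point (u, v), with h22 fixed to 1 (8 unknowns,
    8 equations from 4 point pairs)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def distort_quad(rect, max_shift, rng):
    """Shift the corners of rect = (x0, y0, x1, y1) by independent uniform
    offsets in [-max_shift, max_shift] to form a distorted quadrangle
    (corners ordered top-left, top-right, bottom-right, bottom-left)."""
    x0, y0, x1, y1 = rect
    corners = np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]], float)
    return corners + rng.uniform(-max_shift, max_shift, size=(4, 2))
```

The solved H then warps the original field image D_f onto the distorted quadrangle; in practice, one would also re-reject non-convex outputs as the text describes.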
Then, the restoration process was conducted. The distorted images {I^i_src,f}, i = 1..N, were rectified with projective transforms that map their bounding quadrangles F^i_f to the rectangles R_f: I^i_rst,f = (H^i_f)^(-1)(I^i_src,f). The projective mapping of images was conducted using the bilinear interpolation method.
Finally, we generated the ground truth for our problem of binary quality assessment. We consider it to be a binary classification problem, with a positive case when «the field image is recognizable» and a negative case otherwise. Thus, we used Tesseract to recognize the restored field images I^i_rst,f and compared the results with the annotation from MIDV-2019. If the recognition was correct, then the restored image was marked as recognizable.

Performance Metrics
To evaluate the performance of quality assessment algorithms, we calculated the positive and negative predictive values, PPV and NPV, respectively, as follows: PPV = TP/(TP + FP), NPV = TN/(TN + FN), where TP is the number of true-positive samples (restored field images were correctly recognized by Tesseract and marked as recognizable by the quality assessment algorithm under evaluation), TN is the number of true-negative samples (fields were not recognized by Tesseract and marked as non-recognizable by the algorithm), FP is the number of false-positive samples (fields were not correctly recognized by Tesseract but marked as recognizable by the algorithm), and FN is the number of false-negative samples (fields were correctly recognized by Tesseract but marked as non-recognizable by the algorithm).
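These metrics reduce to a few counts over paired label lists; a minimal helper (names are ours) for binary labels where 1 means «recognizable»:

```python
def predictive_values(predicted, actual):
    """Compute PPV = TP/(TP+FP) and NPV = TN/(TN+FN) from binary labels:
    predicted comes from the quality assessment algorithm, actual from the
    OCR result (1 = correctly recognized)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return ppv, npv
```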
We also had to ensure the balance of the data used to evaluate the algorithm. The decision made by the proposed quality assessment algorithm Q depends on the minimum scaling coefficient threshold l. Hence, the probability of randomly generating a sample predicted to be positive or negative varies when l changes. To overcome this issue, for each l, we took 1000 restored field images marked as positive and 1000 restored field images marked as negative by the algorithm.
Algorithm 2 Input: T(A_t, B_t, C_t, D_t): a bounding rectangle of the whole undistorted document, where A_t, B_t, C_t, and D_t are the points of its corners from top left to bottom left, clockwise; N: the number of samples to generate. Output:

Behavior of the Proposed Method for Fields of Same and Different Fonts
In the framework of the first experiment, we estimated the variations in the PPV and NPV for the proposed algorithm Q, depending on the minimum scaling coefficient threshold l. We calculated the PPV and NPV functions separately for each field f . We varied the threshold l values from 0.075 to 0.9 with a step of 0.025. For each threshold l, we generated 1000 positively and 1000 negatively marked images and calculated the predictive values. The parameter n of the algorithm Q that defines the vertex number of the level curve approximation was set to 100. Figure 10 shows an example of the estimated PPV and NPV curves that were calculated for several text fields of the new Austrian driving license document, which is shown in Figure 9.
While developing the assessment method, we assumed that the threshold is equal for all characters of one font. Thus, the predictive value functions should be close for different fields of one font and may vary if the font or font properties (size, boldness, etc.) are changed. As we can see, the curves for the date fields with the same font (Figure 10a-c) show almost equal predictive values, as was expected. This means that we can estimate the valid threshold for all possible text fields of one font in advance. At the same time, PPV and NPV differ for a document number field that has a bold font (Figure 10d). Comparing them, we can infer that bold text can be more projectively distorted while still being reliably recognized. Thus, the minimum scaling coefficient threshold should be chosen separately for each font and font property.
For all fields, the specific behavior of the curves is similar. The greater the l, the sharper the restored image should be to be marked as «recognizable». Indeed, in Section 2, we define the minimum scaling coefficient threshold to be inverse to the level of distortion θ. As the threshold l increases, rejection occurs at a lower level of distortion. The threshold value can be chosen according to the cost of false-positive and false-negative errors. In the case of equal cost, the PPV and NPV are higher than 80% for all four considered fields.
It should be noted that the obtained predictive value curves are non-monotonic. This effect occurs because OCR accuracy is not strictly monotonic in the projective distortion level. However, the tendency toward reduced recognition accuracy with growing distortion is evident.

Recognition System Simulation
In the second experiment, we estimated the recognition system's performance with the incorporated reject submodule. We compared the results obtained for the proposed algorithm with the rejection criterion presented in [25], which assesses the whole distorted document quadrangle. In addition, we estimated the same algorithm applied to each field quadrangle separately.
The geometric criterion presented in [25] is based on the analysis of the quadrangle angles. The document quadrangle is rejected if it does not satisfy all of the following conditions:
1. At least one pair of opposite edges is parallel with a tolerance of 5°.
2. The average difference between each pair of opposite angles is less than 10°, where Â, B̂, Ĉ, and D̂ are the angles of the quadrangle defined in the range [0°, 180°].
3. The average deviation of the four corners from perpendicularity is less than 25°.

In order to estimate the system performance and to avoid errors that may occur in the document localization and segmentation stages, we synthesized distorted field images, as described in Section 6.1. Before evaluating the performance of the proposed algorithm, we automatically estimated the field thresholds as follows. Each of the 184 original field images D_f was gradually and uniformly downscaled from 0.9 to 0.1 of its size with a step of 0.025. The smallest scale that still provided a correct recognition result was chosen as the threshold l_f. Then, for each field f and threshold l_f, we generated 1000 positively and 1000 negatively marked restored field images. The parameter n of the proposed algorithm, which defines the number of vertices of the level curve approximation, was set to 100. The positive images of all fields were combined into an overall positive set of size 184,000; the overall negative set was obtained similarly. The restored images of both sets were recognized using Tesseract, and the cumulative PPV and NPV values were calculated.
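The three geometric conditions of the criterion from [25] can be sketched as follows. This is our interpretation of the angle definitions, with the tolerances exposed as parameters; it is not the authors' reference implementation:

```python
import numpy as np

def edge_directions(quad):
    """Unit direction vectors of the four edges of quad (4x2, vertices in order)."""
    q = np.asarray(quad, dtype=float)
    d = np.roll(q, -1, axis=0) - q
    return d / np.linalg.norm(d, axis=1, keepdims=True)

def interior_angles(quad):
    """Interior angles in degrees, each in [0, 180]."""
    q = np.asarray(quad, dtype=float)
    prev = np.roll(q, 1, axis=0) - q    # vector toward previous vertex
    nxt = np.roll(q, -1, axis=0) - q    # vector toward next vertex
    cos = np.sum(prev * nxt, axis=1) / (
        np.linalg.norm(prev, axis=1) * np.linalg.norm(nxt, axis=1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def accept_quadrangle(quad, par_tol=5.0, opp_tol=10.0, perp_tol=25.0):
    """Accept quad only if all three conditions of the criterion hold."""
    d = edge_directions(quad)

    def angle_between(u, v):
        # Angle between lines (direction sign ignored), in degrees.
        return np.degrees(np.arccos(np.clip(abs(np.dot(u, v)), -1.0, 1.0)))

    # 1. At least one pair of opposite edges parallel within par_tol.
    parallel = (angle_between(d[0], d[2]) < par_tol or
                angle_between(d[1], d[3]) < par_tol)
    a = interior_angles(quad)
    # 2. Average difference between opposite angles below opp_tol.
    opposed = (abs(a[0] - a[2]) + abs(a[1] - a[3])) / 2.0 < opp_tol
    # 3. Average deviation of the corners from 90 degrees below perp_tol.
    perp = np.mean(np.abs(a - 90.0)) < perp_tol
    return bool(parallel and opposed and perp)
```

A perfect rectangle passes all three checks, while a strongly skewed quadrangle fails the parallelism condition first.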
In the experiments conducted to evaluate the algorithm in [25], we used two versions of the criterion. The first, original version assesses the document quadrangle and, thus, ceases further processing of all document fields simultaneously. Additionally, we evaluated the strategy of applying the criterion to each distorted field quadrangle separately. For both versions, we used the same data generation and performance evaluation processes, except that the sets of 1000 images predicted to be recognizable and 1000 images predicted to be unrecognizable were constructed according to the algorithm under evaluation.
The results of the conducted experiments are shown in Table 1. It can be seen that the thresholds of the algorithm in [25] were defined under the assumption of a much higher cost of a false-positive error. Nevertheless, the proposed algorithm outperforms both versions of the algorithm from [25] not only in NPV but also in PPV. Examples of false-positive and false-negative field images for the proposed method are shown in Figures 11 and 12, respectively. As we can see, for some of the false positives, the recognition error is due to the OCR submodule, while the images themselves are easily readable. Among the false-negative images, the level of corruption differs: field (b) is barely recognizable, while field (e) has adequate sharpness. The main reason is that we estimate the minimum possible sharpness over all directions; however, if the image is scaled orthogonally to the stroke, the blurring effect is small, as seen in Figure 12e.

Another possible source of errors is the chosen approach to threshold estimation. Due to errors of the recognition submodule, the minimum scaling coefficient threshold may be overestimated for some fields. Moreover, in real applications, the text of a considered document field differs from the text in the template image, whereas the current threshold estimation method accounts for only one possible text version. Thus, a more stable approach to threshold estimation needs to be developed to increase the performance of the algorithm. Nevertheless, the presented results show that the proposed algorithm for text field quality assessment can already be successfully exploited for recognition reliability prediction.
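The per-field threshold estimation procedure described above can be sketched as follows. The OCR step is injected as a callable so that, in practice, it could wrap image downscaling (e.g. OpenCV resize with INTER_AREA) followed by Tesseract; the function name and interface are our assumptions for illustration:

```python
def estimate_threshold(recognize_at_scale, true_text, scales=None):
    """Estimate the minimum scaling coefficient threshold l_f for one field.

    recognize_at_scale(s) is expected to downscale the undistorted field
    image by factor s and return the OCR output string for the result.
    Returns the smallest scale whose OCR output equals true_text, or None.
    """
    if scales is None:
        # 0.9 down to 0.1 with a step of 0.025, as in the experiment.
        scales = [round(0.9 - 0.025 * i, 3) for i in range(33)]
    ok = [s for s in scales if recognize_at_scale(s) == true_text]
    return min(ok) if ok else None
```

All scales are scanned rather than stopping at the first failure because, as noted above, OCR accuracy is not strictly monotonic in the distortion level.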

Conclusions
In this paper, we consider the problem of quality assessment of a field image restored from a projectively distorted source document image. The quality is interpreted in terms of text recognition reliability. The results show that, by using a priori information about the field font, the restored field image quality can be estimated based only on the analysis of the projective transform. We present a theoretically based method for evaluating the distortion level at a point in the restored image. Moreover, we propose a novel binary quality assessment algorithm whose running time does not depend on the image size, i.e., it has O(1) complexity with respect to the number of pixels. We also discuss the model of the reject submodule embedded in the document recognition system.
The algorithm was tested on synthetically distorted field images. The dataset was created based on document template images from the publicly available dataset MIDV-2019. According to the obtained results, the algorithm provides nearly identical predictive value curves, both positive and negative, for different text strings of the same font and font size. For dissimilar fonts, these curves differ. Thus, the assumption is confirmed that the maximum level of distortion that enables reliable recognition depends on the font of the recognized text. Therefore, the threshold of the algorithm can be estimated in advance for each font, regardless of the text that may occur in the input distorted field images.
In the experiment evaluating the performance of the reject submodule, we compared the proposed algorithm with the rejection criterion presented in [25]. That criterion is designed to assess the whole document quadrangle and, therefore, to reject or accept all document fields simultaneously. Additionally, we applied the same criterion separately to each distorted field image. The thresholds for the proposed algorithm were estimated in advance for each field by iteratively downscaling the undistorted field image and recognizing the result. The results show the superiority of our algorithm. The cumulative positive predictive value (PPV) for the proposed algorithm equals 86.7%, which is 7.5% higher than the best PPV of the compared algorithms. The cumulative negative predictive value (NPV) estimated for our algorithm is 64.1%, which is 39.5% higher than the best value among the other algorithms.
It should be noted that the proposed method may be exploited beyond the reject submodule. Another possible application is in combination methods for text field recognition in video streams. The binary quality estimate can be used to reweight the confidence of the recognition result for a single frame. Moreover, as the method also provides the level curve that bounds the low-quality region, it can be used to reevaluate the confidence of each recognized character according to its location. This may increase video stream recognition accuracy.
For future work, a more stable method of threshold estimation should be developed. It should analyze recognition correctness after restoration from different levels of projective distortion, rather than only scaling transformations, and should cover the whole alphabet of the considered font to provide a stable threshold for all possible text strings in the field. Additionally, an experimental comparison should be conducted for the approaches to image quality estimation based on the constructed level curve; the current approach may be improved by relying on the ratio between the sufficient and insufficient region areas.