Voting-Based Document Image Skew Detection

Abstract: Optical Character Recognition (OCR) is an indispensable tool for technology users nowadays, as so much of our natural language is presented as text. We need information at hand in every circumstance and, at the same time, we need machines to understand visual content, enabling users to search through large quantities of text. To detect textual information and page layout in an image, the page must be properly oriented. This is the problem of so-called document deskew, i.e., finding the skew angle and rotating by its opposite. This paper presents an original approach which combines various algorithms that solve the skew detection problem, with the purpose of always having at least one to compensate for the others' shortcomings, so that any type of input document can be processed with good precision and solid confidence in the output result. The tests performed show that the proposed solution is robust and accurate, making it suitable for large-scale digitization projects.


Introduction
Whenever documents are manually scanned, a human-made orientation error almost always occurs, preventing the Optical Character Recognition (OCR) engine from properly detecting the scanned content. Thus, we can say that almost all documents that are manually scanned are also skewed.
The problem of skew detection and correction is not a new one [1]. In order to ensure a correct layout analysis of image documents, or to have a high accuracy in the OCR process output, the input images must be skew free, otherwise the rows and columns (in the case of multi-column documents) might not be detected properly [2].
There have been numerous attempts to solve the deskew [3] problem, occasionally accompanied by text orientation detection [4,5]. The proposed approaches range from projection profiling [6], neighborhood clustering [7,8], Hough transform applied on different selected key-points [9][10][11], Fast Fourier Transform (FFT) [12], Principal Component Analysis (PCA) [13], Radon transform [14], vertical projections [15], morphology [16], machine learning approaches [17], etc. They obtain robust and accurate results in most cases, provided that the elements and properties they try to detect and measure are present in the input image. However, this is not always the case. Projection profiling and neighborhood clustering need consistent lines of paragraph-like text, the former working best in single-column layouts. They do not work well if significant chunks of text are not available. The Hough transform may well detect table-like structures and baseline-dominant text orientation, but if tables or line-like separators are not present and the available text is not bottom-aligned, this method will fail or, even worse, will return erroneous results. FFT may assist in obtaining a dominant alignment angle in the input image, no matter what the content of the image is, but its success rate is no better than the aforementioned methods. The main idea that derives from here is that one can develop a voting scheme using multiple experts' decisions and confidence values [18][19][20], thus compensating the shortcomings or erroneous decisions of one approach with the benefits of another. The current paper aims to offer an application-oriented, practical solution with a robust real-world behavior, and not just a general, theoretically oriented framework based on multiple experts' decisions as an alternative to classical approaches.
Three skew detection algorithms were chosen to cover all the facets of the skew-related problems: the difficulty of finding the text orientation due to the shortage of aligned texts, the difficulty in finding lines and tables which may guide in finding the correct skew, the difficulty in understanding the image at all (thus the need to analyze it in the frequency domain instead of the value domain) and not just to exemplify the superiority of a voting-based decision mechanism over an independent decision. This paper presents an approach to deskewing documents, designed for general purpose use, for generic documents that contain different text or geometrical features that may be used in finding a correct orientation. We combine the outputs of three different algorithms, each one also outputting a confidence measure in its work. The confidence is used to characterize the candidate for the voting, the skew angle result computed by each algorithm. The effect of having algorithms of various classes which operate on different aspects of the input image is that the winner algorithm compensates for the shortcomings of the other two. An aggregate confidence of the entire solution is also provided, so that it may be fully integrated into a real-world, mass production digital content producer application.
We will consider that the absolute value of the skew angle never exceeds 10°, or else the document is skewed on purpose. The skew can in principle be any value, and the algorithms do not need this limitation in order to work, but imposing it considerably saves execution time. So, we define the Skew Angle Search Space (SASS) to be the interval [−10°, 10°].

Skew Detection Algorithms
We present three approaches to detect skew, the first making use of the frequency domain, and the other two using the spatial domain exclusively.

Mechanism
For tilted pages, near-vertical and near-horizontal lines occur in the frequency domain, due to the definition of the Fourier Transform and because text documents are composed of mostly vertical and horizontal sinusoids.
For an image which contains stripes, the Fast Fourier Transform (FFT) [21] yields points which align in a straight line, passing through the center of the FFT image and perpendicular to the stripes, as shown in Figure 1.
Appl. Sci. 2020, 10, 2236 2 of 12

Lines in the frequency domain emerge from the patterns of parallel lines of text and horizontal/vertical separators, including whitespace. The angles of these lines lead to the document skew angle.
The FFT image is binarized with Otsu [22] and then the Hough transform [23] is applied to detect the dominant lines. Only lines that pass through the center of the FFT image are of interest. The skew angle is calculated as the angle between each dominant line and the closest axis. A weighted sum of these angles is then returned. The vertical line represents horizontal stripes in the input image, which is the information of the text rows, and the horizontal line represents vertical separators, including whitespace.
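As a minimal NumPy sketch of the first two steps (computing the FFT image and binarizing it with Otsu), with illustrative names not taken from the paper; the synthetic striped page stands in for rows of text:

```python
import numpy as np

def fft_magnitude(image):
    """Centered log-magnitude spectrum of a grayscale image in [0, 1]."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # Log scale compresses the huge dynamic range around the DC component.
    return np.log1p(np.abs(spectrum))

def otsu_threshold(values, bins=256):
    """Otsu's method: pick the threshold maximizing between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    hist = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for k in range(1, bins):
        w0, w1 = hist[:k].sum(), hist[k:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (hist[:k] * centers[:k]).sum() / w0
        mu1 = (hist[k:] * centers[k:]).sum() / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_t = between, centers[k]
    return best_t

# Synthetic page: horizontal "text line" stripes every 8 pixels.
page = np.zeros((128, 128))
page[::8, :] = 1.0
mag = fft_magnitude(page)
binary = mag > otsu_threshold(mag.ravel())
```

For purely horizontal stripes, the surviving foreground pixels concentrate on the vertical line through the spectrum center, as described above.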

Improvements
The FFT image presents the challenge of detecting the dominant lines which represent the document orientation. It is noisy because of the high-frequency signals generated by text, small symbols spread across the whole document, images, etc. The FFT image also lacks contrast. These two shortcomings make it difficult for the Hough transform to identify the dominant lines which give the document orientation.
We solved these issues by stretching the contrast of the FFT image and then equalizing its histogram, before Otsu binarization and feeding it to the Hough transform procedure. These operations help the binarization emphasize the dominant lines.
We have experimented with various methods of emphasizing the correct dominant lines by processing both the input image and the FFT image. Empirically, we came to the conclusion that there is no need to process the input image in the spatial domain, as its FFT already contains all the orientation information visually; altering it only degrades the dominant lines in the frequency domain. So, we only processed the FFT image.
To reduce the noise in the FFT image, we initially tried blurring in the spatial domain in order to filter out high frequencies, despite the information loss in the input image, but quickly discarded this approach, as it produces fainter dominant lines in the frequency domain. Dilation [24] of the page content also produces incomplete lines in the FFT. Both make the binarization produce gaps in the dominant lines and make it impossible to detect them using the Hough transform.
We then experimented with blurs in the frequency domain (namely Gaussian blur, box blur, bilateral filter, and median filter [25]) before binarization. They partially worked, but none turned out to be nearly as effective as contrast stretching followed by histogram equalization, and even when used together with these two, blurs do not improve accuracy.
In Figure 2, we present the result of contrast stretching and histogram equalization.
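The two enhancement operations can be sketched in NumPy as follows; the function names and the 256-bin count are illustrative choices, not prescribed by the paper:

```python
import numpy as np

def stretch_contrast(img):
    """Linearly map the image's [min, max] range onto [0, 1]."""
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)

def equalize_histogram(img, bins=256):
    """Map intensities through their empirical CDF so the output histogram is flat."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    cdf = hist.cumsum().astype(float)
    cdf /= cdf[-1]
    # Each pixel is replaced by the fraction of pixels at or below its level.
    idx = np.clip((img * (bins - 1)).astype(int), 0, bins - 1)
    return cdf[idx]

# A dim, low-contrast stand-in for a raw FFT magnitude image.
spectrum = np.random.default_rng(0).gamma(2.0, 0.05, size=(64, 64))
stretched = stretch_contrast(spectrum)
enhanced = equalize_histogram(stretched)
```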

Confidence
We define a confidence metric (Equation (1)), which depends on the strength of the detected dominant lines in an FFT image I. It is the absolute value of the difference between two quantities: the mean color of all pixels L within some fixed distance of the dominant line, and the mean color of the rest of the image, I\L, the complement of L:

confidence = |mean(L) − mean(I\L)|^τ_1 (1)

We consider black to be zero and white to be one, so the confidence is also between zero and one. The obtained measure is fine-tuned by being raised to a positive exponent τ_1, since a balance between all algorithms' confidences will be enforced by employing a power-law relation, and the corresponding τ_i will be chosen to ensure equal average success confidence (i.e., excluding confidence in failure cases).
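A plausible NumPy sketch of this metric under the definitions above; the band half-width, the angle convention, and the name `line_confidence` are assumptions, not from the paper:

```python
import numpy as np

def line_confidence(fft_img, angle_rad, band=2.0, tau1=1.0):
    """|mean over pixels near the dominant line - mean over the rest| ** tau1.

    fft_img values are assumed normalized to [0, 1]; the line passes through
    the image center at angle_rad from the horizontal axis."""
    h, w = fft_img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Distance of each pixel to the line: project onto the line's unit normal.
    nx, ny = -np.sin(angle_rad), np.cos(angle_rad)
    near = np.abs((xs - cx) * nx + (ys - cy) * ny) <= band
    return abs(fft_img[near].mean() - fft_img[~near].mean()) ** tau1

# A bright horizontal line through the center: high confidence at 0 rad, low at pi/2.
demo = np.zeros((65, 65))
demo[32, :] = 1.0
c_along = line_confidence(demo, 0.0)
c_across = line_confidence(demo, np.pi / 2)
```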

Projection Profiling Variant
Instead of doing line detection with the Hough transform on the FFT image, we found the skew angle via the maximization of the variance in the projection space, as described in Section 2.2. We are interested in the variance with respect to both axes, centered in the middle of the FFT image. This is shown in Figure 3 with blue. This alternative to the Hough line detection proves to be more stable, though the accuracy is approximately the same. Therefore, for both benchmarks and the fine-tuning exponent search, we used projection profiling in the frequency domain.


Projection Profiling Row Align

Mechanism
The projection profiling method [26] computes histograms on the vertical axis of the document, accumulating the black pixel count of each horizontal scanline, assuming black text on a white background. The algorithm sweeps a scanline from the top to the bottom of the document, writing the number of black pixels it contains into a buffer, at an index proportional to the scanline's current vertical position in the document.
We refer to this buffer as the histogram or the scanline pixel count buffer. Its size is fixed, set by the user, and is referred to as the histogram resolution.
The histogram will present the maximum variance when the text is skew-free, because scanlines containing complete horizontal slices of lines of text fall above the mean, while scanlines containing only whitespace fall below it. In other words, the skew angle is computed as the angle in the SASS which produces maximal scanline variance. The process is illustrated in Figure 4. The variance σ² of the histogram H of size N is given by the equation:

σ²(H(θ)) = (1/N) · Σ_{k=1}^{N} (H_k − H̄)²

where H_k, accumulated over all connected components' axis-aligned bounding boxes (AABBs), is the number of top or bottom points that lie on the same horizontal scanline, at a document height corresponding to position k in the histogram. As the resolution of H is fixed, k is not equal to the y coordinate of the points in document space. H̄ is the mean of H and θ is the angle by which the points are rotated.
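The histogram construction and its variance can be sketched directly from these definitions (illustrative names; bucket index proportional to y, as described):

```python
import numpy as np

def scanline_histogram(ys, doc_height, resolution):
    """Count points per horizontal scanline bucket; index k is proportional to y."""
    k = np.clip((ys / doc_height * resolution).astype(int), 0, resolution - 1)
    return np.bincount(k, minlength=resolution)

def hist_variance(H):
    """sigma^2(H) = (1/N) * sum_k (H_k - mean(H))^2"""
    H = np.asarray(H, dtype=float)
    return ((H - H.mean()) ** 2).mean()

# Aligned rows of points produce a spiky histogram (high variance);
# uniformly spread points produce a flat one (low variance).
v_aligned = hist_variance(scanline_histogram(np.repeat(np.array([10.0, 30.0, 50.0]), 20), 60.0, 60))
v_spread = hist_variance(scanline_histogram(np.linspace(0.0, 59.0, 60), 60.0, 60))
```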

Improvements
First, the image is binarized via standard bimodal Otsu thresholding. Then, we find the connected components in the image and compute their axis-aligned bounding boxes (AABBs) [27]. The purpose is that we only try to align points of interest into horizontal rows, landmarks of connected components which represent text symbols.
The top and bottom midpoints of the AABBs' sides are the only points we need to rotate in order to find the skew angle. We illustrate the mechanism in Figure 5. To avoid computation with elements that are not text symbols, we compute the average width and height of the AABBs. If an AABB of a connected component is considerably larger than this average, we discard it from the rotation phase, as it does not represent a text symbol, or does not belong to a block of text.
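A minimal pure-Python/NumPy sketch of this step; the BFS labeling routine and the `size_factor` cutoff standing in for "considerably larger" are assumptions:

```python
import numpy as np
from collections import deque

def component_aabbs(binary):
    """Axis-aligned bounding boxes (y0, x0, y1, x1) of 4-connected foreground components."""
    h, w = binary.shape
    seen = np.zeros_like(binary, dtype=bool)
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and not seen[sy, sx]:
                q = deque([(sy, sx)])
                seen[sy, sx] = True
                y0 = y1 = sy
                x0 = x1 = sx
                while q:
                    y, x = q.popleft()
                    y0, y1 = min(y0, y), max(y1, y)
                    x0, x1 = min(x0, x), max(x1, x)
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                boxes.append((y0, x0, y1, x1))
    return boxes

def text_midpoints(boxes, size_factor=3.0):
    """Top/bottom midpoints of AABBs, discarding boxes much larger than average."""
    if not boxes:
        return []
    avg_w = sum(x1 - x0 + 1 for _, x0, _, x1 in boxes) / len(boxes)
    avg_h = sum(y1 - y0 + 1 for y0, _, y1, _ in boxes) / len(boxes)
    pts = []
    for y0, x0, y1, x1 in boxes:
        if x1 - x0 + 1 > size_factor * avg_w or y1 - y0 + 1 > size_factor * avg_h:
            continue  # likely an image or a separator, not a text symbol
        mx = (x0 + x1) / 2.0
        pts.append((y0, mx))  # top midpoint
        pts.append((y1, mx))  # bottom midpoint
    return pts

# Two small "symbols" on an empty page.
page = np.zeros((10, 10), dtype=bool)
page[2:4, 2:4] = True
page[6:8, 6:8] = True
boxes = component_aabbs(page)
points = text_midpoints(boxes)
```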


With a small step, such as a tenth of a degree, we compute the variance of the document's bottom and top histograms for all angles in the SASS and pick the winner skew angle as the one which gives the greatest sum of the aforementioned variances.
It is unknown where the global maximum of the document's scanline variance, as a function of the angle, lies, so it is safest to search the whole SASS rather than using a fast-convergence heuristic like binary search or gradient descent. Unlike them, the brute-force search is guaranteed to find the global maximum. The search space contains 200 or 300 angle candidates, and the scanline variance for one angle is computed in less than 100 microseconds, because we rotate at most a couple of thousand points per iteration. Therefore, the brute-force search is the recommended approach here.
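The brute-force sweep can be sketched as follows, assuming points are given as (y, x) pairs; the resolution, the step, and the coordinate convention are illustrative choices:

```python
import numpy as np

def detect_skew(points, resolution=512, step_deg=0.1, sass_deg=10.0):
    """Brute-force the angle in [-sass_deg, +sass_deg] maximizing scanline variance.

    points: array of (y, x) coordinates (e.g., AABB top/bottom midpoints)."""
    pts = np.asarray(points, dtype=float)
    angles = np.arange(-sass_deg, sass_deg + step_deg / 2, step_deg)
    variances = np.empty(len(angles))
    for i, a in enumerate(angles):
        t = np.deg2rad(a)
        # Only the y coordinate of the rotated points matters for the histogram.
        y_rot = pts[:, 0] * np.cos(t) - pts[:, 1] * np.sin(t)
        y_rot -= y_rot.min()
        span = max(float(y_rot.max()), 1e-9)
        k = np.clip((y_rot / span * (resolution - 1)).astype(int), 0, resolution - 1)
        H = np.bincount(k, minlength=resolution)
        variances[i] = ((H - H.mean()) ** 2).mean()
    return float(angles[variances.argmax()]), variances

# Synthetic check: three "text baselines" drawn with a 2-degree skew.
xs = np.arange(0.0, 200.0, 5.0)
slope = np.tan(np.deg2rad(2.0))
demo_points = np.array([(x * slope + c, x) for c in (50.0, 100.0, 150.0) for x in xs])
found_angle, _ = detect_skew(demo_points)
```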
However, by sampling the variance for all angles in SASS, we are able to define a confidence metric, as described in Section 2.2.3.

Confidence
Let σ_{σ²} be the standard deviation of the variance function σ²(H(θ)) as discussed in Section 2.2.1. Let σ̄² be the mean of the histogram variances for all θ ∈ SASS. If θ_skew ∈ SASS is the angle which produces the maximum value of σ², namely σ²(H(θ_skew)), then we can define the confidence measure as follows: As the denominator term represents the variance deviation for the found skew angle, we can see that as it increases, the confidence also increases. This measures how much of a difference the shift in the angle space has made for the variance when θ_skew maximized it. τ_2 is a positive exponent used for fine-tuning the confidence in the voting process to match the capabilities of the other algorithms.

Mechanism
This method uses horizontal and vertical lines in the document detected via the Hough transform [28] to compute the skew angle. We look for both vertical and especially for horizontal lines, which are most likely to be detected within lines of text. The skew angle is computed as a weighted sum of angles given by the detected lines. For the summation, these angles are weighted by properties of the lines. The algorithm performs best when there are long lines in the document, such as tables and separators.
First, the Sobel [29] edge detection filter is applied, in order to empty the interior of the connected components. This prepares the image for the line detection procedure, preventing shape-filling pixels from being counted in the line search.
After we apply the probabilistic Hough transform, we are left with a set of line segments, each one representing some vertical or horizontal separator, some line of a table, some line in an image, a line of text, or some line emerging at a random angle from within the placement of the symbols, randomly cutting through a block of text. The latter must be discarded, as such lines do not represent the orientation of the document. Thus, we must sort the obtained segments by some criteria. Lines outside the SASS are discarded.
We take all pairs of segments and build an adjacency matrix which contains one in the cell [i,j] if line i and line j may be considered parallel and zero otherwise. Two lines are considered parallel if the absolute value of the dot product of their support unit vectors is greater than one minus some small threshold ε. The computed dot products are also saved in a matrix for later use.
We partition all segments into disjoint sets, each set containing lines claimed to be parallel, i.e., for any pair of lines within a set, the adjacency matrix has a value of 1. For this step, an optimized disjoint sets union-find data structure is used, with path compression for the find operation and union by size heuristics. For every resulting set of parallels, we compute an average line, referred to as the set representative, which gives the orientation of the set.
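The grouping step can be sketched with a small union-find structure; the names and the ε value are illustrative:

```python
import math

class DisjointSets:
    """Union-find with path compression (halving) and union by size."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

def group_parallel(segments, eps=0.01):
    """Group segments whose unit direction vectors satisfy |u . v| > 1 - eps."""
    def unit(seg):
        (x0, y0), (x1, y1) = seg
        dx, dy = x1 - x0, y1 - y0
        n = math.hypot(dx, dy)
        return dx / n, dy / n

    units = [unit(s) for s in segments]
    dsu = DisjointSets(len(segments))
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            dot = abs(units[i][0] * units[j][0] + units[i][1] * units[j][1])
            if dot > 1.0 - eps:
                dsu.union(i, j)
    groups = {}
    for i in range(len(segments)):
        groups.setdefault(dsu.find(i), []).append(i)
    return list(groups.values())

# Two near-horizontal segments and one vertical one: two parallel sets.
segs = [((0.0, 0.0), (10.0, 0.0)), ((0.0, 5.0), (10.0, 5.1)), ((0.0, 0.0), (0.0, 10.0))]
groups = group_parallel(segs)
```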
If all line segments are in one set, then the returned skew angle is equal to the angle between the representative line of the set and the closest primary axis.
If there is more than one set, we sort the sets descending by the sum of their segments' length and pick the first three or four sets (or two, if there are only two sets) as candidates for the horizontal and vertical axes of the document. If two of the sets have representatives that can be considered perpendicular, then the representatives will be picked as the document's horizontal and vertical axes and the skew angle is computed as a weighted sum of the angles each representative makes with the closest primary axis. If there are no perpendicular representatives, then the first set in the sorted candidates is picked as the horizontal or vertical axis of the document, depending on the angle it makes with the primary axes.
In Figure 6, document orientation axes are shown passing through the center of processed images (black background), and for some images (e.g., the three rightmost images) only one axis was detected.

Confidence
The confidence for this algorithm (Equation (4)) is computed as a product of one or multiple accuracy indicators, depending on which case we are in.
For any set of parallel lines, we can compute the average absolute dot product, absDot, of the support vectors over all possible line pairs within the set. The closer this average is to one, the more parallel we can consider the lines, so the greater the algorithm's confidence should be.
If we only have one set A, whose representative A⃗ is pronounced as one of the document's axes, then only the first factor (absDot_A) is returned as the method's confidence. For two sets of segments, A and B, whose representatives A⃗ and B⃗ are pronounced as the document's vertical and horizontal axes, we can also account for how close to perpendicular A⃗ is to B⃗ by computing their dot product. The second factor in the confidence formula increases the measure when A⃗ and B⃗ tend to be perpendicular and decreases it when they tend to be parallel.
The third factor contributes the ratio between the sum of lengths of all segments in A and B, which we write as ||A|| and ||B||, respectively, and the sum of lengths of all segments in all sets, including A and B. This encourages the contribution of the longer segments in the final confidence formula, because longer segments are both more reliably detected and their support angle is more accurate.

Again, τ_3 is a positive exponent used for fine-tuning the confidence to achieve balance and accuracy in the voting process.
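Since Equation (4) itself is not reproduced above, the following is only a plausible reading of the three factors; in particular, the perpendicularity factor 1 − |A⃗·B⃗| is a hypothesized form, not taken from the paper:

```python
def avg_abs_dot(units):
    """Average |u_i . u_j| over all line pairs in a parallel set (1.0 if fewer than two)."""
    pairs = [(i, j) for i in range(len(units)) for j in range(i + 1, len(units))]
    if not pairs:
        return 1.0
    return sum(abs(units[i][0] * units[j][0] + units[i][1] * units[j][1])
               for i, j in pairs) / len(pairs)

def hough_confidence(set_a, set_b=None, other_length=0.0, tau3=1.0):
    """set_* = (unit_direction_vectors, representative_unit_vector, summed_length)."""
    units_a, rep_a, len_a = set_a
    conf = avg_abs_dot(units_a)  # first factor: how parallel set A really is
    if set_b is not None:
        units_b, rep_b, len_b = set_b
        conf *= avg_abs_dot(units_b)
        # Hypothesized second factor: 1 for orthogonal representatives, 0 for parallel.
        conf *= 1.0 - abs(rep_a[0] * rep_b[0] + rep_a[1] * rep_b[1])
        # Third factor: length of A and B relative to all detected segments.
        conf *= (len_a + len_b) / (len_a + len_b + other_length)
    return conf ** tau3

# Perfectly parallel horizontal set plus an orthogonal vertical set.
horizontal = ([(1.0, 0.0), (1.0, 0.0)], (1.0, 0.0), 100.0)
vertical = ([(0.0, 1.0)], (0.0, 1.0), 50.0)
c_two_axes = hough_confidence(horizontal, vertical)
c_one_axis = hough_confidence(horizontal)
```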

Voting
In Section 2, we described three algorithms for computing the skew angle of a document, each accompanied by an expression defining a confidence level in its result, as a number ranging between zero and one. In this section, we describe mechanisms that use the result and confidence of each algorithm to accurately determine the skew angle.
Besides the skew angle, an aggregate confidence measure of the entire system is also offered in the end. A user of a real-world document image processing application may use this measure to select the low-confidence image candidates (if any) and inspect them visually. The aggregate confidence is obtained analogously to the skew angle for each of the following three voting methods and, as a result, it will not be further detailed.

Best First Voting
This method is the most basic, as it picks the result of the algorithm that returned the greatest confidence measure, completely discarding the others. It performs as expected, masking the faults of individual algorithms in various situations, but it does not combine results even when the confidences are close.
Another version of this policy returns a confidence-weighted sum (or simply the mean) of the two most confident candidates, ranked by descending confidence, while completely discarding the third.
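The two policies just described can be sketched as follows; the function names and the (angle, confidence) pair layout are assumptions of ours, not the paper's:

```python
def best_first(candidates):
    """candidates: list of (skew_angle_degrees, confidence) pairs,
    one per skew-detection algorithm. Keep only the most confident result."""
    angle, _ = max(candidates, key=lambda c: c[1])
    return angle

def best_two_weighted(candidates):
    """Variant: confidence-weighted mean of the two most confident
    candidates; the third is discarded entirely."""
    top = sorted(candidates, key=lambda c: c[1], reverse=True)[:2]
    total = sum(conf for _, conf in top)
    return sum(angle * conf for angle, conf in top) / total
```

For three candidates with confidences 0.9, 0.4 and 0.7, `best_first` returns the first angle unchanged, while `best_two_weighted` blends the first and third, ignoring the 0.4 candidate.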

Weighted Voting
This policy returns a weighted sum of the results returned by the algorithms, where the weights employed in the voting process are exactly the confidences returned by each skew detection method. This has the advantage of combining the results, increasing the tolerance to errors. However, there must be a confidence threshold below which a candidate is discarded; otherwise, candidates with low confidence, i.e., likely inaccurate results, would hijack the candidates that performed successfully. The low/high-confidence threshold is empirically set at 50%.
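A minimal sketch of this policy, under the same assumed (angle, confidence) pair layout:

```python
LOW_CONFIDENCE_THRESHOLD = 0.5  # empirical 50% threshold from the text

def weighted_vote(candidates, threshold=LOW_CONFIDENCE_THRESHOLD):
    """Confidence-weighted mean of all candidates at or above the threshold.

    Returns None when every algorithm is low-confidence, so the image can
    be flagged for visual inspection instead of silently averaging noise."""
    kept = [(angle, conf) for angle, conf in candidates if conf >= threshold]
    if not kept:
        return None
    total = sum(conf for _, conf in kept)
    return sum(angle * conf for angle, conf in kept) / total
```

The `None` return for the all-low-confidence case is our design choice for the sketch; it mirrors the paper's idea of surfacing low-confidence images to the user.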

Unanimous Voting
All candidates are associated with equal weights for the voting process, except the low-confidence ones (those below a threshold empirically set at 50%), in which case their weight in the voting process becomes null, as described in Section 3.2.
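Sketched in the same style (again with our assumed data layout), the unanimous policy reduces to an equal-weight mean over the confident candidates:

```python
def unanimous_vote(candidates, threshold=0.5):
    """Equal-weight mean of every candidate whose confidence reaches the
    threshold; low-confidence candidates get a null weight, as in the text."""
    kept = [angle for angle, conf in candidates if conf >= threshold]
    if not kept:
        return None
    return sum(kept) / len(kept)
```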

Tests and Results
Our test dataset contains over 300 images, representing both documents scanned by hand and digital documents converted into images. By introducing a known random skew angle on each, we could measure the accuracy of the system exactly, knowing a priori what it should return.
In Table 1, we show a benchmark performed on our digital dataset with introduced skew. The test dataset contains a large diversity of document formats: from illustrated textbooks to scientific papers, poetry books, working sheets, art magazines, newspapers, CAD parts and more. It has been carefully selected such that there is diversity in structure, fonts, content (images/text) and abundance of orientation information. This dataset has been processed by applying random skew angles to each image and repeatedly fed to all three algorithms and the voting process while measuring accuracy and performance at each step. The results shown in Table 1 contain the following:
• execution time on a 64-bit Windows 7 system, i7-8550U CPU, using CPU-only OpenCV 4.01 routines, Microsoft Visual Studio 2017, with full compiler optimization for speed;
• average error, computed as the average absolute value of the difference between the known introduced skew angle and the skew angle detected by our system, for success cases, i.e., those where the confidence is greater than some threshold (empirically chosen as 50%);
• average confidence, taking into account both success and failure cases;
• uptime: the ratio between the number of success cases and the total number of cases.
Notes:
a) The execution times marked with * were obtained using the article's demonstrator application.
b) The execution times (and the subsequent deskewed images) marked with ** were obtained using CCS's DocWorks 6.8 commercial solution for creating digital libraries [30]. As a result, they should be used only qualitatively. Also, since confidence measures are not defined in this case, and they were also considered when measuring the uptime, no uptime can be provided for comparison in these cases.
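The bookkeeping behind the average-error, average-confidence and uptime metrics can be sketched as follows. This is a minimal illustration, assuming each test case is summarized as a (true angle, detected angle, confidence) triple; the function name and data layout are ours, not the paper's:

```python
def benchmark(results, threshold=0.5):
    """results: list of (true_angle, detected_angle, confidence) triples,
    one per test image. Returns (avg_error, avg_confidence, uptime).

    avg_error is computed over success cases only (confidence >= threshold);
    avg_confidence covers both success and failure cases;
    uptime is the fraction of success cases."""
    successes = [(t, d) for t, d, c in results if c >= threshold]
    avg_error = (sum(abs(t - d) for t, d in successes) / len(successes)
                 if successes else float("nan"))
    avg_conf = sum(c for _, _, c in results) / len(results)
    uptime = len(successes) / len(results)
    return avg_error, avg_conf, uptime
```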
For fine-tuning the exponent parameters of each algorithm's confidence, we used brute-force search, that is: for every triplet of exponents (τ1, τ2, τ3) ∈ [1 − ϕ, 1 + ϕ]³, discretized with a fixed step ε, we compute the average error over the entire dataset for every voting process. We picked ϕ = 0.8 and the step between elements in the interval ε = 0.025.
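The exhaustive search over the exponent grid can be sketched as below, with the stated settings ϕ = 0.8 and ε = 0.025. The `avg_error_fn` callback, standing in for the per-voting-process average-error evaluation over the dataset, is a hypothetical placeholder of ours:

```python
def grid(phi=0.8, eps=0.025):
    """Discrete samples of [1 - phi, 1 + phi] with step eps (65 values)."""
    n = int(round(2 * phi / eps)) + 1
    return [1 - phi + i * eps for i in range(n)]

def tune(avg_error_fn, phi=0.8, eps=0.025):
    """Brute-force search over (tau1, tau2, tau3) in [1 - phi, 1 + phi]^3.

    avg_error_fn(t1, t2, t3) -> average error over the dataset (placeholder).
    Returns the triplet with the smallest average error."""
    axis = grid(phi, eps)
    best_err, best_taus = None, None
    for t1 in axis:
        for t2 in axis:
            for t3 in axis:
                err = avg_error_fn(t1, t2, t3)
                if best_err is None or err < best_err:
                    best_err, best_taus = err, (t1, t2, t3)
    return best_taus
```

With these settings the search evaluates 65³ = 274,625 triplets, which is cheap as long as a single average-error evaluation is fast or precomputed.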
The brute-force search results in the (τ1, τ2, τ3) space are presented in Table 2. However, the system is not overly sensitive to the τ1, τ2, τ3 values, delivering robust results for most combinations as long as τ1 < τ2 < τ3. The Section Supplementary Materials contains a reference to the source code and the benchmarking dataset for this project.

Conclusions
The algorithms backed each other up in failure cases, resulting in almost 100% uptime. The failure cases were, in fact, splash pages containing images, with almost no visual orientation information, such as lines aligned with the document's primary axes.
The Best First voting method turned out to give the best results. The average uptime for this voting policy is 99.5%, as only two images, which arguably do not contain any orientation information, were not deskewed correctly; both images contain organic paintings.
Having multiple algorithms attack the same problem, each performing well on a different aspect of it and each able to estimate a confidence measure for its own result, is an ideal setting for running them all and returning a result determined by vote, which is closer to the ground truth than that of any single candidate algorithm.
By fine-tuning the confidences of the candidate algorithms, we obtained a better uptime and overall accuracy.
The technology presented in this paper was developed and employed in the "Lib2Life-Revitalizing Libraries and Cultural Heritage through Advanced Technologies" project [31], aimed at both researching document image techniques suitable for automatic content creation and building a large repository consisting of millions of pages of books, newspapers and journals with diverse content, spread across multiple time periods. Due to the large variation and unpredictability of the processing input, a strong need emerged for algorithms that do not require user supervision.
The voting-based deskew method was validated on a very demanding dataset, encompassing all kinds of complex layouts, tables, alignments and illustrations. Moreover, it has so far been used successfully in the first stage of the document image processing pipeline of the aforementioned project, on a little more than half a million pages, with the results visually examined by dense sampling. Very encouragingly, no erroneous page rotations were reported, so one can conclude that the proposed approach is indeed very robust and ideally suited for the mass production of digital content.
Supplementary Materials: The source code and the benchmarking dataset for this project may be found at the following link: https://gitlab.com/adix64/document-deskewer.