In today's data-driven age, it is imperative for businesses to utilize their data efficiently and in a timely manner in order to gain a competitive advantage. A major problem many organizations face, however, is that most of their data assets (±80%) are in a semi-structured or unstructured format [1], resulting in investment in costly natural language processing techniques to digitize the data [4]. The aim of this research is to provide a practical business automation workflow that can be implemented or integrated into an existing system to produce accurate digitized image data, while avoiding the errors potentially introduced by manual entry of crucial identity attributes of entities [5]. Digitization is the process of turning information into a form that can be read easily by a computer [4]. Many research papers cover the use of OCR (Optical Character Recognition) technologies available on the market today and provide comparisons between the various technologies [6]. The research conducted for this paper focused specifically on automating, validating, and accurately preparing desired data for use in an existing business process or technology. The motivation for the workflow, as well as each step built into it, is discussed in more detail in the 'Related Work' section below.
The novel full image digitization workflow proposed by the authors is shown below (Figure 1). Each step was designed to overcome the limitations or problem areas of the step before it, and each is discussed and justified in detail.
The raw extracted text generated by the OCR process forms the starting point of the business process automation. Predefined domain-specific field labels are used to identify field names in the extracted text. This is similar to the approach of Ford et al. [9], who also used a dictionary of domain-specific words during pattern matching to narrow down the candidate words to match against. The predefined field labels are maintained in an external document alias file together with a corresponding regular expression as a matching field-value. Regular expressions were chosen to help identify candidate values for a field because the values for each key share a common structure but cannot be expressed in a finite dictionary. Key-value matching becomes more complicated when multiple string values in the OCR output file conform to a specific regular expression pattern [10]. The solution was to use the field name's position on the document and to match the first successful pattern located closest to the field name. Very little literature takes this approach, as the focus of many studies was simply to suggest corrections on a GUI (graphical user interface) [9] or to measure the correctness of the OCR tools. Other studies aimed to determine which OCR tools performed best and to give a quantitative result [6]. The goal of this study was to produce a working system that generated the key-value pair dictionary for system consumption. To further improve the accuracy of the matches, ASM (approximate string matching) was added to the process to make provision for small reading errors [11]. Once the program had made the key-value matches, it created a dictionary in JSON format as the final result. With the output structured as a JSON dictionary, any system that can consume such a dictionary can use this process as an initial step in digitizing data.
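As an illustration of the final output structure, the sketch below builds such a JSON dictionary; the field names and values are hypothetical examples, not taken from an actual certificate:

```python
import json

# Hypothetical key-value pairs as produced by the matching step;
# the labels and values below are illustrative only.
matches = {
    "Taxpayer Name": "Example (Pty) Ltd",
    "Tax Reference Number": "9123456789",
    "Date of Issue": "2021-03-15",
}

# The final output is a plain JSON dictionary that any downstream
# system able to parse JSON can consume.
output = json.dumps(matches, indent=2)
print(output)
```

Because the result is plain JSON, the workflow can feed any system with a JSON parser, independent of language or platform.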
A key objective was to maximize the number of correct key-value matches (where the key and its corresponding value were both correctly read from the images and correctly paired together) and to minimize incorrect key-value matches (where a key was matched with the incorrect value or with no value at all). Another objective of the workflow was to standardize and correct minor spelling mistakes resulting from the OCR failing to recognize a character correctly. Both objectives contributed to the overall project goal: to obtain the most accurate digital dictionary representation of the image file, usable by another application or existing business system, by creating a chain of processes in which each process overcame the limitations of the one before it.
In the next section, related work and the current state-of-the-art solutions are discussed and compared, and further justification is provided for the proposed workflow and its steps. The section thereafter addresses the implementation details and the steps taken in the digitization process. This is followed by a proposed evaluation technique to measure the success of the proposed workflow. The final sections present the conclusion and future work, discussing the results of the experiment along with the next stages of continued work on this problem.
4. Implementation Details
The project goal was to implement an accurate business automation workflow that combined available technologies and tools to extract and digitize data from images, creating an accurate, universally accepted output file (key-value pairs) that could be ingested by any external system or process.
4.1. Project Input Files (Dataset)
The data input files used in developing the automation workflow were formal, publicly available tax clearance certificates (formal certificates/affidavits) that are normally obtained by registered organizations from the South African Revenue Service. The research decision on which type of image to use in the dataset was based on a typical industry requirement, with the aim of making the process extensible to other similar documents. The dataset consisted of 19 tax clearance certificate images that were scanned copies of the original documents and varied in type (JPEG vs. PNG), quality, and number of fields. An example of the image format is shown in Figure 3. The annotations on the image are explained in Table 1 in order to illustrate the effective data to be derived from the image for key-value matching pairs to be formed.
As can be seen from the annotations in Figure 3 and the descriptions in Table 1, the tax clearance certificate images contain both relevant and irrelevant fields and blocks of text. All the data from annotations 1 and 4, although present in the digitized image, were ignored for key-value identification and pairing. The keys/field names in annotation 2 are not fixed to those visible in Figure 3 but vary between the images in the dataset. The same applies to the values/field-values shown in annotation 3, as a field-value is present for each corresponding key/field name.
The digitized output to be derived from the images is the field names (annotation 2) and corresponding field-values (annotation 3) in a dictionary format, excluding all of the data represented by annotations 1 and 4 in Figure 3. The keys/field names are predefined labels of tax characteristics that a company can have. A list of such terms was compiled based on the dataset and maintained in an external alias file (Figure 1). Values/field-values are the corresponding tax values relating to a specific key. These field-values mostly have a set structure that can be successfully represented by regular expressions. The regular expressions were generated by studying the population of tax documents and are also maintained externally in the alias file (Figure 1), where each is mapped to the corresponding key (Table 2).
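A minimal sketch of what entries in such an alias file might look like, assuming a simple label-to-pattern mapping; the labels and regular expressions below are illustrative, not the actual alias file used in the project:

```python
import re

# Illustrative alias-file entries: each predefined field label maps to a
# regular expression describing the structure of its value.
ALIASES = {
    "Tax Reference Number": r"\b\d{10}\b",            # ten-digit reference
    "Date of Issue": r"\b\d{4}[-/]\d{2}[-/]\d{2}\b",  # ISO-like date
}

# Scan one OCR line against every known pattern.
line = "Tax Reference Number: 9123456789"
matched = {}
for label, pattern in ALIASES.items():
    hit = re.search(pattern, line)
    if hit:
        matched[label] = hit.group()
print(matched)
```

Keeping the labels and patterns in an external file means new document fields can be supported without changing the matching code.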
4.2. Project Approach
The project is a step-by-step workflow starting with the use of OCR (Step 1) to obtain raw text from images. Steps were iteratively added to the system (Steps 2 and 3) to better process the raw text and to identify and match the field names and corresponding field-values. Field names and field-values were coupled correctly through various methods to ensure the resulting key-value pair structure accurately reflected the information on the tax documents. The steps identified below form a process of refinement and are discussed individually to explain how each improved the accuracy of the key-value coupling.
Step 1: Raw OCR output: Initially, GCV OCR was used to read the text from the tax documents and return it as a text blob. A ground truth for each tax image file was created manually by mimicking the format returned by the GCV service. Once GCV had returned a text blob, the program identified field labels by iterating through each line using exact string matching (ESM) against keywords in the alias file. Regular expressions were used to identify the corresponding tax values for each key in the alias file. F-measure, precision, and recall were recorded for key-value matches when comparing the results to the ground truth. Analysis of the results showed that the most common causes of inaccuracy were misread characters and line ordering. As a result, field labels often could not be matched to any of the values in the text blobs, or the key and values were not in the expected order and thus were not matched together.
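A minimal sketch of the Step 1 matching logic described above, assuming a naive strategy that pairs a key with the first regex match found after it; the alias entry and OCR lines are hypothetical:

```python
import re

# Hypothetical alias entry: field label -> value pattern.
aliases = {"Tax Reference Number": r"\b\d{10}\b"}

# Simulated OCR text blob, split into lines.
ocr_lines = [
    "Tax Clearance Certificate",
    "Tax Reference Number",
    "9123456789",
]

pairs = {}
for i, line in enumerate(ocr_lines):
    for label, pattern in aliases.items():
        if label == line.strip():  # exact string matching (ESM)
            # Naive Step 1 behaviour: take the first value matching the
            # pattern on any line *after* the key.
            for candidate in ocr_lines[i + 1:]:
                value = re.search(pattern, candidate)
                if value:
                    pairs[label] = value.group()
                    break
print(pairs)
```

This forward-only scan is exactly what fails when the OCR emits a value before its key, motivating Step 2.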
Step 2: Introducing index pairing: To improve the pairing results and cater for lines that were read and processed out of order, index pairing was introduced instead of transforming the raw OCR output files. Index pairing provided a more dynamic solution for image types where the OCR output file format would differ. It often occurred that the OCR output file had a value preceding its matching key in the text file. The observation was also made that a regular expression often matched several different strings in the file, especially for field-values with a less distinguishable intrinsic structure. Due to the layout of the documents, the true pairs of keys and values were located close to each other on the document itself, and it followed that the correct value for each key was often close to the key, even if not adjacent to it. The system was refined to choose a value candidate for each key out of the pool of candidates matching the regular expression, based in part on its proximity to the corresponding key (index pairing). Figure 4 illustrates the key and value fields being read out of order at times.
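The proximity-based selection can be sketched as follows, assuming every line matching the value regex is a candidate and the one closest to the key's line index wins, regardless of whether it appears before or after the key; the pattern and OCR lines are illustrative:

```python
import re

# Hypothetical value pattern for one key.
pattern = r"\b\d{10}\b"

# Simulated out-of-order OCR output: the true value precedes its key,
# and an unrelated ten-digit string appears further away.
ocr_lines = [
    "9123456789",
    "Tax Reference Number",
    "some unrelated noise",
    "1234567890",
]

key_index = ocr_lines.index("Tax Reference Number")

# Every matching line is a candidate, scored by distance to the key.
candidates = [
    (abs(i - key_index), re.search(pattern, line).group())
    for i, line in enumerate(ocr_lines)
    if re.search(pattern, line)
]

# Index pairing: pick the candidate nearest the key's line index.
distance, value = min(candidates, key=lambda c: c[0])
print(value)
```

Here the value one line above the key wins over the more distant false candidate, which is how index pairing recovers pairs that the forward-only scan of Step 1 misses.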
Step 3: Introduce Approximate String Matching (ASM): To decrease the impact of reading inaccuracies by the OCR, ASM was introduced into the workflow in place of ESM to identify field names. Figure 5 shows a typical case for using ASM. The Levenshtein distance was used to measure the distance between two strings, with an acceptable threshold of 90% similarity for one string to be deemed equivalent to another. Once two strings were deemed equivalent, the corrected string was used in the resulting dictionary to ensure the keys produced by the system all matched one of the predefined tax characteristics in the alias file.
6. Results and Comparison
The detailed results for each metric at each step in the workflow were documented across the population of 19 files. Averages calculated from the detailed results are shown in the graphs below.
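For reference, the metrics reported in this section follow the standard definitions computed from true positive, false positive, and false negative counts; the counts below are illustrative, not the paper's actual figures:

```python
# Illustrative counts for one workflow step (not the paper's data).
tp, fp, fn = 40, 10, 8

precision = tp / (tp + fp)                 # correct matches / all matches made
recall = tp / (tp + fn)                    # correct matches / all true pairs
f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```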
As can be seen in Figure 7, a moderate number of matches were made initially, but only 56.34% of those matches were correct. This is to be expected, as a specific value could often satisfy several of the regular expression patterns. The system had no intelligence about which of the many values matching the regular expression for a key was the correct one and often simply selected the first one it found after finding a key. After the introduction of index pairing, the precision grew considerably, as not only more matches but also more correct matches were made. The increased number of matches arose because the system could now read values above and below the key to find potential matches for the regular expression. The increase in correct matches came from overcoming the irregularity the OCR tools sometimes showed of placing a value above its key: the index pairing step addressed the instances where the OCR tool read lines out of order, so a correct value could now be found for a key regardless of whether the key or the value occurred first in the text. Finally, using ASM to find the keys in the files increased the precision further. Not many more matches were made, but the matches made were correct much more often. ASM thus helped overcome the instances where the OCR misread a word due to noise or document quality issues.
In the raw OCR step, the system identified 68.7% of the true matches. As can be seen in Figure 8, the introduction of index pairing (Step 2) increased the recall, showing that the system was better able to find the actual matches from the pool of possible matches. Again, this was due to the disorganized way in which the keys and values were read by the OCR. Finally, the recall increased dramatically and almost all the true pairs were identified from the text.
As can be seen in Figure 9, the F-measure increased significantly after index pairing (Step 2) was introduced to the workflow and improved significantly again once ASM was added. The precision, recall, and F-measure conclusions are well supported by the graphs above.
Figure 10a shows a dramatic increase in the number of true positive matches from the start point of the system to the end point. The matches made became increasingly more correct as the flaws of the OCR tools were overcome one by one: first by reorganizing the way the document output was read, and then by allowing some level of misread characters to be processed correctly. As with the precision graph, the average rose sharply from the first step to the second, and then showed another, less dramatic increase. This trend, in combination with the dramatic drop in false negative matches (as seen in Figure 10b), contributes to the overall shape of the precision and recall curves.