Companies increasingly rely on artificial intelligence and machine learning to enhance and automate existing business processes. While OCR (Optical Character Recognition) technologies can digitize image data, the digitized text must still be validated and enhanced to meet the data quality standards that make it usable. This research paper focuses on creating an automated workflow that follows image digitization and produces a dictionary containing the desired information. The workflow is a three-step process applied after the OCR output has been generated, and each step increases the accuracy of the key-value matches between field names and field values. The first step takes the raw OCR output and identifies field names using exact string matching and field values using regular expressions from an externally maintained file. The second step introduces index pairing, which matches field values to field names based on the location of each field name and value on the document. The third step adds approximate string matching, which further increases accuracy. The F-measure for key-value pair matches is 60.18% after the first step, 80.61% once index pairing is introduced, and 90.06% after approximate string matching is introduced. The research shows that accurate, usable data can be obtained automatically from images by implementing a workflow after OCR.
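The three steps described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper. It assumes that the OCR output is a list of text tokens in reading order, that the token index stands in for the location on the document, and that the externally maintained file reduces to the hypothetical `FIELD_PATTERNS` mapping. The function names, the patterns, and the similarity cutoff of 0.8 are likewise illustrative assumptions, and approximate matching is done here with `difflib` from the Python standard library.

```python
import re
from difflib import get_close_matches

# Hypothetical stand-in for the externally maintained file of field names
# and value patterns; the real file's format is not given in the source.
FIELD_PATTERNS = {
    "Invoice Number": r"INV-\d{4}",
    "Date": r"\d{2}/\d{2}/\d{4}",
    "Total": r"\$\d+\.\d{2}",
}


def find_field_names(tokens, fuzzy=False, cutoff=0.8):
    """Return {token index: canonical field name} for tokens naming a field."""
    found = {}
    for i, token in enumerate(tokens):
        if token in FIELD_PATTERNS:
            # Step 1: exact string matching of field names.
            found[i] = token
        elif fuzzy:
            # Step 3: approximate matching to recover OCR misreads.
            close = get_close_matches(token, list(FIELD_PATTERNS), n=1, cutoff=cutoff)
            if close:
                found[i] = close[0]
    return found


def find_values(tokens):
    """Return {token index: field whose pattern the whole token matches}."""
    found = {}
    for i, token in enumerate(tokens):
        for field, pattern in FIELD_PATTERNS.items():
            # Step 1: regular expressions identify candidate field values.
            if re.fullmatch(pattern, token):
                found[i] = field
                break
    return found


def extract(tokens, fuzzy=False):
    """Build the output dictionary of field name -> field value."""
    names = find_field_names(tokens, fuzzy=fuzzy)
    values = find_values(tokens)
    result = {}
    for name_idx, field in names.items():
        # Step 2: index pairing. Pick the nearest matching value that
        # appears after the field name (an assumed pairing rule).
        candidates = [i for i, f in values.items() if f == field and i > name_idx]
        if candidates:
            result[field] = tokens[min(candidates)]
    return result


# "Invoice Nurnber" imitates a common OCR misread of "m" as "rn".
tokens = ["Invoice Nurnber", "INV-1234", "Date", "01/02/2020", "Total", "$45.00"]
exact_only = extract(tokens)
with_fuzzy = extract(tokens, fuzzy=True)
```

In this example `exact_only` contains only the `Date` and `Total` fields, because the misread field name fails the exact match. With `fuzzy=True` the `Invoice Number` field is recovered as well, which mirrors the kind of gain attributed to the approximate string matching step.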
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.