Summary: | As documents are one of the main tools for storing and communicating information, there have been a large amount of eff orts towards developing methods to parse information from them automatically. While many parts of this industry are automated, there are still scenarios where certain types of documents cannot be read by machine with high accuracy and throughput. It becomes especially more difficult when the documents are semi-structured, or in other words have widely varying formats. With the significant leaps in optical character recognition, computer vision, and natural language processing, there have been great progress towards this problem. In this paper, we propose two pipeline designs that utilize these newer techniques to extract information from semi-structured documents in a structured output format. The two pipelines are the fully automated pipeline and semi automated pipeline. The fully automated pipeline has a region detection module that finds the location of text blocks and table blocks regardless of the format of the document and a region extraction module that extracts information from each of the text and table blocks. The semi automated pipeline on the other hand has a classification module and an extraction module. The classification module determines the format class of the input document, while the extraction module has templates that can parse information from the documents in each format class. We evaluate the two pipelines in four key metrics: accuracy, coverage, time efficiency, and scalability. The fully automated pipeline shows a strong result in coverage and scalability, while the semi automated pipeline succeeds in accuracy and time efficiency.
|