Development of an End-to-End Pipeline for Custom Key-Value Extraction from Commercial Invoices

Bibliographic Details
Main Author: Mohan, Abhishek
Other Authors: Gupta, Amar
Format: Thesis
Published: Massachusetts Institute of Technology 2023
Online Access: https://hdl.handle.net/1721.1/150308
Description
Summary: Inefficiencies in the manual extraction of information from business documents have driven the development of automated processing solutions. Among business documents, commercial invoices present additional complexities due to the diversity of document layouts and the variation in quality of scanned documents. Commercially available solutions have been built to perform invoice extraction, yet they do not provide the flexibility needed to accomplish tasks unique to a particular dataset and its associated complications. Using sample documents provided by a leading electronic component distributor, we researched different approaches capable of extracting key-value information from a complex dataset of invoices. The thesis provides a detailed look into the development of a highly accurate, end-to-end data pipeline accomplishing this task. A multi-module approach integrating image processing, optical character recognition, custom algorithms, and machine learning-based matching was built and compartmentalized into continuous stages, allowing for effective and efficient extraction of key-value information from invoice documents. In conjunction with an intuitive web interface, the custom pipeline provides a solution with strong performance and the flexibility to be generalized to the extraction of information from additional business documents in future efforts.