Fiscal data in text: Information extraction from audit reports using Natural Language Processing

Supreme audit institutions (SAIs) are touted as an integral component of anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies, missin...

Bibliographic Details
Main Author: Alejandro Beltran
Format: Article
Language: English
Published: Cambridge University Press 2023-01-01
Series: Data & Policy
Subjects: auditing; corruption; natural language processing; subnational governments; text-as-data
Online Access: https://www.cambridge.org/core/product/identifier/S2632324923000044/type/journal_article
author Alejandro Beltran
collection DOAJ
description Supreme audit institutions (SAIs) are touted as an integral component of anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies and missing resources, and may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending, potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available.
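The last two stages of the pipeline described in the abstract above (selecting relevant paragraphs, then extracting monetary values) can be illustrated with a minimal Python sketch. This is not the author's implementation: the paper trains a classification model and a named entity recognizer, whereas the sketch below substitutes a keyword filter and a regular expression so that it runs with the standard library alone. The cue words, the function names, and the assumption that amounts are written as "$1,234.56" are all hypothetical and chosen only for illustration; the input is assumed to be text that has already passed through optical character recognition.

```python
import re
from decimal import Decimal

# Hypothetical cue words; a stand-in for the trained paragraph classifier.
CUE_WORDS = ("observación", "irregularidad", "faltante", "discrepancia")

# Assumed amount format: "$1,234,567.89" or "$ 950.00".
# A regex stand-in for the named entity recognizer.
AMOUNT_RE = re.compile(r"\$\s?(\d+(?:,\d{3})*(?:\.\d+)?)")


def is_relevant(paragraph: str) -> bool:
    """Flag a paragraph as relevant if it contains any cue word."""
    text = paragraph.lower()
    return any(cue in text for cue in CUE_WORDS)


def extract_amounts(paragraph: str) -> list[Decimal]:
    """Return every monetary amount in the paragraph as a Decimal."""
    return [Decimal(m.replace(",", "")) for m in AMOUNT_RE.findall(paragraph)]


def discrepancies(paragraphs: list[str]) -> list[Decimal]:
    """Collect amounts from relevant paragraphs only, in document order."""
    amounts: list[Decimal] = []
    for paragraph in paragraphs:
        if is_relevant(paragraph):
            amounts.extend(extract_amounts(paragraph))
    return amounts
```

Calling `discrepancies` on a list of OCR'd paragraphs returns the amounts found in the flagged paragraphs; amounts in paragraphs without a cue word (for example, a sentence that only states the approved budget) are skipped. `Decimal` is used rather than `float` so that currency values are not subject to binary rounding.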
id doaj.art-955c6ae7c2284d69b8b3f62d96cd597d
institution Directory of Open Access Journals
issn 2632-3249
spelling doaj.art-955c6ae7c2284d69b8b3f62d96cd597d; 2023-03-29T08:35:22Z; eng; Cambridge University Press; Data & Policy; 2632-3249; 2023-01-01; 5; 10.1017/dap.2023.4; Fiscal data in text: Information extraction from audit reports using Natural Language Processing; Alejandro Beltran; https://orcid.org/0000-0001-9544-7355; The Alan Turing Institute, London, United Kingdom; https://www.cambridge.org/core/product/identifier/S2632324923000044/type/journal_article; auditing; corruption; natural language processing; subnational governments; text-as-data
title Fiscal data in text: Information extraction from audit reports using Natural Language Processing
topic auditing
corruption
natural language processing
subnational governments
text-as-data