Fiscal data in text: Information extraction from audit reports using Natural Language Processing
Supreme audit institutions (SAIs) are touted as an integral component to anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies, missin...
Main Author: | Alejandro Beltran |
---|---|
Format: | Article |
Language: | English |
Published: | Cambridge University Press, 2023-01-01 |
Series: | Data & Policy |
Subjects: | auditing; corruption; natural language processing; subnational governments; text-as-data |
Online Access: | https://www.cambridge.org/core/product/identifier/S2632324923000044/type/journal_article |
_version_ | 1797857760440745984 |
---|---|
author | Alejandro Beltran |
author_facet | Alejandro Beltran |
author_sort | Alejandro Beltran |
collection | DOAJ |
description | Supreme audit institutions (SAIs) are touted as an integral component of anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies and missing resources, and may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending, potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available. |
first_indexed | 2024-04-09T21:01:59Z |
format | Article |
id | doaj.art-955c6ae7c2284d69b8b3f62d96cd597d |
institution | Directory Open Access Journal |
issn | 2632-3249 |
language | English |
last_indexed | 2024-04-09T21:01:59Z |
publishDate | 2023-01-01 |
publisher | Cambridge University Press |
record_format | Article |
series | Data & Policy |
spelling | doaj.art-955c6ae7c2284d69b8b3f62d96cd597d; 2023-03-29T08:35:22Z; eng; Cambridge University Press; Data & Policy; 2632-3249; 2023-01-01; 5; 10.1017/dap.2023.4; Fiscal data in text: Information extraction from audit reports using Natural Language Processing; Alejandro Beltran (https://orcid.org/0000-0001-9544-7355; The Alan Turing Institute, London, United Kingdom); https://www.cambridge.org/core/product/identifier/S2632324923000044/type/journal_article; auditing; corruption; natural language processing; subnational governments; text-as-data |
spellingShingle | Alejandro Beltran Fiscal data in text: Information extraction from audit reports using Natural Language Processing Data & Policy auditing corruption natural language processing subnational governments text-as-data |
title | Fiscal data in text: Information extraction from audit reports using Natural Language Processing |
title_full | Fiscal data in text: Information extraction from audit reports using Natural Language Processing |
title_fullStr | Fiscal data in text: Information extraction from audit reports using Natural Language Processing |
title_full_unstemmed | Fiscal data in text: Information extraction from audit reports using Natural Language Processing |
title_short | Fiscal data in text: Information extraction from audit reports using Natural Language Processing |
title_sort | fiscal data in text information extraction from audit reports using natural language processing |
topic | auditing; corruption; natural language processing; subnational governments; text-as-data |
url | https://www.cambridge.org/core/product/identifier/S2632324923000044/type/journal_article |
work_keys_str_mv | AT alejandrobeltran fiscaldataintextinformationextractionfromauditreportsusingnaturallanguageprocessing |
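
The description in this record outlines a pipeline that filters audit-report paragraphs for relevance and then extracts the monetary values they mention. The sketch below illustrates only the shape of those two stages, and it is not the author's implementation: a keyword filter stands in for the trained classification model, a regular expression stands in for the named entity recognizer, and the OCR step is omitted. The keyword list, the amount pattern, and all function names are hypothetical choices made for illustration.

```python
import re
from decimal import Decimal
from typing import List

# Hypothetical keywords standing in for the trained paragraph classifier.
RELEVANT_KEYWORDS = ("observación", "irregularidad", "daño", "reintegro")

# Assumed amount format: "$1,234,567.89", "$ 2500.50", or "$300".
# This regex stands in for the named entity recognizer described in the record.
AMOUNT_PATTERN = re.compile(
    r"\$\s?((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{1,2})?)"
)


def is_relevant(paragraph: str) -> bool:
    """Return True if the paragraph contains any of the assumed keywords."""
    lowered = paragraph.lower()
    return any(keyword in lowered for keyword in RELEVANT_KEYWORDS)


def extract_amounts(paragraph: str) -> List[Decimal]:
    """Return every dollar-sign amount in the paragraph as a Decimal."""
    return [
        Decimal(match.replace(",", ""))
        for match in AMOUNT_PATTERN.findall(paragraph)
    ]


def run_pipeline(paragraphs: List[str]) -> List[Decimal]:
    """Keep relevant paragraphs, then collect the amounts they mention."""
    amounts: List[Decimal] = []
    for paragraph in paragraphs:
        if is_relevant(paragraph):
            amounts.extend(extract_amounts(paragraph))
    return amounts
```

Calling `run_pipeline` on a list of OCR-produced paragraph strings returns the amounts found in the paragraphs that pass the filter. Amounts are held as `Decimal` rather than `float` so that currency values are not subject to binary rounding.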