Fiscal data in text: Information extraction from audit reports using Natural Language Processing

Supreme audit institutions (SAIs) are touted as an integral component of anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies, missin...

Bibliographic Details
Main Author:	Alejandro Beltran
Format:	Article
Language:	English
Published:	Cambridge University Press 2023-01-01
Series:	Data & Policy
Subjects:	auditing; corruption; natural language processing; subnational governments; text-as-data
Online Access:	https://www.cambridge.org/core/product/identifier/S2632324923000044/type/journal_article

_version_	1797857760440745984
author	Alejandro Beltran
author_facet	Alejandro Beltran
author_sort	Alejandro Beltran
collection	DOAJ
description	Supreme audit institutions (SAIs) are touted as an integral component of anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies and missing resources, and may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending, potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available.
first_indexed	2024-04-09T21:01:59Z
format	Article
id	doaj.art-955c6ae7c2284d69b8b3f62d96cd597d
institution	Directory Open Access Journal
issn	2632-3249
language	English
last_indexed	2024-04-09T21:01:59Z
publishDate	2023-01-01
publisher	Cambridge University Press
record_format	Article
series	Data & Policy
spelling	doaj.art-955c6ae7c2284d69b8b3f62d96cd597d | 2023-03-29T08:35:22Z | eng | Cambridge University Press | Data & Policy | 2632-3249 | 2023-01-01 | 5 | 10.1017/dap.2023.4 | Fiscal data in text: Information extraction from audit reports using Natural Language Processing | Alejandro Beltran | https://orcid.org/0000-0001-9544-7355 | The Alan Turing Institute, London, United Kingdom | https://www.cambridge.org/core/product/identifier/S2632324923000044/type/journal_article | auditing | corruption | natural language processing | subnational governments | text-as-data
spellingShingle	Alejandro Beltran Fiscal data in text: Information extraction from audit reports using Natural Language Processing Data & Policy auditing corruption natural language processing subnational governments text-as-data
title	Fiscal data in text: Information extraction from audit reports using Natural Language Processing
title_full	Fiscal data in text: Information extraction from audit reports using Natural Language Processing
title_fullStr	Fiscal data in text: Information extraction from audit reports using Natural Language Processing
title_full_unstemmed	Fiscal data in text: Information extraction from audit reports using Natural Language Processing
title_short	Fiscal data in text: Information extraction from audit reports using Natural Language Processing
title_sort	fiscal data in text information extraction from audit reports using natural language processing
topic	auditing; corruption; natural language processing; subnational governments; text-as-data
url	https://www.cambridge.org/core/product/identifier/S2632324923000044/type/journal_article
work_keys_str_mv	AT alejandrobeltran fiscaldataintextinformationextractionfromauditreportsusingnaturallanguageprocessing

Similar Items