Data Extraction via Semantic Regular Expression Synthesis

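Many data extraction tasks of practical relevance require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. While regular expressions are very well suited for tasks that require only syntactic pattern matching, they fall short for data extraction tasks that involve both a syntactic and semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that can synthesize semantic regexes from a small number of positive and negative examples. Our proposed learning algorithm uses a combination of neural sketch generation and compositional type-directed synthesis for fast and effective generalization from a small number of examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes can better support complex data extraction tasks than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools.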

Bibliographic Details
Main Authors: Chen, Qiaochu, Banerjee, Arko, Demiralp, Çağatay, Durrett, Greg, Dillig, Işıl
Format: Article
Language: English
Published: ACM 2023
Online Access: https://hdl.handle.net/1721.1/152906
author Chen, Qiaochu
Banerjee, Arko
Demiralp, Çağatay
Durrett, Greg
Dillig, Işıl
author_facet Chen, Qiaochu
Banerjee, Arko
Demiralp, Çağatay
Durrett, Greg
Dillig, Işıl
author_sort Chen, Qiaochu
collection MIT
description Many data extraction tasks of practical relevance require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. While regular expressions are very well suited for tasks that require only syntactic pattern matching, they fall short for data extraction tasks that involve both a syntactic and semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that can synthesize semantic regexes from a small number of positive and negative examples. Our proposed learning algorithm uses a combination of neural sketch generation and compositional type-directed synthesis for fast and effective generalization from a small number of examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes can better support complex data extraction tasks than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools.
first_indexed 2024-09-23T09:05:19Z
format Article
id mit-1721.1/152906
institution Massachusetts Institute of Technology
language English
last_indexed 2024-09-23T09:05:19Z
publishDate 2023
publisher ACM
record_format dspace
spelling mit-1721.1/152906 2023-11-04T03:49:26Z Data Extraction via Semantic Regular Expression Synthesis Chen, Qiaochu Banerjee, Arko Demiralp, Çağatay Durrett, Greg Dillig, Işıl Many data extraction tasks of practical relevance require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. While regular expressions are very well suited for tasks that require only syntactic pattern matching, they fall short for data extraction tasks that involve both a syntactic and semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that can synthesize semantic regexes from a small number of positive and negative examples. Our proposed learning algorithm uses a combination of neural sketch generation and compositional type-directed synthesis for fast and effective generalization from a small number of examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes can better support complex data extraction tasks than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools. 2023-11-03T20:26:18Z 2023-11-03T20:26:18Z 2023-10-16 2023-11-01T07:57:57Z Article http://purl.org/eprint/type/JournalArticle 2475-1421 https://hdl.handle.net/1721.1/152906 Chen, Qiaochu, Banerjee, Arko, Demiralp, Çağatay, Durrett, Greg and Dillig, Işıl. 2023. "Data Extraction via Semantic Regular Expression Synthesis." Proceedings of the ACM on Programming Languages, 7 (OOPSLA2).
PUBLISHER_CC en https://doi.org/10.1145/3622863 Proceedings of the ACM on Programming Languages Creative Commons Attribution https://creativecommons.org/licenses/by/4.0/ The author(s) application/pdf ACM Association for Computing Machinery
spellingShingle Chen, Qiaochu
Banerjee, Arko
Demiralp, Çağatay
Durrett, Greg
Dillig, Işıl
Data Extraction via Semantic Regular Expression Synthesis
title Data Extraction via Semantic Regular Expression Synthesis
title_full Data Extraction via Semantic Regular Expression Synthesis
title_fullStr Data Extraction via Semantic Regular Expression Synthesis
title_full_unstemmed Data Extraction via Semantic Regular Expression Synthesis
title_short Data Extraction via Semantic Regular Expression Synthesis
title_sort data extraction via semantic regular expression synthesis
url https://hdl.handle.net/1721.1/152906