Towards Deployable Robust Text Classifiers

Bibliographic Details
Main Author: Xu, Lei
Other Authors: Veeramachaneni, Kalyan
Department: Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
Degree: Ph.D.
Format: Thesis
Published: Massachusetts Institute of Technology, 2023
Online Access: https://hdl.handle.net/1721.1/150071
Rights: In Copyright - Educational Use Permitted; Copyright MIT (http://rightsstatements.org/page/InC-EDU/1.0/)

Description

Text classification has been studied for decades as a fundamental task in natural language processing. Deploying classifiers enables more efficient information processing, which is useful for various applications, including decision-making. However, classifiers also present challenging and long-standing problems. As their use increases, expectations about their robustness, fairness, accuracy, and other qualities rise in turn. In this dissertation, we aim to develop more deployable and robust text classifiers, focusing on improving classifier robustness against adversarial attacks by developing both attack and defense approaches.

Adversarial attacks are a security concern for text classifiers: a malicious user takes a sentence and perturbs it slightly to manipulate the classifier's output. To design more effective attack methods, we focus first on improving adversarial sentence quality. Unlike existing methods, which prioritize misclassification and ignore sentence similarity and fluency, we synthesize these three criteria into a combined critique score. We then outline a rewrite-and-rollback framework that optimizes this score, achieving state-of-the-art attack success rates while improving similarity and fluency. We focus second on computational requirements. Existing methods typically use combinatorial search to find adversarial examples that alter multiple words, which is inefficient and requires many queries to the classifier. We overcome this problem by proposing a single-word adversarial perturbation attack, which replaces a single word in the original sentence with a high-adversarial-capacity word, significantly improving efficiency while maintaining an attack success rate similar to that of existing methods.

We then turn to defense. Currently, the most common approach for defending against attacks is to train classifiers using adversarial examples as data augmentation, a method limited by the inefficiency of many attack methods. We show that training classifiers with data augmentation generated by our efficient single-word perturbation attack can improve their robustness against other attack methods. We also design in situ data augmentation to counteract adversarial perturbations in the classifier input: we use the gradient norm to identify keywords for classification and a pre-trained language model to replace them. This in situ augmentation effectively improves robustness and does not require tuning the classifier.

Finally, we explore the vulnerability of a very recent text classification architecture, prompt-based classifiers, and find that they are vulnerable to attacks as well. We also develop a library called Fibber to facilitate adversarial robustness research.
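
To make the combined critique score concrete, the sketch below shows one plausible way to fold misclassification, similarity, and fluency into a single number. It is an illustrative reconstruction, not the formulation from the thesis: the component functions (classify, similarity, log_perplexity) and the weights are assumptions.

```python
# Illustrative sketch of a combined critique score for an adversarial
# candidate. The component functions and weights are assumptions, not
# the thesis's actual formulation.

def critique_score(original, candidate, true_label,
                   classify, similarity, log_perplexity,
                   w_adv=1.0, w_sim=1.0, w_flu=0.1):
    """Score a candidate adversarial sentence; higher is better for the attacker."""
    probs = classify(candidate)            # class-probability vector
    adv = 1.0 - probs[true_label]          # misclassification pressure
    sim = similarity(original, candidate)  # e.g., sentence-embedding cosine
    flu = -log_perplexity(candidate)       # fluency under a language model
    return w_adv * adv + w_sim * sim + w_flu * flu
```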
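
The single-word perturbation attack admits an equally small sketch. The version below is a naive brute-force rendering under assumptions: predict is a black-box classifier returning a label, and candidate_words stands in for the thesis's list of high-adversarial-capacity words.

```python
# Naive sketch of a single-word substitution attack: try each position
# with each candidate word and keep the first substitution that flips
# the classifier's prediction. `predict` and `candidate_words` are
# assumptions for illustration.

def single_word_attack(sentence, true_label, predict, candidate_words):
    words = sentence.split()
    for i in range(len(words)):
        for cand in candidate_words:
            perturbed = " ".join(words[:i] + [cand] + words[i + 1:])
            if predict(perturbed) != true_label:  # prediction flipped
                return perturbed                  # adversarial example found
    return None                                   # attack failed
```

Each candidate substitution costs one classifier query, so this sketch's query budget is at most len(words) × len(candidate_words), which illustrates why restricting the perturbation to a single word keeps the attack cheap.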
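
The in situ augmentation step (gradient-norm keyword selection followed by language-model replacement) can be sketched with standard PyTorch and Hugging Face components. The model names, the single-token replacement policy, and the choice of backpropagated quantity are all assumptions; this is a minimal sketch of the idea, not the thesis's implementation.

```python
# Minimal sketch of in situ augmentation: (1) rank tokens by the norm of
# the gradient of the top-class logit w.r.t. their embeddings, (2) mask
# the top-ranked token and let a masked language model fill it in.
# Model names and the one-token policy are illustrative assumptions.
import torch
from transformers import (AutoTokenizer,
                          AutoModelForSequenceClassification, pipeline)

clf_name = "textattack/bert-base-uncased-SST-2"  # assumed example classifier
tokenizer = AutoTokenizer.from_pretrained(clf_name)
classifier = AutoModelForSequenceClassification.from_pretrained(clf_name)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def in_situ_augment(sentence):
    enc = tokenizer(sentence, return_tensors="pt")
    embeds = classifier.get_input_embeddings()(enc["input_ids"])
    embeds.retain_grad()
    logits = classifier(inputs_embeds=embeds,
                        attention_mask=enc["attention_mask"]).logits
    logits.max().backward()                      # gradient of top-class logit
    grad_norm = embeds.grad.norm(dim=-1).squeeze(0)
    grad_norm[0] = grad_norm[-1] = 0.0           # ignore [CLS] and [SEP]
    keyword = int(grad_norm.argmax())            # most influential token
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    tokens[keyword] = tokenizer.mask_token
    masked = tokenizer.convert_tokens_to_string(tokens[1:-1])
    return fill_mask(masked)[0]["sequence"]      # LM's top replacement
```

Because the replacement happens on the input at inference time, the classifier itself is never retrained, which matches the abstract's point that in situ augmentation does not require tuning the classifier.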