AdaDS: Adaptive data selection for accelerating pre-trained language model knowledge distillation

Knowledge distillation (KD) is a widely used method for transferring knowledge from large teacher models to computationally efficient student models. Unfortunately, the computational cost of KD becomes unaffordable as pre-trained language models (PLMs) grow larger. Computing KD loss on only part of...

Full description

Bibliographic Details
Main Authors: Qinhong Zhou, Peng Li, Yang Liu, Yuyang Guan, Qizhou Xing, Ming Chen, Maosong Sun
Format: Article
Language:English
Published: KeAi Communications Co. Ltd. 2023-01-01
Series:AI Open
Subjects:
Online Access:http://www.sciencedirect.com/science/article/pii/S2666651023000074
Description
Summary:Knowledge distillation (KD) is a widely used method for transferring knowledge from large teacher models to computationally efficient student models. Unfortunately, the computational cost of KD becomes unaffordable as pre-trained language models (PLMs) grow larger. Computing KD loss on only part of the training set is a promising way to accelerate KD. However, existing works heuristically leverage only one static data selection strategy during the KD process, demonstrating inconsistent improvements across different distillation scenarios. In this work, we conduct a thorough study on various typical data selection strategies for KD, and show that this problem is due to the fact that the best data selection strategy is specific to various factors, including task, selected data size, and training stage. To automatically adapt to these factors, we propose a framework named AdaDS to learn to choose the data selection strategy adaptively during the KD process. Experimental results show that our proposed method is effective for various tasks and selected data sizes under both fine-tuning and pre-training stages, achieving comparable performance to DistilBERT with only 10% amount of queries to the teacher model.
ISSN:2666-6510