AdaDS: Adaptive data selection for accelerating pre-trained language model knowledge distillation

Knowledge distillation (KD) is a widely used method for transferring knowledge from large teacher models to computationally efficient student models. Unfortunately, the computational cost of KD becomes unaffordable as pre-trained language models (PLMs) grow larger. Computing KD loss on only part of...
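To make the idea in the abstract concrete, here is a minimal sketch of computing a soft-label KD loss on only a selected subset of a batch. This is an illustration under common assumptions (PyTorch, Hinton-style temperature-scaled KL divergence), not the paper's AdaDS implementation; the function name, the `subset_idx` argument, and the random selector in the usage example are hypothetical stand-ins for the adaptive selection policy the paper proposes.

```python
import torch
import torch.nn.functional as F

def kd_loss_on_subset(student_logits, teacher_logits, subset_idx, temperature=2.0):
    """Soft-label KD loss computed only on a selected subset of the batch.

    `subset_idx` is a 1-D LongTensor of row indices; choosing those indices
    is where an adaptive data-selection policy (as in AdaDS) would plug in.
    The policy itself is not shown here.
    """
    s = student_logits[subset_idx] / temperature
    t = teacher_logits[subset_idx] / temperature
    # KL divergence between softened teacher and student distributions,
    # scaled by T^2 as is standard for temperature-based distillation.
    return F.kl_div(F.log_softmax(s, dim=-1),
                    F.softmax(t, dim=-1),
                    reduction="batchmean") * temperature ** 2

# Usage example: distill on a random half of the batch
# (a placeholder for an adaptive selector).
student_logits = torch.randn(32, 10)
teacher_logits = torch.randn(32, 10)
subset_idx = torch.randperm(32)[:16]
loss = kd_loss_on_subset(student_logits, teacher_logits, subset_idx)
```

Since the KD loss requires a teacher forward pass and a KL term per example, restricting it to a subset of the data directly reduces the per-step cost, which is the saving the abstract alludes to.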


Bibliographic Details
Main Authors: Qinhong Zhou, Peng Li, Yang Liu, Yuyang Guan, Qizhou Xing, Ming Chen, Maosong Sun
Format: Article
Language: English
Published: KeAi Communications Co. Ltd. 2023-01-01
Series: AI Open
Online Access: http://www.sciencedirect.com/science/article/pii/S2666651023000074