Summary: | Abstract Recent advances in deep neural networks have achieved outstanding success in natural language processing tasks. Interpretation methods that provide insight into the decision-making process of these models have received an influx of research attention because of the success and the black-box nature of the deep text classification models. Evaluation of these methods has been based on changes in classification accuracy or prediction confidence when removing important words identified by these methods. There are no measurements of the actual difference between the predicted important words and humans’ interpretation of ground truth because of the lack of interpretation ground truth. A large publicly available interpretation ground truth has the potential to advance the development of interpretation methods. Manual labeling important words for each document to create a large interpretation ground truth is very time-consuming. This paper presents (1) IDC, a new benchmark for quantitative evaluation of interpretation methods for deep text classification models, and (2) evaluation of six interpretation methods using the benchmark. The IDC benchmark consists of: (1) Three methods that generate three pseudo-interpretation ground truth datasets. (2) Three performance metrics: interpretation recall, interpretation precision, and Cohen’s kappa inter-agreement. Findings: IDC-generated interpretation ground truth agrees with human annotators on sampled movie reviews. IDC identifies Layer-wise Relevance Propagation and the gradient-by-input methods as the winning interpretation methods in this study.
|