Summary: | Conventional paraphrase identification (PI) models based on deep learning usually focus on text representation and ignore the mining and matching of multi-granular interaction features. In addition, supervised learning relies on large amounts of labeled data, yet the labeled training sets available for PI are small relative to the high complexity of the task. To address these problems, we propose a semi-supervised deep learning framework for PI. We use a neural encoder with a word-by-word attention mechanism to reason about equivalence or contradiction over pairs of words, phrases, and sentences. We employ a two-stage training procedure. First, we use a language modeling objective to learn the initial parameters on unlabeled corpora of more than one million sentence pairs. This is followed by supervised training, in which we adapt these parameters to a specific classification task with labeled data. Experimental results on the MRPC (Microsoft Research Paraphrase Corpus) and SICK (Sentences Involving Compositional Knowledge) datasets demonstrate the effectiveness of our approach. Compared with previous neural network models, we achieve absolute improvements of 7.6% in accuracy and 5.4% in F1 on MRPC, and of 4.5% in Pearson's r and 5.1% in Spearman's ρ on SICK.
|
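The summary describes a two-stage procedure: language-model pretraining on unlabeled text, followed by supervised fine-tuning on labeled paraphrase pairs. Below is a minimal PyTorch sketch of that procedure only, under simplifying assumptions: a plain LSTM stands in for the paper's word-by-word attention encoder, and random toy tensors stand in for the unlabeled corpora and the MRPC/SICK-style labeled pairs; it is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy data standing in for the unlabeled corpora (stage 1) and the labeled
# paraphrase pairs (stage 2); shapes are (batch, seq_len) of token ids.
vocab_size, dim, seq_len = 1000, 64, 12
unlabeled_batches = [torch.randint(0, vocab_size, (8, seq_len)) for _ in range(3)]
labeled_batches = [
    (torch.randint(0, vocab_size, (8, seq_len)),   # sentence 1
     torch.randint(0, vocab_size, (8, seq_len)),   # sentence 2
     torch.randint(0, 2, (8,)))                    # paraphrase label
    for _ in range(3)
]

class Encoder(nn.Module):
    """Stand-in sentence encoder (the paper uses word-by-word attention instead)."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return h  # (batch, seq_len, dim) hidden states

encoder = Encoder(vocab_size, dim)

# Stage 1: learn initial encoder parameters with a next-token
# language-modeling objective on unlabeled text.
lm_head = nn.Linear(dim, vocab_size)
opt = torch.optim.Adam(list(encoder.parameters()) + list(lm_head.parameters()), lr=1e-3)
for tokens in unlabeled_batches:
    logits = lm_head(encoder(tokens[:, :-1]))
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: adapt the pretrained encoder to paraphrase classification
# with a supervised objective on the labeled sentence pairs.
clf_head = nn.Linear(2 * dim, 2)  # paraphrase vs. non-paraphrase
opt = torch.optim.Adam(list(encoder.parameters()) + list(clf_head.parameters()), lr=1e-3)
for s1, s2, label in labeled_batches:
    pair = torch.cat([encoder(s1)[:, -1], encoder(s2)[:, -1]], dim=-1)
    loss = F.cross_entropy(clf_head(pair), label)
    opt.zero_grad(); loss.backward(); opt.step()
```

The key design point illustrated is that the encoder parameters are shared across both stages: only the task head changes, so the supervised stage starts from the representations learned on the much larger unlabeled corpora.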