Ten pairs to tag - Multilingual POS tagging via coarse mapping between embeddings

In the absence of annotations in the target language, multilingual models typically draw on extensive parallel resources. In this paper, we demonstrate that accurate multilingual part-of-speech (POS) tagging can be done with just a few (e.g., ten) word translation pairs. We use the translation pairs...

Bibliographic Details
Main Authors: Gaddy, David M., Zhang, Yuan, Barzilay, Regina, Jaakkola, Tommi S.
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format: Article
Language: en_US
Published: Association for Computational Linguistics 2017
Online Access: http://hdl.handle.net/1721.1/110739
https://orcid.org/0000-0003-3121-0185
https://orcid.org/0000-0002-2921-8201
https://orcid.org/0000-0002-2199-0379
Description: In the absence of annotations in the target language, multilingual models typically draw on extensive parallel resources. In this paper, we demonstrate that accurate multilingual part-of-speech (POS) tagging can be done with just a few (e.g., ten) word translation pairs. We use the translation pairs to establish a coarse linear isometric (orthonormal) mapping between monolingual embeddings. This enables the supervised source model expressed in terms of embeddings to be used directly on the target language. We further refine the model in an unsupervised manner by initializing and regularizing it to be close to the direct transfer model. Averaged across six languages, our model yields a 37.5% absolute improvement over the monolingual prototype-driven method (Haghighi and Klein, 2006) when using a comparable amount of supervision. Moreover, to highlight key linguistic characteristics of the generated tags, we use them to predict typological properties of languages, obtaining a 50% error reduction relative to the prototype model.
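The mapping step in the description can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the orthonormal mapping is fit with the standard orthogonal Procrustes solution (an SVD of the cross-covariance of the paired embeddings), and the names `fit_orthonormal_map`, `src_vecs`, `tgt_vecs`, and `tag_weights` are hypothetical. The unsupervised refinement step is not shown.

```python
import numpy as np


def fit_orthonormal_map(src_vecs, tgt_vecs):
    """Fit an orthonormal W that minimizes ||tgt_vecs @ W - src_vecs||_F.

    Row i of src_vecs and row i of tgt_vecs are the embeddings of one
    translation pair. This is the classic orthogonal Procrustes solution,
    used here as an assumed stand-in for the paper's mapping step.
    """
    # Cross-covariance between paired target and source embeddings.
    cross = tgt_vecs.T @ src_vecs
    u, _, vt = np.linalg.svd(cross)
    # Dropping the singular values yields the closest orthonormal matrix.
    return u @ vt


def direct_transfer_tags(tgt_word_vecs, mapping, tag_weights):
    """Tag target words with a source-side linear scorer (hypothetical).

    tag_weights has shape (dim, n_tags) and was trained on source
    embeddings; target embeddings are first mapped into the source space.
    """
    scores = (tgt_word_vecs @ mapping) @ tag_weights
    return scores.argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_pairs = 5, 10
    src = rng.normal(size=(n_pairs, dim))
    # Synthetic target space: the source space under a random rotation.
    rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    tgt = src @ rotation.T
    w = fit_orthonormal_map(src, tgt)
    print("max reconstruction error:", np.abs(tgt @ w - src).max())
```

In this toy setting the ten pairs span the whole five-dimensional space, so the rotation is recovered exactly. When the embedding dimension exceeds the number of pairs, the Procrustes solution is no longer unique, which is one reason to treat such a mapping as coarse.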
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Citation: Zhang, Yuan et al. "Ten Pairs to Tag - Multilingual POS Tagging via Coarse Mapping between Embeddings." 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, 12-17 June, 2016. Association for Computational Linguistics, 2016.
ISBN: 978-1-941643-91-4
Type: http://purl.org/eprint/type/ConferencePaper
Dates: 2017-07-17T18:13:57Z, 2016-06
Related: http://dblp.dagstuhl.de/db/conf/naacl/naacl2016.html
License: Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/
File format: application/pdf
Source: MIT Web Domain