Learning machine translation from in-domain and out-of-domain data

From National Research Council of Canada

Author	Turchi, Marco; Goutte, Cyril¹; Cristianini, Nello
Affiliation	National Research Council of Canada. Information and Communication Technologies
Format	Text, Article
Conference	16th Annual Conference of the European Association for Machine Translation (EAMT), 28-30 May 2012, Trento, Italy
Abstract	The performance of Phrase-Based Statistical Machine Translation (PBSMT) systems mostly depends on training data. Many papers have investigated how to create new resources in order to increase the size of the training corpus in an attempt to improve PBSMT performance. In this work, we analyse and characterize the way in which the in-domain and out-of-domain performance of PBSMT is impacted when the amount of training data increases. Two different PBSMT systems, Moses and Portage, two of the largest parallel corpora, the Giga (French-English) and UN (Chinese-English) datasets, and several in- and out-of-domain test sets were used to build high quality learning curves showing consistent logarithmic growth in performance. These results are stable across language pairs, PBSMT systems and domains. We also analyse the respective impact of additional training data for estimating the language and translation models. Our proposed model approximates learning curves very well and indicates the translation model contributes about 30% more to the performance gain than the language model.
Publication date	2012-05
In	Proceedings of the 16th Annual Conference of the European Association for Machine Translation (May 2012): 305–312.
Language	English
Peer reviewed	Yes
NPARC number	21268098
Record identifier	b9e5bc8b-b13a-4964-96b6-ac4a29c17a65
Record created	2013-04-09
Record modified	2020-04-21

Date modified: 2024-07-23