Measuring sentence parallelism using Mahalanobis distances: the NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task

DOI	https://doi.org/10.18653/v1/W18-6481
Author	Littell, Patrick¹; Larkin, Samuel¹; Stewart, Darlene¹; Simard, Michel¹; Goutte, Cyril¹; Lo, Chi-Kiu¹
Affiliation	National Research Council Canada. Digital Technologies
Format	Text, Article
Conference	The Third Conference on Machine Translation (WMT 18), Oct. 31 - Nov. 1, 2018, Brussels, Belgium
Abstract	The WMT18 shared task on parallel corpus filtering (Koehn et al., 2018b) challenged teams to score sentence pairs from a large, high-recall, low-precision web-scraped parallel corpus (Koehn et al., 2018a). Participants could use existing sample corpora (e.g. past WMT data) as a supervisory signal to learn what a “clean” corpus looks like. However, in lower-resource situations it often happens that the target corpus of the language is the only sample of parallel text in that language. We therefore made several unsupervised entries, setting ourselves an additional constraint that we not utilize the additional clean parallel corpora. One such entry fairly consistently scored in the top ten systems in the 100M-word conditions, and for one task, translating the European Medicines Agency corpus (Tiedemann, 2009), scored among the best systems even in the 10M-word conditions.
Publication date	2018-11-01
Publisher	Association for Computational Linguistics (ACL)
In	Proceedings of the Third Conference on Machine Translation: Shared Task Papers 2: 913–920.
Language	English
Peer reviewed	Yes
Record identifier	c7a0017b-bde5-4154-90be-93d0a454b094
Record created	2019-04-08
Record modified	2020-05-30

Date modified: 2026-04-20