Improving parallel data identification using iteratively refined sentence alignments and bilingual mappings of pre-trained language models

From National Research Council Canada

Author	Lo, Chi-Kiu¹; Joanis, Eric¹
Affiliation	National Research Council of Canada. Digital Technologies
Format	Text, Article
Conference	The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 19-20, 2020 [Held Online]
Abstract	The National Research Council of Canada’s team submissions to the parallel corpus filtering task at the Fifth Conference on Machine Translation are based on two key components: (1) iteratively refined statistical sentence alignments for extracting sentence pairs from document pairs and (2) a crosslingual semantic textual similarity metric based on a pretrained multilingual language model, XLM-RoBERTa, with bilingual mappings learnt from a minimal amount of clean parallel data for scoring the parallelism of the extracted sentence pairs. The translation quality of the neural machine translation systems trained and fine-tuned on the parallel data extracted by our submissions improved significantly when compared to the organizers’ LASER-based baseline, a sentence-embedding method that worked well last year. For re-aligning the sentences in the document pairs (component 1), our statistical approach has outperformed the current state-of-the-art neural approach in this low-resource context.
Publication date	2020-11-19
Date created	2020-11-30
Publisher	Association for Computational Linguistics
In	5th Conference on Machine Translation (WMT): 970–976.
Language	English
Peer reviewed	Yes
Record identifier	520de843-ed04-446c-83cc-d121adf6aa90
Record created	2020-11-30
Record modified	2020-11-30

Date modified: 2024-07-22