Download | View accepted manuscript: Improving parallel data identification using iteratively refined sentence alignments and bilingual mappings of pre-trained language models (PDF, 382 KiB) |
---|
Author | Lo, Chi-Kiu; Joanis, Eric |
---|
Affiliation | National Research Council of Canada. Digital Technologies |
---|
Format | Text, Article |
---|
Conference | The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 19-20, 2020 [Held Online] |
---|
Abstract | The National Research Council of Canada’s team submissions to the parallel corpus filtering task at the Fifth Conference on Machine Translation are based on two key components: (1) iteratively refined statistical sentence alignments for extracting sentence pairs from document pairs and (2) a cross-lingual semantic textual similarity metric based on a pretrained multilingual language model, XLM-RoBERTa, with bilingual mappings learnt from a minimal amount of clean parallel data for scoring the parallelism of the extracted sentence pairs. The translation quality of the neural machine translation systems trained and fine-tuned on the parallel data extracted by our submissions improved significantly when compared to the organizers’ LASER-based baseline, a sentence-embedding method that worked well last year. For re-aligning the sentences in the document pairs (component 1), our statistical approach has outperformed the current state-of-the-art neural approach in this low-resource context. |
---|
Publication date | 2020-11-19 |
---|
Date created | 2020-11-30 |
---|
Publisher | Association for Computational Linguistics |
---|
Language | English |
---|
Peer reviewed | Yes |
---|
Record identifier | 520de843-ed04-446c-83cc-d121adf6aa90 |
---|
Record created | 2020-11-30 |
---|
Record modified | 2020-11-30 |
---|