DOI | https://doi.org/10.1007/978-3-030-18305-9_16 |
---|
Authors | Pagotto, Andrea; Littell, Patrick; Wang, Yunli; Goutte, Cyril |
---|
Affiliation | National Research Council Canada. Digital Technologies |
---|
Format | Text, Article |
---|
Conference | 32nd Canadian Conference on Artificial Intelligence, Canadian AI 2019, May 28–31, 2019, Kingston, ON, Canada |
---|
Subject | parallel corpus; misalignment; change point detection |
---|
Abstract | Parallel corpora are the basic resource for many multilingual natural language processing models. Recent advances in, e.g., neural machine translation have shown that the quality of the alignment in the corpus has a crucial impact on the quality of the resulting model, renewing interest in filtering automatically aligned corpora to increase their quality. In this contribution, we investigate the use of a fast change point detection method to detect possibly problematic parts of a parallel corpus. We demonstrate its performance on German-English corpora of 11k and 31k sentences, achieve a boundary identification performance above 80% and improve the detection of genuine parallel sentences up to 88%. To our knowledge this is the first application of change point detection to the problem of error detection in noisy corpora. |
---|
Publication date | 2019-04-24 |
---|
Publisher | Springer |
---|
Language | English |
---|
Peer-reviewed publication | Yes |
---|
Record identifier | 91bf3e6a-e23f-4d7b-892a-21b104531d59 |
---|
Record created | 2021-03-17 |
---|
Record modified | 2021-03-17 |
---|