DOI | https://doi.org/10.1007/978-3-030-18305-9_16 |
---|
Author | Pagotto, Andrea; Littell, Patrick; Wang, Yunli; Goutte, Cyril |
---|
Affiliation | National Research Council of Canada. Digital Technologies |
---|
Format | Text, Article |
---|
Conference | 32nd Canadian Conference on Artificial Intelligence, Canadian AI 2019, May 28–31, 2019, Kingston, ON, Canada |
---|
Subject | parallel corpus; misalignment; change point detection |
---|
Abstract | Parallel corpora are the basic resource for many multilingual natural language processing models. Recent advances in, e.g., neural machine translation have shown that the quality of the alignment in the corpus has a crucial impact on the quality of the resulting model, renewing interest in filtering automatically aligned corpora to increase their quality. In this contribution, we investigate the use of a fast change point detection method to detect possibly problematic parts of a parallel corpus. We demonstrate its performance on German-English corpora of 11k and 31k sentences, achieving a boundary identification performance above 80% and improving the detection of genuine parallel sentences up to 88%. To our knowledge, this is the first application of change point detection to the problem of error detection in noisy corpora. |
---|
Publication date | 2019-04-24 |
---|
Publisher | Springer |
---|
Language | English |
---|
Peer reviewed | Yes |
---|
Record identifier | 91bf3e6a-e23f-4d7b-892a-21b104531d59 |
---|
Record created | 2021-03-17 |
---|
Record modified | 2021-03-17 |
---|