Data sampling and (In)stability in machine translation evaluation

From National Research Council Canada

DOI	https://doi.org/10.18653/v1/2023.findings-acl.826
Author	Lo, Chi-Kiu¹; Knowles, Rebecca¹
Affiliation	National Research Council of Canada. Digital Technologies
Format	Text, Article
Conference	The 61st Annual Meeting of the Association for Computational Linguistics (ACL’23), July 9-14, 2023, Toronto, Ontario, Canada
Abstract	We analyze the different data sampling approaches used in selecting data for human evaluation and ranking of machine translation systems at the highly influential Conference on Machine Translation (WMT). By using automatic evaluation metrics, we are able to focus on the impact of the data sampling procedure as separate from questions about human annotator consistency. We provide evidence that the latest data sampling approach used at WMT skews the annotated data toward shorter documents, not necessarily representative of the full test set. Lastly, we examine a new data sampling method that uses the available labour budget to sample data in a more representative manner, with the goals of improving representation of various document lengths in the sample and producing more stable rankings of system translation quality.
Publication date	2023-07-09
Publisher	Association for Computational Linguistics
Licence	Creative Commons, Attribution 4.0 International (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
In	Findings of the Association for Computational Linguistics: ACL 2023 (9 July 2023): 13064–13074.
Language	English
Peer reviewed	Yes
Record identifier	bceb07fa-0260-423d-91fc-7e7af550dc8b
Record created	2023-07-17
Record modified	2023-11-02

Date modified: 2024-07-21