Experiments in discriminating similar languages

Author	Goutte, Cyril¹; Léger, Serge¹
Affiliation	National Research Council of Canada. Information and Communication Technologies
Format	Text, Article
Conference	LT4VarDial - Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, September 10th, 2015, Hissar, Bulgaria
Abstract	We describe the system built by the National Research Council (NRC) Canada for the 2015 shared task on Discriminating between similar languages. The NRC system uses various statistical classifiers trained on character and word ngram features. Predictions rely on a two-stage process: we first predict the language group, then discriminate between languages or variants within the group. This year, we focused on two issues: 1) the ngram generation process, and 2) the handling of the anonymized (“blinded”) Named Entities. Despite the slightly harder experimental conditions this year, our systems achieved an average accuracy of 95.24% (closed task) and 95.65% (open task), ending up second or (close) third on the closed task, and first on the open task.
Publication date	2015-09
In	Proceedings of LT4VarDial (September 2015).
Language	English
Peer reviewed	Yes
NPARC number	21276326
Record identifier	884cac9b-7d70-4078-9542-3f5980852d99
Record created	2015-10-02
Record modified	2020-06-04
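
The two-stage prediction described in the abstract (first predict the language group, then discriminate between languages or variants within that group) can be illustrated with a minimal sketch. This is not the NRC system: the record does not say which statistical classifiers were used, so a small multinomial naive Bayes over character n-grams stands in for them, word n-grams are omitted, and the names `char_ngrams`, `NgramNB`, `TwoStageClassifier`, `GROUP_OF`, and `TRAIN`, along with the toy sentences, are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max from the padded, lowercased text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NgramNB:
    """Stand-in classifier (assumption): multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.docs = Counter(labels)          # label -> number of training texts
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        grams = [g for g in char_ngrams(text) if g in self.vocab]
        total_docs = sum(self.docs.values())
        best, best_score = None, -math.inf
        for label, counts in self.counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.docs[label] / total_docs)
            score += sum(math.log((counts[g] + 1) / denom) for g in grams)
            if score > best_score:
                best, best_score = label, score
        return best


class TwoStageClassifier:
    """Stage 1 predicts the language group; stage 2 picks a language within that group."""

    def fit(self, texts, langs, group_of):
        groups = [group_of[lang] for lang in langs]
        self.group_clf = NgramNB().fit(texts, groups)
        self.lang_clfs = {}
        for group in set(groups):
            idx = [i for i, g in enumerate(groups) if g == group]
            self.lang_clfs[group] = NgramNB().fit(
                [texts[i] for i in idx], [langs[i] for i in idx])
        return self

    def predict(self, text):
        group = self.group_clf.predict(text)
        return self.lang_clfs[group].predict(text)


# Hypothetical language-to-group mapping and invented toy sentences.
GROUP_OF = {"pt-BR": "pt", "pt-PT": "pt", "es-ES": "es", "es-AR": "es"}
TRAIN = [
    ("pt-BR", "o ônibus chegou atrasado ao ponto"),
    ("pt-PT", "o autocarro chegou atrasado à paragem"),
    ("es-ES", "el autobús llegó tarde a la parada"),
    ("es-AR", "el colectivo llegó tarde a la parada"),
]

langs, texts = zip(*TRAIN)
clf = TwoStageClassifier().fit(list(texts), list(langs), GROUP_OF)
print(clf.predict("o autocarro chegou"))
```

Because each second-stage classifier is trained only on texts from its own group, it only has to separate closely related varieties, while the first stage handles the comparatively easy group decision. With four toy sentences the predictions carry no weight; the sketch only shows how the two stages connect.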