National Research Council of Canada. Information and Communication Technologies
First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, August 23-29, 2014, Dublin, Ireland
We describe the system built by the National Research Council Canada for the “Discriminating between similar languages” (DSL) shared task. Our system uses various statistical classifiers and makes predictions in a two-stage process: we first predict the language group, then discriminate between languages or variants within that group. Language groups are predicted using a generative classifier with 99.99% accuracy on the five target groups. Within each group (except English), we use a voting combination of discriminative classifiers trained on a variety of feature spaces, achieving an average accuracy of 95.71%, with per-group accuracy ranging from 90.95% to 100%. This approach achieved the best performance among all systems submitted to the open and closed tasks.
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages: 139–145.
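
The two-stage process described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the classifiers or features, so the sketch assumes a character-trigram naive Bayes model as the generative group classifier and multiclass perceptrons as the discriminative voters, over three assumed feature spaces (character bigrams, character trigrams, words). All class and function names, and the toy sentences and labels, are hypothetical. Unlike the described system, the sketch applies voting to every group, English included.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n):
    """Character n-grams of the lowercased text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def words(text):
    return text.lower().split()

class NaiveBayes:
    """Generative multinomial naive Bayes with add-one smoothing (assumed model)."""
    def __init__(self, featurize):
        self.featurize = featurize
        self.counts = defaultdict(Counter)  # label -> feature counts
        self.docs = Counter()               # label -> number of training texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = self.featurize(text)
            self.counts[label].update(feats)
            self.docs[label] += 1
            self.vocab.update(feats)
        return self

    def predict(self, text):
        feats = self.featurize(text)
        total_docs = sum(self.docs.values())

        def score(label):
            c = self.counts[label]
            denom = sum(c.values()) + len(self.vocab) + 1  # +1 for unseen features
            prior = math.log(self.docs[label] / total_docs)
            return prior + sum(math.log((c[f] + 1) / denom) for f in feats)

        return max(sorted(self.docs), key=score)

class Perceptron:
    """Discriminative multiclass perceptron over sparse counts (assumed model)."""
    def __init__(self, featurize, epochs=10):
        self.featurize = featurize
        self.epochs = epochs
        self.weights = {}  # label -> {feature: weight}
        self.labels = []

    def _score(self, feats, label):
        w = self.weights[label]
        return sum(w.get(f, 0.0) * v for f, v in feats.items())

    def _best(self, feats):
        # Ties go to the alphabetically first label.
        return max(self.labels, key=lambda label: self._score(feats, label))

    def fit(self, texts, labels):
        self.labels = sorted(set(labels))
        self.weights = {label: {} for label in self.labels}
        data = [(Counter(self.featurize(t)), y) for t, y in zip(texts, labels)]
        for _ in range(self.epochs):
            for feats, gold in data:
                guess = self._best(feats)
                if guess != gold:  # standard perceptron update on a mistake
                    for f, v in feats.items():
                        self.weights[gold][f] = self.weights[gold].get(f, 0.0) + v
                        self.weights[guess][f] = self.weights[guess].get(f, 0.0) - v
        return self

    def predict(self, text):
        return self._best(Counter(self.featurize(text)))

class TwoStageIdentifier:
    """Stage 1: predict the group. Stage 2: majority vote within that group."""
    def __init__(self, feature_spaces):
        self.feature_spaces = feature_spaces
        self.group_model = NaiveBayes(lambda t: char_ngrams(t, 3))
        self.voters = {}  # group -> one perceptron per feature space

    def fit(self, texts, groups, languages):
        self.group_model.fit(texts, groups)
        for group in set(groups):
            idx = [i for i, g in enumerate(groups) if g == group]
            sub_texts = [texts[i] for i in idx]
            sub_langs = [languages[i] for i in idx]
            self.voters[group] = [Perceptron(fs).fit(sub_texts, sub_langs)
                                  for fs in self.feature_spaces]
        return self

    def predict(self, text):
        group = self.group_model.predict(text)
        votes = Counter(m.predict(text) for m in self.voters[group])
        return group, votes.most_common(1)[0][0]

# Toy data, invented for illustration only.
texts = ["the colour of the harbour", "the color of the harbor",
         "el ordenador es nuevo", "la computadora es nueva"]
groups = ["en", "en", "es", "es"]
languages = ["en-GB", "en-US", "es-ES", "es-AR"]

spaces = [lambda t: char_ngrams(t, 2), lambda t: char_ngrams(t, 3), words]
model = TwoStageIdentifier(spaces).fit(texts, groups, languages)
print(model.predict("the harbour"))
print(model.predict("la computadora"))
```

Splitting the problem this way lets the cheap group decision, which the abstract reports as nearly error-free, narrow the label set before the harder within-group decision. Each within-group voter then only has to separate closely related varieties, and voters built on different feature spaces can compensate for each other's errors.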