Download | - View final version: On cross-dataset generalization in automatic detection of online abuse (PDF, 679 KiB)
- View accepted manuscript: On cross-dataset generalization in automatic detection of online abuse (PDF, 641 KiB)
|
---|
Author | Nejadgholi, Isar (1); Kiritchenko, Svetlana (1) |
---|
Affiliation | - (1) National Research Council of Canada. Digital Technologies
|
---|
Format | Text, Article |
---|
Conference | WOAH 2020 - Fourth Workshop on Online Abuse and Harms, November 20th, 2020 - [Held Online] |
---|
Abstract | NLP research has attained high performances in abusive language detection as a supervised classification task. While in research settings, training and test datasets are usually obtained from similar data samples, in practice systems are often applied on data that are different from the training set in topic and class distributions. Also, the ambiguity in class definitions inherited in this task aggravates the discrepancies between source and target datasets. We explore the topic bias and the task formulation bias in cross-dataset generalization. We show that the benign examples in the Wikipedia Detox dataset are biased towards platform-specific topics. We identify these examples using unsupervised topic modeling and manual inspection of topics’ keywords. Removing these topics increases cross-dataset generalization, without reducing in-domain classification performance. For a robust dataset design, we suggest applying inexpensive unsupervised methods to inspect the collected data and downsize the non-generalizable content before manually annotating for class labels. |
---|
Publication date | 2020-11-20 |
---|
Date created | 2020-11-30 |
---|
Publisher | Association for Computational Linguistics |
---|
Licence | |
---|
In | |
---|
Language | English |
---|
Peer reviewed | Yes |
---|
Record identifier | 846aa815-b9d6-4b34-9501-9163df950d7a |
---|
Record created | 2020-11-30 |
---|
Record modified | 2020-12-02 |
---|