Concept-based explanations to test for false causal relationships learned by abusive language classifiers

DOI	https://doi.org/10.18653/v1/2023.woah-1.14
Author	Nejadgholi, Isar¹; Kiritchenko, Svetlana¹; Fraser, Kathleen C.¹; Balkir, Esma¹
Affiliation	National Research Council Canada. Digital Technologies
Funder	National Research Council Canada. Digital Privacy and Security Program
Format	Text, Article
Conference	The 7th Workshop on Online Abuse and Harms (WOAH), July 13, 2023, Toronto, Ontario
Abstract	Classifiers tend to learn a false causal relationship between an over-represented concept and a label, which can result in over-reliance on the concept and compromised classification accuracy. It is imperative to have methods in place that can compare different models and identify over-reliances on specific concepts. We consider three well-known abusive language classifiers trained on large English datasets and focus on the concept of negative emotions, which is an important signal but should not be learned as a sufficient feature for the label of abuse. Motivated by the definition of global sufficiency, we first examine the unwanted dependencies learned by the classifiers by assessing their accuracy on a challenge set across all decision thresholds. Further, recognizing that a challenge set might not always be available, we introduce concept-based explanation metrics to assess the influence of the concept on the labels. These explanations allow us to compare classifiers regarding the degree of false global sufficiency they have learned between a concept and a label.
Publication date	2023-07-13
Publisher	Association for Computational Linguistics
Licence	Creative Commons Attribution 4.0 International (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
In	The 7th Workshop on Online Abuse and Harms (WOAH), 2023.woah-1.14 (13 July 2023): 138–149.
Language	English
Peer reviewed	Yes
Record identifier	b9e78f14-9675-41a6-b8ae-c205d6ba8b98
Record created	2023-07-17
Record modified	2025-12-18

Date modified: 2026-04-20