An interpretable measure of dataset complexity for imbalanced classification problems

DOI	https://doi.org/10.1137/1.9781611977653.ch29
Author	Gøttcke, Jonatan Møller Nuutinen; Bellinger, Colin¹; Branco, Paula; Zimek, Arthur
Affiliation	National Research Council Canada. Digital Technologies
Format	Text, Article
Conference	2023 SIAM International Conference on Data Mining, April 27-29, Minneapolis, MN, U.S.
Abstract	The class imbalance problem is associated with harmful classification bias and presents itself in a wide variety of important applications of supervised machine learning. Measures have been developed to determine the imbalance complexity of datasets with imbalanced classes. The most common such measure is the Imbalance Ratio (IR). It is, however, widely accepted that the complexity of a classification task is the combined result of class imbalance and other factors, such as class overlap. Thus, in order to accurately assess the complexity of a problem, the data complexity measures ought to account for more than the simple IR. In this paper, we demonstrate that IR has a weak correlation with classifier performance in terms of macro-averaged recall, G-mean score, and precision. Other more complete measures such as the adapted N1 and N3 measures use neighborhood information to assess overlap. These measures show a strong negative correlation with classifier performance, but their reported values are hard to interpret. This motivates a new measure that estimates overlap complexity and returns a value with a clear interpretation. Here we propose such a measure based on the number of minority instances entangled in a Tomek link. The proposed measure is evaluated on a large selection of synthetic and real datasets and is found to be as good as or better than the best competitors in terms of its negative correlation with mean classifier performance.
Publication date	2023-04
Publisher	Society for Industrial and Applied Mathematics
In	Proceedings of the 2023 SIAM International Conference on Data Mining (SDM) (April 2023): 253–261.
Language	English
Peer reviewed	Yes
Record identifier	f036f49a-c2aa-450f-a9c4-595272c69fbe
Record created	2023-04-24
Record modified	2023-04-24
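
Illustrative sketch (not part of the record): the abstract contrasts the Imbalance Ratio with a measure built on minority instances caught in Tomek links. The abstract does not state the paper's exact formula, so the code below only assumes one plausible reading, namely the fraction of minority instances that take part in a Tomek link (a pair of mutual nearest neighbours with different class labels). The function names `imbalance_ratio` and `tomek_minority_fraction` are hypothetical, and Euclidean distance is an assumption.

```python
import numpy as np


def imbalance_ratio(y):
    """Imbalance Ratio: size of the largest class divided by the smallest."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.min()


def tomek_minority_fraction(X, y, minority_label):
    """Fraction of minority instances that belong to a Tomek link.

    Assumption: this normalisation is one reading of the abstract,
    not necessarily the definition used in the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise squared Euclidean distances (brute force, O(n^2) memory).
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    nearest = dist.argmin(axis=1)          # index of each point's nearest neighbour
    idx = np.arange(len(X))
    mutual = nearest[nearest] == idx       # nearest neighbours of each other
    opposite = y[nearest] != y             # the pair has different labels
    in_link = mutual & opposite            # point is part of a Tomek link

    minority = y == minority_label
    return in_link[minority].sum() / minority.sum()


if __name__ == "__main__":
    # Toy one-dimensional data: label 1 is the minority class.
    X = np.array([[0.0], [1.1], [2.0], [2.1], [10.0], [20.0], [20.5]])
    y = np.array([0, 0, 0, 1, 0, 1, 1])
    print("IR:", imbalance_ratio(y))
    print("Minority fraction in Tomek links:", tomek_minority_fraction(X, y, 1))
```

Under this reading the returned value lies between 0 and 1 and can be read directly as the share of minority points that sit next to a majority point, which is the kind of clear interpretation the abstract asks for. The brute-force distance matrix is only suitable for small datasets.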