C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling

Author	Drummond, Chris; Holte, R.C.
Format	Text, Article
Conference	International Conference on Machine Learning (ICML 2003) Workshop on Learning from Imbalanced Data Sets II, July 21, 2003, Washington, DC, USA
Abstract	This paper takes a new look at two sampling schemes commonly used to adapt machine learning algorithms to imbalanced classes and misclassification costs. It uses a performance analysis technique called cost curves to explore the interaction of over- and under-sampling with the decision tree learner C4.5. C4.5 was chosen because, when combined with one of the sampling schemes, it is quickly becoming the community standard for evaluating new cost-sensitive learning algorithms. This paper shows that using C4.5 with under-sampling establishes a reasonable standard for algorithmic comparison. It recommends, however, that the least-cost classifier be part of that standard, as it can be better than under-sampling for relatively modest costs. Over-sampling, by contrast, shows little sensitivity: there is often little difference in performance when misclassification costs are changed.
Publication date	2003
In	Proceedings of the International Conference on Machine Learning (ICML 2003) Workshop on Learning from Imbalanced Data Sets II.
Language	English
NRC number	NRCC 47381
NPARC number	5765075
Record identifier	04bc81ac-d061-4ea9-bef4-04a836a682be
Record created	2009-03-29
Record modified	2021-01-05