Beyond the boundaries of SMOTE: a framework for manifold-based synthetically oversampling

By National Research Council of Canada

Author	Bellinger, Colin¹; Drummond, Christopher²; Japkowicz, Nathalie
Affiliation	National Research Council of Canada. NRC Institute for Aerospace Research; National Research Council of Canada. Information and Communication Technologies
Format	Text, Article
Conference	Joint European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2016, September 19-23, 2016, Riva del Garda, Italy
Subject	machine learning; class imbalance; synthetic oversampling; manifold and embeddings
Abstract	Problems of class imbalance appear in diverse domains, ranging from gene function annotation to spectra and medical classification. On such problems, the classifier becomes biased in favour of the majority class. This leads to inaccuracy on the important minority classes, such as specific diseases and gene functions. Synthetic oversampling mitigates this by balancing the training set, whilst avoiding the pitfalls of random under and oversampling. The existing methods are primarily based on the SMOTE algorithm, which employs a bias of randomly generating points between nearest neighbours. The relationship between the generative bias and the latent distribution has a significant impact on the performance of the induced classifier. Our research into gamma-ray spectra classification has shown that the generative bias applied by SMOTE is inappropriate for domains that conform to the manifold property, such as spectra, text, image and climate change classification. To this end, we propose a framework for manifold-based synthetic oversampling, and demonstrate its superiority in terms of robustness to the manifold with respect to the AUC on three spectra classification tasks and 16 UCI datasets.
Publication date	2016
Publisher	Springer
In	Machine Learning and Knowledge Discovery in Databases: 248–263.
Series	Lecture Notes in Computer Science.
Language	English
Peer reviewed	Yes
NPARC number	23002088
Record identifier	b1787f39-6e92-4586-8155-c85442a2d7c2
Record created	2017-08-10
Record modified	2020-03-16
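
The abstract describes SMOTE's generative bias as randomly generating points between nearest neighbours. The following is a minimal sketch of that interpolation step only, assuming NumPy is available; it is not the manifold-based framework proposed in the paper, and the function name `smote_like_oversample` and its parameters `k` and `seed` are hypothetical, chosen for illustration.

```python
import numpy as np


def smote_like_oversample(minority, n_synthetic, k=5, seed=0):
    """Sketch of SMOTE-style interpolation (hypothetical helper, not the paper's code).

    Each synthetic point is placed at a random position on the line segment
    between a minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise squared Euclidean distances; a sample is never its own neighbour.
    diffs = minority[:, None, :] - minority[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diffs, diffs)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base sample and one of its neighbours for every synthetic point.
    base = rng.integers(0, n, size=n_synthetic)
    chosen = neighbours[base, rng.integers(0, k, size=n_synthetic)]

    # Interpolate with a uniform random gap in [0, 1).
    gap = rng.random((n_synthetic, 1))
    return minority[base] + gap * (minority[chosen] - minority[base])


# Usage: four minority samples at the corners of the unit square.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(corners, n_synthetic=10, k=2)
```

With `k=2` every corner's nearest neighbours are its two adjacent corners, so the synthetic points fall on the edges of the square. This illustrates the bias the abstract refers to: new points follow straight lines between neighbours, whatever the shape of the underlying data distribution.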

Date modified: 2025-04-03