Learning from Imbalanced Data
The problem of data imbalance refers to the uneven distribution of training examples across the classes. In such cases, the vast majority of the input samples belong to a single class (the majority class), whereas the remaining classes are significantly underrepresented (the minority classes). Today, a broad variety of application areas suffer from class imbalance, including cybersecurity, bioinformatics, natural language processing, and image and multimedia data processing.
Using such data to train machine learning models is almost always problematic, since the resulting models are strongly biased towards the majority class and cannot learn the minority classes sufficiently. As a consequence, their performance on the minority classes and their generalization capabilities are significantly degraded, even when the overall accuracy appears high.
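The bias towards the majority class can be illustrated with a minimal sketch. It assumes a hypothetical binary dataset with 95 majority and 5 minority examples, and a trivial baseline that always predicts the most frequent label; the helper names (`majority_baseline`, `accuracy`, `recall`) are illustrative and not taken from any library.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Build a classifier that always predicts the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda _sample: majority_label


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of examples of class `positive` that were predicted as `positive`."""
    predicted_for_positive = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predicted_for_positive:
        return 0.0
    return sum(p == positive for p in predicted_for_positive) / len(predicted_for_positive)


# Hypothetical imbalanced labels: 95 majority (0) and 5 minority (1) examples.
labels = [0] * 95 + [1] * 5

predict = majority_baseline(labels)
predictions = [predict(None) for _ in labels]

overall_accuracy = accuracy(labels, predictions)
minority_recall = recall(labels, predictions, positive=1)

print(f"accuracy: {overall_accuracy:.2f}, minority recall: {minority_recall:.2f}")
```

Under these assumptions the baseline reaches high overall accuracy while never detecting a single minority example, which is why accuracy alone is a poor indicator on imbalanced data.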
In this tutorial, the most recent advances in the field of classification with imbalanced data will be presented. The underlying techniques will be categorized according to the approach they apply to confront the problem. The entire family of resampling techniques (over-sampling, under-sampling, hybrid sampling, etc.) will be reviewed in detail. Subsequently, algorithm-based approaches and cost-sensitive learning methods will be analyzed. The second part will summarize the current conclusions and will include a description of the most important challenges in the area that are still open. Finally, some insights into ongoing and future research will be discussed.
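As a rough illustration of two of the families named above, the sketch below shows random over-sampling (one of the simplest resampling techniques) and inverse-frequency class weights (a common heuristic used in cost-sensitive learning). It is a minimal example under stated assumptions, not a method taken from the tutorial itself: the function names, the 90/10 toy dataset, and the weighting formula `n_samples / (n_classes * class_count)` are all chosen here for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority examples until all classes match the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw the missing examples with replacement from the class itself.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


def inverse_frequency_weights(labels):
    """Give each class a weight inversely proportional to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical toy dataset: 90 majority (0) and 10 minority (1) examples.
samples = list(range(100))
labels = [0] * 90 + [1] * 10

balanced_samples, balanced_labels = random_oversample(samples, labels)
balanced_counts = Counter(balanced_labels)
weights = inverse_frequency_weights(labels)

print(dict(balanced_counts))
print(weights)
```

Over-sampling changes the training set itself, whereas class weights leave the data untouched and instead make errors on minority examples more expensive in the loss; which one works better depends on the model and the dataset.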