Imbalanced data stream classification algorithms

Grant no. 2017/27/B/ST6/01325, funded by the Polish National Science Centre with a budget of 613 920 zł

Project leader: prof. dr hab. inż. Michał Woźniak

Machine Learning Team (Zespół Uczenia Maszynowego)

About the project

The project "Imbalanced data stream classification algorithms" attempts to connect two important research trends in the design of effective data analysis algorithms: data stream classification and learning from data with a skewed class distribution. Real data streams may exhibit a high and varying class imbalance ratio, which further complicates the classification task.
Only a few authors distinguish between the imbalanced data stream classification problem and a scenario in which prior knowledge about the entire data set is available. The difference stems from the lack of knowledge about the class distribution, an issue that is especially pronounced in the initial stages of data stream classification.
Another difficulty is the phenomenon called concept drift, which can lead to a deterioration of classifier quality. Concept drift may differ in nature, but it always changes the probabilistic characteristics of the decision task; for example, it can change the prior probabilities, i.e., the frequency with which objects from the examined classes appear. A typical example is technical diagnosis, in which the fault probability increases with utilization time, possibly as a result of material fatigue. Sometimes the relationship between the minority and majority classes changes so that the former becomes the majority class. This phenomenon can also be observed in tasks related to social media analysis or in environmental hazard detection systems, such as oil spill detection. It is also worth mentioning that medical screening for a condition is usually performed on a large population of people without the condition in order to detect a small minority with it (e.g., HIV prevalence in the USA is ca. 0.4%), and that the conversion rates of online ads have been estimated to lie between 0.00001 and 0.001 (according to T. Fawcett, Learning from Imbalanced Classes, 25th August 2017).
Additionally, in incremental learning, if the majority-class objects greatly outnumber the minority-class objects, the latter can be completely ignored. These issues are the reason why most methods for imbalanced data classification are restricted to offline learning, i.e., the case in which the entire data set is available prior to the analysis.
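The risk of the minority class being ignored can be illustrated with a small worked example. The sketch below is not taken from the project; it assumes hypothetical screening data with the 0.4% prevalence quoted above and a classifier that always predicts the majority class. Plain accuracy looks excellent, while balanced accuracy (the mean of per-class recalls, used here only as one possible imbalance-aware measure) shows that the minority class is never detected.

```python
def accuracy(y_true, y_pred):
    """Fraction of objects whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = []
    for label in set(y_true):
        indices = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(y_pred[i] == label for i in indices)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


# Hypothetical data: 4 positive cases among 1000 objects (0.4% prevalence).
y_true = [1] * 4 + [0] * 996
# A classifier that ignores the minority class entirely.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)           # 996 correct out of 1000
bac = balanced_accuracy(y_true, y_pred)  # recall 1.0 for class 0, 0.0 for class 1

print(f"accuracy = {acc:.3f}, balanced accuracy = {bac:.3f}")
```

Under these assumptions the accuracy is 0.996 and the balanced accuracy is 0.5, which is the score of a classifier that carries no information about the minority class.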

The project puts forward the following research hypothesis:

Data stream classifiers trained on the basis of learning methods taking into consideration data imbalance can outperform classifiers trained on the basis of algorithms which do not take this characteristic into consideration.

The literature study indicates the need to develop imbalanced data stream classification methods, with special attention to:
* Methods for determining the imbalance ratio, since most existing algorithms assume it is known in advance.
* Non-stationary data stream classification methods, since most works assume the streams are stationary and ignore concept drift.
* Reducing the memory complexity of imbalanced data stream classification models, since some approaches assume that all incoming minority-class objects are stored in memory.
* Dedicated methods for data pre-processing.
* Non-stationary data stream classification methods that are not based on the classifier ensemble paradigm.
* Methods using active learning for imbalanced data stream classifiers, since only a few works address this area.
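To make the first point of the list concrete, the following is a minimal sketch of one way the imbalance ratio could be tracked on a stream whose priors drift, using constant memory. It is an illustration only, not the project's method: the `PriorEstimator` class, its `forgetting` parameter, and the synthetic stream are all assumptions introduced here.

```python
class PriorEstimator:
    """Running class-prior estimate with exponential forgetting.

    Hypothetical helper for illustration; not part of the project's library.
    """

    def __init__(self, n_classes=2, forgetting=0.99):
        self.forgetting = forgetting      # closer to 1.0 means longer memory
        self.weights = [0.0] * n_classes  # decayed per-class counts

    def update(self, label):
        # Decay all past evidence, then count the newly arrived object.
        self.weights = [w * self.forgetting for w in self.weights]
        self.weights[label] += 1.0

    def priors(self):
        total = sum(self.weights)
        if total == 0.0:
            # Nothing seen yet: fall back to a uniform guess.
            return [1.0 / len(self.weights)] * len(self.weights)
        return [w / total for w in self.weights]

    def imbalance_ratio(self):
        # Most frequent over least frequent class (inf if a class is unseen).
        smallest = min(self.weights)
        return float("inf") if smallest == 0.0 else max(self.weights) / smallest


# Synthetic, deterministic stream with a drift in the priors:
# class 1 makes up 5% of the first 2000 objects and 60% of the next 2000,
# so the former minority class becomes the majority class.
phase_one = [1 if i % 20 == 0 else 0 for i in range(2000)]
phase_two = [1 if i % 5 < 3 else 0 for i in range(2000)]

estimator = PriorEstimator()
for label in phase_one:
    estimator.update(label)
priors_before = estimator.priors()
ratio_before = estimator.imbalance_ratio()

for label in phase_two:
    estimator.update(label)
priors_after = estimator.priors()

print(priors_before, ratio_before, priors_after)
```

The estimator stores one number per class regardless of stream length. The forgetting factor trades stability against reaction speed: values near 1.0 give smoother estimates but adapt more slowly after a drift.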

The project will focus on the following research tasks:
* Developing classifiers for learning from stationary imbalanced data streams.
* Developing classifiers for learning from drifting imbalanced data streams.
* Creating an open-source software library for imbalanced data stream classification. The library will be designed and implemented using a technology and environment of our choosing, which will later be used to create a computer experimentation system. The source code of the algorithms included in the library will be freely available under an open-source license.
* Evaluation of the proposed classification algorithms. An attempt at an analytical evaluation of the designed methods will be made. However, because such assessment is often limited or impossible, the characteristics of the methods will be evaluated on the basis of computer experiments.


Zyblewski, Pawel
Wozniak, Michal
Novel clustering-based pruning algorithms 
in Pattern Analysis and Applications, 2020
Zgraja, Jakub
Moulton, Richard Hugh
Gama, João
Kasprzak, Andrzej
Wozniak, Michal
Adapting ClusTree for more challenging data stream environments 
in Journal of Intelligent and Fuzzy Systems, 2019
Lu, Lin
Woźniak, Michal
Choraś, Michal
Choraś, Ryszard S.
Imbalanced Data Classification Using Weighted Voting Ensemble 
Springer International Publishing 2020
Klikowski, Jakub
Woźniak, Michal
Burduk, Robert
Kurzynski, Marek
Wozniak, Michal
Multi Sampling Random Subspace Ensemble for Imbalanced Data Stream Classification 
Springer International Publishing 2020
Zyblewski, Pawel
Woźniak, Michal
Pérez García, Hilde
Sánchez González, Lidia
Castejón Limas, Manuel
Quintián Pardo, Héctor
Corchado Rodríguez, Emilio
Clustering-Based Ensemble Pruning and Multistage Organization Using Diversity 
Springer International Publishing 2019
Zyblewski, Pawel
Ksieniewicz, Pawel
Woźniak, Michal
Rutkowski, Leszek
Scherer, Rafal
Korytkowski, Marcin
Pedrycz, Witold
Tadeusiewicz, Ryszard
Zurada, Jacek M.
Classifier Selection for Highly Imbalanced Data Streams with Minority Driven Ensemble 
Springer International Publishing 2019
Krawczyk, Bartosz
Wozniak, Michal
Rodrigues, João M. F.
Cardoso, Pedro J. S.
Monteiro, Jânio
Lam, Roberto
Krzhizhanovskaya, Valeria V.
Lees, Michael H.
Dongarra, Jack J.
Sloot, Peter M.A.
On the Role of Cost-Sensitive Learning in Imbalanced Data Oversampling 
Springer International Publishing 2019
Klikowski, Jakub
Ksieniewicz, Pawel
Woźniak, Michal
Yin, Hujun
Camacho, David
Tino, Peter
Tallón-Ballesteros, Antonio J.
Menezes, Ronaldo
Allmendinger, Richard
A Genetic-Based Ensemble Learning Applied to Imbalanced Data Classification 
Springer International Publishing 2019