Imbalanced data stream classification algorithms
Principal investigator: prof. dr hab. inż. Michał Woźniak

Project description

The project "Imbalanced data stream classification algorithms" attempts to connect two important research trends in the design of effective data analysis algorithms: data stream classification and learning from data with a skewed class distribution. Real data streams may exhibit a high and varying class imbalance ratio, which further complicates the classification task.
Only a few authors distinguish between the imbalanced data stream classification problem and the scenario in which prior knowledge about the entire data set is available. The difference stems from the lack of knowledge about the class distribution, an issue that is especially pronounced in the initial stages of data stream classification.
Another difficulty is the phenomenon called concept drift, which usually leads to a deterioration of classifier quality. Concept drift may differ in nature, but it always changes the probabilistic characteristics of the decision task. For example, it could change the prior probabilities, i.e., the frequencies with which objects of the examined classes appear. A typical example is technical diagnosis, in which the fault probability increases with utilization time, possibly as a result of material fatigue. Sometimes the relationship between the minority and majority classes changes so that the former becomes the majority class. This phenomenon can also be observed in tasks related to social media analysis or in environmental hazard detection systems, such as oil spill detection. Strong imbalance is common in other domains as well: medical screening for a condition is usually performed on a large population of people without the condition in order to detect a small minority with it (e.g., HIV prevalence in the USA is ca. 0.4%), and the conversion rates of online ads have been estimated to lie between 0.00001 and 0.001 (according to T. Fawcett, Learning from Imbalanced Classes, 25th August 2017).
Additionally, in incremental learning, if the majority-class objects greatly outnumber the minority-class objects, the latter can be completely ignored. These issues are the reason why most methods for imbalanced data classification are restricted to offline learning, i.e., the case where the entire data set is provided prior to the analysis.
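To make the risk of ignoring the minority class concrete, the following minimal sketch scores a hypothetical classifier that always predicts the majority class, using the ca. 0.4% prevalence quoted above. The function name, the population size, and the choice of balanced accuracy as the contrasting metric are assumptions made for illustration only; they are not part of the project.

```python
def majority_baseline_scores(n_total, prevalence):
    """Score a classifier that always predicts the majority (negative) class.

    Hypothetical helper for illustration; `prevalence` is the minority share.
    """
    n_pos = round(n_total * prevalence)  # minority-class objects
    n_neg = n_total - n_pos              # majority-class objects

    # Always predicting "negative": every minority object is missed,
    # every majority object is classified correctly.
    tp, fn = 0, n_pos
    tn, fp = n_neg, 0

    accuracy = (tp + tn) / n_total
    recall = tp / (tp + fn) if n_pos else 0.0        # minority-class recall
    specificity = tn / (tn + fp) if n_neg else 0.0   # majority-class recall
    balanced_accuracy = (recall + specificity) / 2
    return accuracy, recall, balanced_accuracy


acc, rec, bac = majority_baseline_scores(n_total=100_000, prevalence=0.004)
print(f"accuracy={acc:.3f} recall={rec:.3f} balanced accuracy={bac:.3f}")
```

Plain accuracy looks excellent for this baseline even though no minority object is ever detected, which is why imbalance-aware metrics are generally preferred in this setting.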

The project puts forward the following research hypothesis:

Data stream classifiers trained with learning methods that take data imbalance into account can outperform classifiers trained with algorithms that do not take this characteristic into account.

The literature study conducted indicates the need to develop imbalanced data stream classification methods, with special attention to:
* Methods for determining the imbalance ratio, since most existing algorithms assume that it is known in advance.
* Non-stationary data stream classification methods, since most works assume that streams are stationary and ignore concept drift.
* Reducing the memory complexity of imbalanced data stream classification models, since some approaches assume that all incoming minority class objects are stored in memory.
* Dedicated methods for data pre-processing.
* Non-stationary data stream classification methods that are not based on the classifier ensemble paradigm.
* Methods using active learning for imbalanced data stream classifiers, since only a few works address this area.
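As an illustration of the first point in the list above, the sketch below estimates class priors and the imbalance ratio online, without assuming them in advance, and with constant memory. It uses exponential forgetting so that the estimate can follow a change of prior probabilities. The class name `FadingClassPriors`, the fading factor of 0.99, and the synthetic label stream are assumptions made for this example; this is one simple possibility, not the method developed in the project.

```python
class FadingClassPriors:
    """Online class-prior estimate with exponential forgetting (hypothetical helper)."""

    def __init__(self, classes, fading=0.99):
        if not 0.0 < fading <= 1.0:
            raise ValueError("fading must be in (0, 1]")
        self.fading = fading
        self.weights = {c: 0.0 for c in classes}  # one decayed count per class

    def update(self, label):
        # Decay all counts so that old labels matter less, then credit the new one.
        for c in self.weights:
            self.weights[c] *= self.fading
        self.weights[label] += 1.0

    def priors(self):
        total = sum(self.weights.values())
        if total == 0.0:
            # No labels seen yet: fall back to a uniform guess.
            return {c: 1.0 / len(self.weights) for c in self.weights}
        return {c: w / total for c, w in self.weights.items()}

    def imbalance_ratio(self):
        p = self.priors()
        smallest = min(p.values())
        return float("inf") if smallest == 0.0 else max(p.values()) / smallest


est = FadingClassPriors(classes=[0, 1])

# Phase 1 (synthetic): class 1 is the minority, one object in ten.
for i in range(1000):
    est.update(1 if i % 10 == 0 else 0)
before = est.priors()

# Phase 2 (synthetic): the priors drift and class 1 becomes the majority.
for i in range(1000):
    est.update(0 if i % 10 == 0 else 1)
after = est.priors()

print("class 1 prior before drift:", round(before[1], 3))
print("class 1 prior after drift: ", round(after[1], 3))
print("current imbalance ratio:   ", round(est.imbalance_ratio(), 2))
```

A smaller fading factor reacts faster to drift but yields a noisier estimate; the estimator stores only one number per class, regardless of stream length.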

The project will focus on the following research tasks:
* Developing classifiers for learning from stationary imbalanced data streams.
* Developing classifiers for learning from drifting imbalanced data streams.
* Creating an open-source software library for imbalanced data stream classification. The library will be designed and implemented using a technology and environment of our choosing, which will later be used to create a computer experimentation system. The implementation of the algorithms included in the library will be freely available under an open-source license.
* Evaluation of the proposed classification algorithms. An attempt at analytical evaluation of the designed methods will be made. However, because analytical assessment is often limited or impossible, the characteristics of the methods will be evaluated on the basis of computer experiments.

IDSTREAM project publications