Feature Selection, Classification and Knowledge Discovery for Minority Class Imbalanced Data

Jansari, Vinita Gangaram. Auckland University of Technology, 2023. Thesis. Flackett, John; Nand, Parma. 2024-01-08. en. http://hdl.handle.net/10292/17073. OpenAccess.

We are in the era of rapid data generation and analysis using machine learning and artificial intelligence techniques. A crucial challenge with numerous datasets is the underlying imbalance in the data. Data mining using imbalanced datasets continues to be a challenge after years of research. The biomedical domain is one of the domains with a prevalence of class imbalance in the data. Within the biomedical domain, it is essential to obtain the maximum number of correct predictions on the minority class (the disease class or the class of interest) without reducing the number of correct predictions on the majority class. In biomedical datasets, a diagnosis wrongly indicating the presence of a disease is as significant as a diagnosis wrongly indicating the absence of a disease. Biomedical datasets with a large number of features are also high-dimensional, and removing redundant and irrelevant features from the data is essential before performing classification or prediction. Even though a significant amount of research has been undertaken in the field of imbalanced classification, the most widely used approaches for mining imbalanced data artificially change the dataset to make it balanced. Only a limited amount of research has addressed data that is at once high-dimensional, imbalanced, and overlapping. A novel probability-based feature pruning classifier, named the Positive confidence based feature pruning classifier (Pconf-FPC), is proposed in this thesis to increase the number of correct classifications/predictions on the minority class data and to automatically reduce the overlap and the dimensions in the data.
Pconf-FPC utilises the underlying characteristics of the data to assign a confidence value to each sample, signifying the probability of the sample belonging to the positive class (the minority class in our case). The performance of Pconf-FPC is tested on publicly available datasets and a case study dataset. Pconf-FPC can improve the F2 score of the minority class by 10% compared to widely used techniques like Support Vector Machines and Random Forests. The performance of Pconf-FPC is also compared to SMOTE and AdaBoost, and Pconf-FPC provides a higher weighted F2 score on test data for three datasets. Additionally, Pconf-FPC reduces the degree of overlap for two publicly available datasets and the case study dataset after performing feature pruning.
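The F2 score used as the evaluation measure above is the F-beta score with beta = 2, which weights recall on the class of interest more heavily than precision. The sketch below is illustrative only and is not taken from the thesis: the function names and the confusion counts are hypothetical, and the "weighted" variant assumes the common convention of averaging per-class scores weighted by class support, which the abstract does not spell out.

```python
from typing import Sequence, Tuple


def fbeta_score(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta for one class from its confusion counts.

    F_beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP)
    With beta = 2, a false negative costs four times as much as a false positive.
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def weighted_fbeta(per_class: Sequence[Tuple[int, int, int]], beta: float = 2.0) -> float:
    """Support-weighted mean of per-class F-beta scores (assumed convention).

    Each entry is (tp, fp, fn) for one class; its support is tp + fn.
    """
    total = sum(tp + fn for tp, _, fn in per_class)
    if total == 0:
        return 0.0
    return sum((tp + fn) * fbeta_score(tp, fp, fn, beta) for tp, fp, fn in per_class) / total


# Hypothetical test set: 10 minority samples, 90 majority samples.
minority = (8, 4, 2)   # 8 found, 4 false alarms, 2 missed
majority = (86, 2, 4)  # the same predictions seen from the majority class

f2_minority = fbeta_score(*minority)               # 40/52, about 0.769
f2_weighted = weighted_fbeta([minority, majority])
print(f"minority F2 = {f2_minority:.3f}, weighted F2 = {f2_weighted:.3f}")
```

In this example the weighted score is dominated by the majority class, which is why the minority-class F2 is worth reporting separately on imbalanced data.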