Effect of imbalanced data on document classification algorithms
Text classification is the task of assigning predefined categories to free-text documents. Given the ever-increasing volume of electronic documents, digital libraries and web resources, document classification is critical to higher-level document processing tasks such as information extraction, named entity recognition and event modelling. Text categorization is considered challenging because of the large number of features in a typical text document. Despite this, various categorization algorithms have reached accuracies in the vicinity of 90%. It has generally been found that probability-based algorithms perform better on Natural Language Processing tasks than other types of algorithms, and they are also highly extensible.
In this thesis, a tool called MALLET (MAchine Learning for LanguagE Toolkit) was used to perform document classification with a set of probabilistic algorithms, in order to determine how imbalanced data affects their performance relative to balanced data. The data for the research was taken from the Reuters Corpus Volume 1 (RCV1), which contains categorized newswire articles. Although the corpus provides many fine-grained levels of categorization, this research used four upper-level topic codes, which were further reduced to binary labels indicating whether a document belongs to a given category or not. The documents were then converted into a form acceptable to MALLET and tested for categorization with the chosen algorithms.
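The binary in/out labelling described above can be sketched as follows. This is an illustrative example only, not the thesis code: the topic code, document IDs and texts are placeholders, and the output follows MALLET's tab-separated "name label data" one-document-per-line import format.

```python
# Sketch: flatten multi-label RCV1-style topic codes into binary labels for
# one target category, emitting lines in MALLET's "name<TAB>label<TAB>text"
# import-file format. Topic code and documents are hypothetical placeholders.

def to_mallet_lines(docs, target="CCAT"):
    """docs: iterable of (doc_id, topic_codes, text) tuples.
    Returns one line per document, labelled 'in' or 'out' of the target topic."""
    lines = []
    for doc_id, codes, text in docs:
        label = "in" if target in codes else "out"
        lines.append(f"{doc_id}\t{label}\t{text}")
    return lines
```

A file built this way would typically be loaded and trained with MALLET's command-line tools (along the lines of `mallet import-file` followed by `mallet train-classifier --trainer NaiveBayes`), though the exact invocation depends on the MALLET version and setup.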
The algorithms used in the research were Naïve Bayes, Balanced Winnow and three variations of maximum entropy classifiers, namely MaxEnt, MaxEnt L1 and MC MaxEnt. It was first found that these probability-based algorithms performed marginally better than the algorithms reported in previous work on similar genres of input data. A more significant finding, however, was that the algorithms performed similarly on imbalanced data, and in some cases even better, compared to balanced data. This was attributable to the vocabulary properties of the documents used for training, and it attests to the resilience of probability-based algorithms for text categorization.
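To make concrete why a probabilistic classifier can remain usable under class imbalance, the following is a minimal multinomial Naïve Bayes sketch with Laplace smoothing, written from the textbook formulation rather than taken from MALLET. The toy documents and labels are invented for illustration; the point is that even with a skewed class prior, a distinctive vocabulary in the minority class can dominate the likelihood term.

```python
from collections import Counter
import math

def train_nb(docs, labels):
    """Train multinomial Naive Bayes with Laplace smoothing.
    docs: list of token lists; labels: parallel list of class labels."""
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    word_counts = {c: Counter() for c in classes}
    for toks, c in zip(docs, labels):
        word_counts[c].update(toks)
    vocab = {w for counts in word_counts.values() for w in counts}
    totals = {c: sum(word_counts[c].values()) for c in classes}
    return priors, word_counts, totals, vocab

def predict_nb(model, toks):
    """Return the class maximizing log P(c) + sum_w log P(w|c)."""
    priors, word_counts, totals, vocab = model
    v = len(vocab)
    best, best_lp = None, float("-inf")
    for c in priors:
        lp = math.log(priors[c])
        for w in toks:
            # Laplace (add-one) smoothing avoids zero probabilities
            lp += math.log((word_counts[c][w] + 1) / (totals[c] + v))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

With four documents in one class and a single document in the other, the minority class can still win a prediction whenever its characteristic words appear, since the smoothed likelihoods for unseen words in the majority class are small.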