Effect of imbalanced data on document classification algorithms

aut.embargoNoen_NZ
aut.thirdpc.containsNoen_NZ
aut.thirdpc.permissionNoen_NZ
aut.thirdpc.removedNoen_NZ
dc.contributor.advisorNand, Parma
dc.contributor.authorPaul, Amrita
dc.date.accessioned2014-07-08T04:09:31Z
dc.date.available2014-07-08T04:09:31Z
dc.date.copyright2014
dc.date.created2014
dc.date.issued2014
dc.date.updated2014-07-08T03:32:41Z
dc.description.abstractText classification is the task of assigning predefined categories to free text documents. Due to the ever-increasing amount of electronic documents, digital libraries and web resources, document classification is critical in higher level document processing tasks such as information extraction, named entity recognition and event modelling. Text categorization is considered to be challenging because of the large number of features in a typical text document. In spite of this, various categorization algorithms have reached accuracies in the vicinity of 90%. It has generally been found that probability based algorithms perform better on Natural Language Processing tasks compared to other types of algorithms. This is in addition to probabilistic algorithms being highly extensible. In this thesis paper, a tool called MALLET (MAchine Learning for LanguagE Toolkit) was used to perform document classification using a set of probabilistic algorithms to determine the effect of imbalanced data on the performance of these algorithms when compared to balanced data. The data used for the research was taken from Reuters Corpus (RCV1) which contains categorized newspaper articles. Although the corpus contains many fine levels of categorization, this research used four upper level topic codes which were further organized into binary categories of a document belonging to a category or out of it. The documents were then converted into a form acceptable to MALLET and tested for categorization with the chosen algorithms. The algorithms used for the research were Naïve Bayes, Balanced Winnow and three variations of Max Ent, namely Max Ent, Max Ent L1 and MC Max Ent. It was firstly found that these probability based algorithms performed marginally better than other algorithms reported in previous works on similar genre of input data. However, a significant finding from the research was that the algorithms performed similarly or in some cases even better, for imbalanced data compared to balanced data. This was due to the vocabulary properties of the documents used for training and asserts the resilience of the probability based algorithms for text categorization.en_NZ
dc.identifier.urihttps://hdl.handle.net/10292/7413
dc.language.isoenen_NZ
dc.publisherAuckland University of Technology
dc.rights.accessrightsOpenAccess
dc.subjectClassifyen_NZ
dc.subjectTextsen_NZ
dc.titleEffect of imbalanced data on document classification algorithmsen_NZ
dc.typeThesis
thesis.degree.discipline
thesis.degree.grantorAuckland University of Technology
thesis.degree.levelMasters Theses
thesis.degree.nameMaster of Computer and Information Sciencesen_NZ
Files
Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
PaulA.pdf
Size:
1.57 MB
Format:
Adobe Portable Document Format
Description:
Whole thesis
License bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
license.txt
Size:
897 B
Format:
Item-specific license agreed upon to submission
Description:
Collections