Effect of Imbalanced Data on Document Classification Algorithms

Paul, Amrita

aut.embargo	No	en_NZ
aut.thirdpc.contains	No	en_NZ
aut.thirdpc.permission	No	en_NZ
aut.thirdpc.removed	No	en_NZ
dc.contributor.advisor	Nand, Parma
dc.contributor.author	Paul, Amrita
dc.date.accessioned	2014-07-08T04:09:31Z
dc.date.available	2014-07-08T04:09:31Z
dc.date.copyright	2014
dc.date.created	2014
dc.date.issued	2014
dc.date.updated	2014-07-08T03:32:41Z
dc.description.abstract	Text classification is the task of assigning predefined categories to free-text documents. Because of the ever-increasing volume of electronic documents, digital libraries and web resources, document classification is critical to higher-level document processing tasks such as information extraction, named entity recognition and event modelling. Text categorization is considered challenging because of the large number of features in a typical text document. In spite of this, various categorization algorithms have reached accuracies in the vicinity of 90%. It has generally been found that probability-based algorithms perform better on Natural Language Processing tasks than other types of algorithms, and they are also highly extensible. In this thesis, a tool called MALLET (MAchine Learning for LanguagE Toolkit) was used to perform document classification with a set of probabilistic algorithms, in order to determine how imbalanced data affects the performance of these algorithms compared with balanced data. The data used for the research was taken from the Reuters Corpus (RCV1), which contains categorized newspaper articles. Although the corpus contains many fine levels of categorization, this research used four upper-level topic codes, each of which was further organized into a binary categorization: a document either belongs to the category or does not. The documents were then converted into a form acceptable to MALLET and tested for categorization with the chosen algorithms. The algorithms used for the research were Naïve Bayes, Balanced Winnow and three variations of Max Ent, namely Max Ent, Max Ent L1 and MC Max Ent. It was first found that these probability-based algorithms performed marginally better than other algorithms reported in previous work on a similar genre of input data.
However, a significant finding of the research was that the algorithms performed similarly, or in some cases even better, on imbalanced data compared with balanced data. This was due to the vocabulary properties of the documents used for training, and it attests to the resilience of probability-based algorithms for text categorization.	en_NZ
dc.identifier.uri	https://hdl.handle.net/10292/7413
dc.language.iso	en	en_NZ
dc.publisher	Auckland University of Technology
dc.rights.accessrights	OpenAccess
dc.subject	Classify	en_NZ
dc.subject	Texts	en_NZ
dc.title	Effect of Imbalanced Data on Document Classification Algorithms	en_NZ
dc.type	Thesis
thesis.degree.discipline
thesis.degree.grantor	Auckland University of Technology
thesis.degree.level	Masters Theses
thesis.degree.name	Master of Computer and Information Sciences	en_NZ

Files

Original bundle

Name: PaulA.pdf
Size: 1.57 MB
Format: Adobe Portable Document Format
Description: Whole thesis

License bundle

Name: license.txt
Size: 897 B
Format: Item-specific license agreed upon to submission
Description:

Collections

Masters Theses