Performing Sentiment Analysis on Large Email Text Data
Companies and organizations all over the world aim to progress and prosper, and anyone who shares that aim is expected to know how the company is currently doing, which can be learned from live data. One source of such live data is email. By analysing email data, which comprises chains of conversation between the employees of the company and its clients, one can judge how well the company is progressing. But analysing such a large volume of data manually is tiresome, time consuming and prone to error. Sentiment analysis, a domain under Natural Language Processing, can address this issue: using it, we can make such a judgment about the progress of the company or organization. The purpose of this thesis is to identify the most efficient algorithm for performing sentiment analysis, with the best precision, on a large data set of email data. The thesis first explains the basic concepts of sentiment analysis and then presents a model which performs sentiment analysis on an email data set. Drawbacks of the current model are observed, and either the model is improved or a new model is developed to address those drawbacks. Every new model introduces something new, either in how the data is handled or in the use of better classification algorithms, and gives better performance values than the previous model. Performance is measured in terms of precision, recall and accuracy. The thesis first demonstrates an algorithm that shows how sentiment analysis is performed using supervised learning. The next model builds on this one and uses a larger email data set. The first model uses a simple K-nearest neighbours classifier to obtain the performance measures. The next few models are built to improve these values by using different classifiers and new features such as Named Entity Recognition and vectorization.
To achieve higher values, a model was implemented using artificial neural networks and their derivatives such as LSTM. Finally, a domain-agnostic model built on a bidirectional LSTM gave the best values, and this is the model presented as the best. The model also implements a few features, such as Word2vec embedding and Dask, to improve efficiency at run time. The literature survey section shows how studying work conducted by others in the same domain enabled me to come up with the models. The thesis follows an experimental quantitative approach in which models are experimented with and a better model is prepared to improve the performance measures. A section is also presented to explain the various concepts, algorithms and formulas used. The thesis concludes by presenting the best model for performing sentiment analysis on the large data set and explaining why it is the best. The advantages and strengths of the model are discussed.
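The abstract names precision, recall and accuracy as the performance measures. As a minimal sketch of their standard definitions, assuming a two-class positive/negative labelling, the following shows how they can be computed from true and predicted labels. The function name `binary_metrics` and the sample labels are hypothetical and used only for illustration; this is not the evaluation code of the thesis.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Return (precision, recall, accuracy), treating `positive` as the target class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted positive but actually something else.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: actually positive but predicted something else.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Hypothetical labels for eight emails (illustrative only, not data from the thesis).
y_true = ["positive"] * 4 + ["negative"] * 4
y_pred = ["positive", "positive", "positive", "negative",
          "positive", "positive", "negative", "negative"]

precision, recall, accuracy = binary_metrics(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

With these sample labels there are 3 true positives, 2 false positives and 1 false negative, so precision (3/5) and recall (3/4) differ, which is why the three measures are reported together rather than accuracy alone.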