A predictive model to detect online cyberbullying

Kasture, Abhijeet Sudhir
Nand, Parma
Tegginmath, Shoba
Degree name
Master of Computer and Information Sciences
Auckland University of Technology

Cyberbullying is prevalent in most countries across the globe. The aim of this research was to develop a predictive model to identify cyberbullying tweets on Twitter. A paradigm shift in the Internet of Things was observed a decade ago, which resulted in enormous growth in the number of active Internet users; today, this number exceeds three billion. Social networking websites are classic examples of Internet applications with large numbers of active users. Twitter, for instance, is one of the best-known social networking portals, with more than 300 million active users at any given time. Unfortunately, it is also a stage for users who engage in unethical use of the Internet, such as cyberbullying. With such a staggering number of active users on the Internet, cyberbullying has become a widespread global phenomenon with extremely adverse effects on its victims. In some cases, victims have committed suicide in response to the shame and hatred associated with cyberbullying. In this research, 1313 unique tweets were collected from Twitter. In the first phase, 376 tweets were manually tagged as cyberbullying tweets, guided by psychological studies of individual behaviour and of the dialects associated with verbal aggressiveness. In the next phase, every word in a tweet was individually categorised based on the pragmatics of language. To achieve this, tweets were processed with Linguistic Inquiry and Word Count (LIWC), a psychometric evaluation tool that categorises text based on Linguistic Processes, Psychological Processes, Personal Concerns and Spoken Categories. Collectively, these add up to 67 word sub-categories. In the next step of the psychometric evaluation, LIWC calculated the degree to which different word categories were used by people in cyberbullying.
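The word-category scoring described above can be illustrated with a minimal sketch. It assumes a small, hypothetical three-category lexicon (`TOY_CATEGORIES`) standing in for the much larger LIWC dictionary, and it scores each category as the percentage of a tweet's words that fall into it. The category names, word lists and function name are illustrative assumptions, not taken from the thesis or from LIWC itself.

```python
import re

# Hypothetical toy lexicon: an assumption for illustration only.
# The real LIWC dictionary is far larger and is not reproduced here.
TOY_CATEGORIES = {
    "anger": {"hate", "stupid", "kill", "ugly"},
    "second_person": {"you", "your", "yours"},
    "positive_emotion": {"love", "great", "happy"},
}

def category_percentages(tweet):
    """Return, per category, the percentage of the tweet's words that belong to it."""
    words = re.findall(r"[a-z']+", tweet.lower())
    total = len(words)
    if total == 0:
        # An empty tweet scores zero in every category.
        return {name: 0.0 for name in TOY_CATEGORIES}
    return {
        name: 100.0 * sum(1 for w in words if w in vocab) / total
        for name, vocab in TOY_CATEGORIES.items()
    }

# One tweet becomes one numeric row, with one attribute per word category.
features = category_percentages("You are so stupid, I hate you")
```

Applying such a function to every tweet yields one numeric attribute per category, which is the general shape of dataset a classifier can be trained on.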
Psychometric evaluation therefore aided in effective text categorisation and in quantifying the degree of word usage, which was observed to be a gap in previous studies. As a result, tweets were converted to a multi-dimensional, attribute-relational numeric dataset that was rich in the information it carried. This dataset was then used to train machine learning classifiers in Weka to develop a predictive model to detect cyberbullying. The data were randomly split, with 66% used for training the predictive model and 34% for testing it. The Random Forest classifier built the predictive model with a precision value of 0.97, indicating that binary classifiers outperformed multiclass classifiers in detecting cyberbullying tweets.
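The evaluation protocol in the abstract, a random 66%/34% split followed by a precision score, can be sketched in plain Python. This is not the Weka workflow used in the thesis. The helper names `split_66_34` and `precision`, the fixed seed and the rounding rule for the cut point are assumptions made for illustration.

```python
import random

def split_66_34(rows, seed=0):
    """Shuffle a copy of the rows and split it 66% / 34% (train / test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    # Rounding to the nearest row is an assumption of this sketch.
    cut = round(len(shuffled) * 0.66)
    return shuffled[:cut], shuffled[cut:]

def precision(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP) for the class labelled `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0

# Row indices stand in for the 1313 collected tweets.
rows = list(range(1313))
train, test = split_66_34(rows)  # 867 training rows, 446 test rows
```

A precision of 0.97 under this definition means that 97% of the tweets the model flagged as cyberbullying were labelled as cyberbullying in the test data. It does not by itself say how many cyberbullying tweets were missed.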

Cyberbullying; Predictive analytics; Psychometric evaluation; Pragmatics of language; Twitter; Tweets; Text classification techniques; Psychometric analysis; Natural Language Processing; Linguistic Inquiry and Word Count (LIWC); TAGS archiving tool; WEKA; Random forest; Multilayer perceptron; Support vector machines; Decision trees; Verbal aggression