Evaluation of Statistical Text Normalisation Techniques for Twitter

aut.relation.endpage: 418
aut.relation.startpage: 413
aut.relation.volume: 1 (en_NZ)
dark.contributor.author: Sosamphan, P (en_NZ)
dark.contributor.author: Liesaputra, V (en_NZ)
dark.contributor.author: Yongchareon, S (en_NZ)
dark.contributor.author: Mohaghegh, M (en_NZ)
dc.date.accessioned: 2018-11-15T23:42:52Z
dc.date.available: 2018-11-15T23:42:52Z
dc.date.copyright: 2016 (en_NZ)
dc.date.issued: 2016 (en_NZ)
dc.description.abstract: One of the major challenges in the era of big data is how to 'clean' the vast amount of data, particularly from micro-blog websites like Twitter. Twitter messages, called tweets, are commonly ill-formed, containing abbreviations, repeated characters, and misspelled words. These 'noisy tweets' require text normalisation techniques to detect and convert them into more accurate English sentences. Several existing techniques have been proposed to solve these issues; however, each technique has some limitations and therefore cannot achieve good overall results. This paper aims to evaluate individual existing statistical normalisation methods and their possible combinations in order to find the best combination that can efficiently clean, at the character level, noisy tweets that contain abbreviations, repeated letters and misspelled words. Tested on our Twitter sample dataset, the best combination achieves a Bilingual Evaluation Understudy (BLEU) score of 88% and a Word Error Rate (WER) of 7%, both of which are considered better than the baseline model. (en_NZ)
dc.identifier.citation: In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016) ISBN 978-989-758-203-5, pages 413-418. DOI: 10.5220/0006083004130418
dc.identifier.doi: 10.5220/0006083004130418
dc.identifier.isbn: 9789897582035 (en_NZ)
dc.identifier.uri: https://hdl.handle.net/10292/12019
dc.publisher: SciTePress (en_NZ)
dc.relation.uri: http://www.scitepress.org/PublicationsDetail.aspx?ID=F6oA0NlwHc0=&t=1
dc.rights: The SciTePress Digital Library (Science and Technology Publications, Lda) is an open access repository that specializes in publishing conference proceedings.
dc.rights.accessrights: OpenAccess (en_NZ)
dc.subject: Lexical Normalisation (en_NZ)
dc.subject: Social Media (en_NZ)
dc.subject: Statistical Language Models (en_NZ)
dc.subject: Text Mining (en_NZ)
dc.subject: Text Normalisation (en_NZ)
dc.subject: Twitter (en_NZ)
dc.title: Evaluation of Statistical Text Normalisation Techniques for Twitter (en_NZ)
dc.type: Conference Contribution
pubs.elements-id: 217975
pubs.organisational-data: /AUT
pubs.organisational-data: /AUT/Design & Creative Technologies
pubs.organisational-data: /AUT/Design & Creative Technologies/Engineering, Computer & Mathematical Sciences

Files

Original bundle

Name: KDIR_2016_79.pdf
Size: 234.68 KB
Format: Adobe Portable Document Format
Description: Conference Contribution

License bundle

Name: RE4.10 Grant of Licence.docx
Size: 14.05 KB
Format: Microsoft Word 2007+