A Geometric Approach to Textual Augmented Data Filtering

Date
2024-09-09
Authors
Feng, SJH
Lai, EMK
Li, W
Item type
Conference Contribution
Publisher
IOP Publishing
Abstract

Data augmentation is necessary when the amount of training data is insufficient for supervised learning. For natural language processing tasks, obtaining good-quality augmented data is not easy. This paper introduces GATFilter, a novel method for filtering out inappropriate augmented textual data for text classification (TC). Using geometric concepts, specifically principal component analysis and convex hull analysis, the method preserves the semantic integrity of words within augmented texts. GATFilter is versatile and applicable across various types of textual augmentation methods. Experiments on several datasets and augmentation strategies showed that classifiers trained on GATFilter-filtered augmented data improved in key performance metrics, including accuracy, precision, recall, and F1 score. The method’s efficacy is notably influenced by the quality of the underlying augmentation technique, indicating its potential to complement and refine various text augmentation strategies. Furthermore, our analysis showed that GATFilter is particularly effective at amplifying methods that generate good-quality augmented data. GATFilter is openly available on GitHub and as a Python package.
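
The abstract names principal component analysis and convex hull analysis but does not spell out the algorithm, so the following is only an illustrative sketch of how the two ideas can be combined into a filter. It is not the GATFilter implementation or its package API. It assumes each text has already been mapped to an embedding vector, projects everything onto the first two principal components of the original data (the choice of two dimensions is an assumption), and keeps an augmented sample only if its projection falls inside the convex hull of the original samples. All function names are hypothetical.

```python
import numpy as np


def pca_project(reference, points, n_components=2):
    """Project `points` onto the top principal components of `reference`."""
    mean = reference.mean(axis=0)
    # Rows of vt are the principal directions of the centred reference data.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return (points - mean) @ vt[:n_components].T


def convex_hull_2d(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return np.array(pts)

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            # Drop points that would make a clockwise or straight turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return np.array(half(pts) + half(reversed(pts)))


def inside_hull(hull, points, tol=1e-9):
    """Boolean mask: True where a point lies inside or on the hull."""
    edges = np.roll(hull, -1, axis=0) - hull
    rel = points[:, None, :] - hull[None, :, :]
    # For a counter-clockwise polygon, interior points are left of every edge.
    turn = edges[None, :, 0] * rel[:, :, 1] - edges[None, :, 1] * rel[:, :, 0]
    return np.all(turn >= -tol, axis=1)


def filter_augmented(original_emb, augmented_emb):
    """Keep augmented samples whose 2-D projection lies in the original hull."""
    hull = convex_hull_2d(pca_project(original_emb, original_emb))
    return inside_hull(hull, pca_project(original_emb, augmented_emb))


# Toy embeddings (made up for illustration): a 4 x 2 rectangle in 3-D space.
original = np.array([[0, 0, 0], [4, 0, 0], [4, 2, 0], [0, 2, 0], [2, 1, 0]], float)
augmented = np.array([[3.0, 0.5, 0.0], [6.0, 3.0, 0.0]])
print(filter_augmented(original, augmented).tolist())  # [True, False]
```

A two-dimensional projection keeps the hull test cheap but discards variation along the remaining components, so a sample that differs from the originals only in a discarded direction would still pass this toy filter.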

Keywords
51 Physical Sciences; 0202 Atomic, Molecular, Nuclear, Particle and Plasma Physics; 0204 Condensed Matter Physics; 0299 Other Physical Sciences
Rights statement
Content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.