Geometry of Textual Data Augmentation: Insights from Large Language Models

aut.relation.endpage: 3781
aut.relation.issue: 18
aut.relation.journal: Electronics (Switzerland)
aut.relation.startpage: 3781
aut.relation.volume: 13
dc.contributor.author: Feng, SJH
dc.contributor.author: Lai, EMK
dc.contributor.author: Li, W
dc.date.accessioned: 2024-10-15T01:06:11Z
dc.date.available: 2024-10-15T01:06:11Z
dc.date.issued: 2024-09-23
dc.description.abstract: Data augmentation is crucial for enhancing the performance of text classification models when labelled training data are scarce. For natural language processing (NLP) tasks, large language models (LLMs) are able to generate high-quality augmented data. But a fundamental understanding of the reasons for their effectiveness remains limited. This paper presents a geometric and topological perspective on textual data augmentation using LLMs. We compare the augmentation data generated by GPT-J with those generated through cosine similarity from Word2Vec and GloVe embeddings. Topological data analysis reveals that GPT-J generated data maintains label coherence. Convex hull analysis of such data represented by their two principal components shows that they lie within the spatial boundaries of the original training data. Delaunay triangulation reveals that increasing the number of augmented data points that are connected within these boundaries correlates with improved classification accuracy. These findings provide insights into the superior performance of LLMs in data augmentation. A framework for predicting the usefulness of augmentation data based on geometric properties could be formed based on these techniques.
dc.identifier.citation: Electronics (Switzerland), ISSN: 2079-9292 (Print); 2079-9292 (Online), MDPI AG, 13(18), 3781-3781. doi: 10.3390/electronics13183781
dc.identifier.doi: 10.3390/electronics13183781
dc.identifier.issn: 2079-9292
dc.identifier.issn: 2079-9292
dc.identifier.uri: http://hdl.handle.net/10292/18134
dc.language: en
dc.publisher: MDPI AG
dc.relation.uri: https://www.mdpi.com/2079-9292/13/18/3781
dc.rights: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
dc.rights.accessrights: OpenAccess
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: 40 Engineering
dc.subject: 4009 Electronics, Sensors and Digital Hardware
dc.subject: 0906 Electrical and Electronic Engineering
dc.subject: 4009 Electronics, sensors and digital hardware
dc.title: Geometry of Textual Data Augmentation: Insights from Large Language Models
dc.type: Journal Article
pubs.elements-id: 571156

Files

Original bundle

Name: Feng et al_2024_Geometry of textual data augmentation.pdf
Size: 3.87 MB
Format: Adobe Portable Document Format
Description: Journal article