Title: Geometry of Textual Data Augmentation: Insights from Large Language Models
Authors: Feng, SJH; Lai, EMK; Li, W
Item type: Journal Article (Open Access)
Journal: Electronics (Switzerland), MDPI AG, 13(18), Article 3781. ISSN 2079-9292 (Print and Online)
Dates: issued 2024-09-23; accessioned/available 2024-10-15
DOI: 10.3390/electronics13183781
Handle: http://hdl.handle.net/10292/18134

Abstract: Data augmentation is crucial for enhancing the performance of text classification models when labelled training data are scarce. For natural language processing (NLP) tasks, large language models (LLMs) can generate high-quality augmented data, but a fundamental understanding of why they are so effective remains limited. This paper presents a geometric and topological perspective on textual data augmentation using LLMs. We compare augmentation data generated by GPT-J with data generated through cosine similarity from Word2Vec and GloVe embeddings. Topological data analysis reveals that GPT-J-generated data maintain label coherence. Convex hull analysis of these data, represented by their first two principal components, shows that they lie within the spatial boundaries of the original training data. Delaunay triangulation reveals that the number of augmented data points connected within these boundaries correlates with improved classification accuracy. These findings provide insight into the superior performance of LLMs in data augmentation, and the techniques could form the basis of a framework for predicting the usefulness of augmentation data from its geometric properties.

Rights: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Subjects: 40 Engineering; 4009 Electronics, Sensors and Digital Hardware; 0906 Electrical and Electronic Engineering
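To make the cosine-similarity baseline mentioned in the abstract concrete, here is a minimal sketch, not the authors' code: a word is replaced by its nearest neighbour in embedding space. A toy random matrix stands in for pretrained Word2Vec/GloVe vectors, and the vocabulary and replacement policy are illustrative assumptions.

```python
# Minimal sketch of cosine-similarity word replacement (illustrative only).
# A toy random matrix stands in for pretrained Word2Vec/GloVe embeddings.
import numpy as np

vocab = ["good", "great", "fine", "bad", "poor", "movie", "film"]
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(vocab), 50))            # placeholder embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit vectors: dot = cosine
word2idx = {w: i for i, w in enumerate(vocab)}

def nearest_neighbour(word: str) -> str:
    """Return the in-vocabulary word with the highest cosine similarity."""
    sims = emb @ emb[word2idx[word]]
    sims[word2idx[word]] = -np.inf  # exclude the word itself
    return vocab[int(np.argmax(sims))]

def augment(sentence: str) -> str:
    """Swap each known word for its nearest embedding-space neighbour."""
    return " ".join(nearest_neighbour(w) if w in word2idx else w
                    for w in sentence.split())

print(augment("good movie"))
```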
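The geometric analysis described in the abstract can also be sketched. The following is a minimal illustration, under assumptions, of the two measurements the paper names: convex hull membership of augmented points in the plane of the first two principal components, and a Delaunay triangulation over the combined points. The random arrays are placeholders for the sentence representations used in the paper.

```python
# Minimal sketch of the convex hull / Delaunay analysis (illustrative only).
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial import ConvexHull, Delaunay

rng = np.random.default_rng(0)
orig_emb = rng.normal(size=(200, 300))               # placeholder original embeddings
aug_emb = rng.normal(size=(50, 300), scale=0.8)      # placeholder augmented embeddings

# Project both sets with a PCA fitted on the original data only, so the
# augmented points are measured in the original data's coordinate frame.
pca = PCA(n_components=2).fit(orig_emb)
orig_2d = pca.transform(orig_emb)
aug_2d = pca.transform(aug_emb)

# Convex hull membership: Delaunay.find_simplex returns -1 for points
# outside the triangulated hull of the original projections.
tri = Delaunay(orig_2d)
inside = tri.find_simplex(aug_2d) >= 0
print(f"{inside.sum()}/{len(aug_2d)} augmented points lie within the original hull")

# Hull of the original data, e.g. for plotting the spatial boundary.
hull = ConvexHull(orig_2d)
print("hull vertices:", hull.vertices)

# Delaunay triangulation over the original plus in-hull augmented points; the
# paper relates the number of connected augmented points to accuracy.
combined = np.vstack([orig_2d, aug_2d[inside]])
combined_tri = Delaunay(combined)
print("simplices in combined triangulation:", len(combined_tri.simplices))
```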