
Towards Superior Quantization for Large Language Models

aut.embargo: No
aut.thirdpc.contains: Yes
aut.thirdpc.permission: Yes
dc.contributor.advisor: Li, Weihua
dc.contributor.advisor: Liu, Yanbin
dc.contributor.author: Zhang, Feng
dc.date.accessioned: 2025-05-12T02:24:16Z
dc.date.available: 2025-05-12T02:24:16Z
dc.date.issued: 2025
dc.description.abstract: Large Language Models (LLMs) have exhibited remarkable capabilities in tasks such as natural language comprehension, content generation, and knowledge retrieval. However, training and serving these models require substantial computational resources, posing a significant barrier to AI application development and research. To address these challenges, various model compression techniques have been explored, with quantization emerging as a key approach. Nonetheless, existing quantization methods predominantly apply uniform quantization configurations, failing to account for the varying quantization difficulty across different layers in billion-scale models. This results in a rigid memory-accuracy trade-off and leaves the potential for improving quantization accuracy through differentiated memory allocation largely unexplored. To bridge these research gaps, this thesis advances the study of LLM quantization with two key contributions. First, it introduces MXQ, a mixed-quantization method designed to provide a more flexible memory-accuracy balance. MXQ formulates a novel optimization approach to determine optimal layer-wise quantization parameters while enforcing overall memory constraints. Experimental results demonstrate that MXQ unlocks a broader spectrum of quantization configurations, simplifying the memory-accuracy trade-off while maintaining performance comparable to the baseline. Second, this thesis proposes SensiBoost and KurtBoost, two methods that enhance quantization accuracy by leveraging layer-sensitive features such as activation sensitivity and weight distribution kurtosis to identify critical layers. These approaches outperform existing baselines, achieving up to 9% lower perplexity with only a 2% increase in memory budget on Llama models.
dc.identifier.uri: http://hdl.handle.net/10292/19178
dc.language.iso: en
dc.publisher: Auckland University of Technology
dc.rights.accessrights: OpenAccess
dc.title: Towards Superior Quantization for Large Language Models
dc.type: Thesis
thesis.degree.grantor: Auckland University of Technology
thesis.degree.name: Master of Computer and Information Sciences
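
Note: the abstract above describes two ideas that a small example can make concrete: allocating layer-wise bit-widths under an overall memory constraint, and using weight-distribution kurtosis to flag "difficult" layers that deserve extra budget. The sketch below is a minimal, hypothetical illustration of that general idea only; it is not the MXQ, SensiBoost, or KurtBoost implementation from the thesis. The function names (excess_kurtosis, allocate_bits), the greedy allocation rule, and all parameter values are assumptions introduced here for illustration.

    # Illustrative sketch (not the thesis code): rank layers by weight kurtosis
    # and promote the most heavy-tailed layers to a higher bit-width while the
    # total weight memory stays within a small extra budget.
    import numpy as np

    def excess_kurtosis(w):
        """Excess kurtosis of a flattened weight tensor (about 0 for Gaussian weights)."""
        w = np.asarray(w, dtype=np.float64).ravel()
        mu, sigma = w.mean(), w.std()
        return float(((w - mu) ** 4).mean() / sigma ** 4 - 3.0)

    def allocate_bits(layers, base_bits=4, boost_bits=8, budget_ratio=1.02):
        """Assign base_bits to every layer, then upgrade the highest-kurtosis layers
        to boost_bits while staying within budget_ratio times the uniform
        base_bits footprint (budget_ratio=1.02 means a 2% extra budget)."""
        sizes = {name: w.size for name, w in layers.items()}
        base_cost = sum(sizes.values()) * base_bits   # total bits at uniform precision
        budget = base_cost * budget_ratio
        bits = {name: base_bits for name in layers}
        cost = base_cost
        # Greedily boost the most heavy-tailed (hardest-to-quantize) layers first.
        for name in sorted(layers, key=lambda n: excess_kurtosis(layers[n]), reverse=True):
            extra = sizes[name] * (boost_bits - base_bits)
            if cost + extra <= budget:
                bits[name] = boost_bits
                cost += extra
        return bits

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        toy_layers = {
            "layer0": rng.standard_normal((256, 256)),        # near-Gaussian weights
            "layer1": rng.standard_t(df=3, size=(256, 256)),  # heavy-tailed weights
            "layer2": rng.standard_normal((256, 256)),
        }
        # This toy model has only three equally sized layers, so a loose 40% budget
        # is used; in a billion-parameter model a ~2% budget can already cover the
        # few layers flagged as difficult.
        print(allocate_bits(toy_layers, budget_ratio=1.40))

In this toy run only the heavy-tailed layer is promoted to 8-bit weights. The point is the mechanism the abstract gestures at, rank layers by a difficulty signal and spend a small additional budget on the top of the ranking, not the specific numbers or the thesis's actual optimization formulation.
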

Files

Original bundle

Name: ZhangF.pdf
Size: 2.95 MB
Format: Adobe Portable Document Format
Description: Thesis

License bundle

Name: license.txt
Size: 890 B
Format: Item-specific license agreed upon to submission
Description:
Collections