
Towards Superior Quantization for Large Language Models

Supervisor

Li, Weihua
Liu, Yanbin

Item type

Thesis

Degree name

Master of Computer and Information Sciences

Publisher

Auckland University of Technology

Abstract

Large Language Models (LLMs) have exhibited remarkable capabilities in tasks such as natural language comprehension, content generation, and knowledge retrieval. However, training and serving these models require substantial computational resources, posing a significant barrier to AI application development and research. To address these challenges, various model compression techniques have been explored, with quantization emerging as a key approach. Nonetheless, existing quantization methods predominantly apply uniform quantization configurations, failing to account for the varying quantization difficulty across different layers in billion-scale models. This results in a rigid memory-accuracy trade-off and leaves the potential for improving quantization accuracy through differentiated memory allocation largely unexplored. To bridge these research gaps, this thesis advances the study of LLM quantization with two key contributions. First, it introduces MXQ, a mixed-quantization method designed to provide a more flexible memory-accuracy balance. MXQ formulates a novel optimization approach to determine optimal layer-wise quantization parameters while enforcing overall memory constraints. Experimental results demonstrate that MXQ unlocks a broader spectrum of quantization configurations, simplifying the memory-accuracy trade-off while maintaining performance comparable to the baseline. Second, this thesis proposes SensiBoost and KurtBoost, two methods that enhance quantization accuracy by leveraging layer-sensitive features such as activation sensitivity and weight distribution kurtosis to identify critical layers. These approaches outperform existing baselines, achieving up to 9% lower perplexity with only a 2% increase in memory budget on Llama models.
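
To make the kurtosis-based idea concrete, the short Python sketch below shows one possible way to rank layers by the excess kurtosis of their weight distributions and grant the most heavy-tailed layers a higher bit-width within a small extra memory budget. This is only an illustrative sketch, not the thesis's implementation: the function names (excess_kurtosis, boost_layers), the 4-bit/8-bit choice, the layer names, and the budget heuristic are assumptions made here for demonstration.

```python
# Illustrative sketch (not the thesis's method): rank layers by weight kurtosis
# and boost the bit-width of the most outlier-heavy layers under a memory cap.
import numpy as np

def excess_kurtosis(w: np.ndarray) -> float:
    """Fourth standardized moment minus 3 (0 for a Gaussian distribution)."""
    w = w.ravel().astype(np.float64)
    mu, sigma = w.mean(), w.std()
    return float(((w - mu) ** 4).mean() / sigma**4 - 3.0)

def boost_layers(layers: dict[str, np.ndarray],
                 base_bits: int = 4,
                 boost_bits: int = 8,
                 extra_budget: float = 0.02) -> dict[str, int]:
    """Assign boost_bits to the highest-kurtosis layers until the extra memory,
    relative to an all-base_bits model, would exceed extra_budget."""
    base_mem = sum(w.size for w in layers.values()) * base_bits
    ranked = sorted(layers, key=lambda n: excess_kurtosis(layers[n]), reverse=True)

    bits = {name: base_bits for name in layers}
    extra = 0
    for name in ranked:
        delta = layers[name].size * (boost_bits - base_bits)
        if (extra + delta) / base_mem > extra_budget:
            break
        bits[name] = boost_bits
        extra += delta
    return bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 100 toy layers of equal size; layers 7 and 42 are heavy-tailed (high kurtosis).
    demo = {f"layer.{i}.weight": rng.normal(size=2048) for i in range(100)}
    demo["layer.7.weight"] = rng.standard_t(df=3, size=2048)
    demo["layer.42.weight"] = rng.standard_t(df=3, size=2048)
    bits = boost_layers(demo)
    # The two heavy-tailed layers should be the ones promoted to 8 bits.
    print([name for name, b in bits.items() if b == 8])
```

With a 2% extra budget and 100 equal-sized layers, only the two most heavy-tailed layers fit within the cap, which mirrors the abstract's point that a small, targeted memory increase on critical layers can be spent where quantization is hardest.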
