A Mixed Quantization Approach for Data-Free Quantization of LLMs

aut.relation.conference: International Conference on Agents and Artificial Intelligence (ICAART)
aut.relation.endpage: 363
aut.relation.startpage: 353
aut.relation.volume: 2
dc.contributor.author: Zhang, Feng
dc.contributor.author: Liu, Yanbin
dc.contributor.author: Li, Weihua
dc.contributor.author: Wang, Xiaodan
dc.contributor.author: Bai, Quan
dc.contributor.editor: Rocha, AP
dc.contributor.editor: Steels, L
dc.contributor.editor: van den Herik, HJ
dc.date.accessioned: 2025-03-19T20:08:11Z
dc.date.available: 2025-03-19T20:08:11Z
dc.date.issued: 2025-02-23
dc.description.abstract: Large Language Models (LLMs) have demonstrated significant capabilities in intelligent activities such as natural language comprehension, content generation, and knowledge retrieval. However, training and deploying these models require substantial computational resources, creating a significant barrier to developing AI applications and conducting research. Various model compression techniques have been developed to address this demand for computational resources. Nonetheless, there has been limited exploration of high-level quantization strategies that offer greater flexibility in balancing the trade-off between memory usage and accuracy. We propose an effective mixed-quantization method named MXQ to bridge this research gap and achieve a better memory-accuracy balance. Specifically, we observe that the weight distributions of LLMs vary considerably from layer to layer, resulting in different tolerances to quantization errors. Motivated by this, we derive a novel quantization optimisation formulation that solves for the layer-wise quantization parameters while enforcing the overall quantization memory budget as a constraint. The new formulation can be solved efficiently by converting it to a mixed integer programming problem (see the illustrative sketch at the end of this record). Experiments show that our method can achieve the 1% accuracy-loss goal with an additional bit budget, or further reduce memory usage, on Llama models. This unlocks a wide range of quantization options and simplifies the memory-accuracy trade-off.
dc.identifier.citation: In Proceedings of the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025) - Volume 2, pages 353-363.
dc.identifier.doi: 10.5220/0013159100003890
dc.identifier.isbn: 9789897587375
dc.identifier.issn: 2184-433X
dc.identifier.uri: http://hdl.handle.net/10292/18885
dc.publisher: SCITEPRESS – Science and Technology Publications, Lda.
dc.relation.uri: https://www.scitepress.org/publishedPapers/2025/131591/
dc.rights: Paper published under CC license (CC BY-NC-ND 4.0)
dc.rights.accessrights: OpenAccess
dc.title: A Mixed Quantization Approach for Data-Free Quantization of LLMs
dc.type: Conference Contribution
pubs.elements-id: 593386

Files

Original bundle

Name: Zhang et al_2025_A mixed quantization approach for data-free quantization of LLMs.pdf
Size: 395.36 KB
Format: Adobe Portable Document Format
Description: Conference contribution
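
As the abstract notes, MXQ casts layer-wise quantization parameter selection under an overall memory budget as a mixed integer program. The sketch below is a minimal, generic illustration of such a budgeted bit-allocation problem, not the paper's actual MXQ formulation; it assumes the PuLP package for MIP solving, and the layer sizes, candidate bit-widths, and error proxy are invented placeholders.

# Minimal illustrative sketch (not the paper's MXQ formulation): pick one
# bit-width per layer so that an estimated quantization error is minimised
# while total storage stays within an average-bit budget, solved as a
# mixed integer program with PuLP. All numbers below are placeholders.
import pulp

layer_params = [4096 * 4096] * 8   # hypothetical per-layer parameter counts
bit_options = [2, 3, 4, 8]         # candidate bit-widths per layer
avg_bit_budget = 3.5               # target average bits per weight

def est_error(layer: int, bits: int) -> float:
    # Toy error proxy: fewer bits -> larger assumed error, scaled by layer size.
    return layer_params[layer] * 2.0 ** (-bits)

prob = pulp.LpProblem("layerwise_bit_allocation", pulp.LpMinimize)
# x[(l, b)] == 1 means layer l is quantized with b bits.
x = pulp.LpVariable.dicts(
    "x", [(l, b) for l in range(len(layer_params)) for b in bit_options],
    cat="Binary")

# Objective: total estimated quantization error across layers.
prob += pulp.lpSum(est_error(l, b) * x[(l, b)]
                   for l in range(len(layer_params)) for b in bit_options)

# Each layer is assigned exactly one bit-width.
for l in range(len(layer_params)):
    prob += pulp.lpSum(x[(l, b)] for b in bit_options) == 1

# Overall memory constraint: total stored bits stay within the budget.
prob += (pulp.lpSum(layer_params[l] * b * x[(l, b)]
                    for l in range(len(layer_params)) for b in bit_options)
         <= avg_bit_budget * sum(layer_params))

prob.solve(pulp.PULP_CBC_CMD(msg=False))
chosen = {l: b for l in range(len(layer_params)) for b in bit_options
          if pulp.value(x[(l, b)]) > 0.5}
print(chosen)  # per-layer bit-width assignment, e.g. {0: 4, 1: 3, ...}

Tightening avg_bit_budget in this toy model trades accuracy (higher estimated error) for memory, which mirrors the memory-accuracy balance the abstract describes.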