# FrameQuant

This repository contains the code for FrameQuant, a post-training quantization (PTQ) algorithm for Transformer models. FrameQuant quantizes Transformers to ultra-low bitwidths (roughly 2 bits) while offering flexibility through fractional bitwidths. Please see our paper at [arXiv:2403.06082](https://arxiv.org/abs/2403.06082) for details on the algorithm and experiments.
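The fractional bitwidths come from the redundancy of the frame representation: the effective bitwidth is the base bitwidth scaled by the redundancy factor, so quantizing at $2$ bits with redundancy $1.1$ stores $2 \times 1.1 = 2.2$ bits per weight on average (the "FrameQuant $1.1$" rows in the results tables below).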


## Installation

All our dependencies are listed in `requirements.txt` (we also provide a Docker image with all of them preinstalled):

```bash
pip install -r requirements.txt
```

We use the Fast Hadamard Transform implementation from [Dao-AILab](https://github.com/Dao-AILab/fast-hadamard-transform) for our fast random projections. Follow their instructions to install the package, or run

```bash
bash install_fast_hadamard_tx.sh
```
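
To sanity-check the installation, here is a minimal sketch, assuming the package's `hadamard_transform(x, scale)` entry point, which transforms the last dimension of a CUDA tensor whose size is a power of two:

```python
import torch
from fast_hadamard_transform import hadamard_transform

d = 1024                                        # transform size; must be a power of two
x = torch.randn(4, d, device="cuda")
y = hadamard_transform(x, scale=d ** -0.5)      # scale 1/sqrt(d) makes the transform orthonormal
x_rec = hadamard_transform(y, scale=d ** -0.5)  # the orthonormal Hadamard transform is its own inverse
print(torch.allclose(x, x_rec, atol=1e-3))      # expect: True
```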

### Docker image

We also provide a Docker image on Docker Hub with all packages preinstalled. From within the FrameQuant directory, simply run

```bash
docker run --ipc=host --gpus all -it -v "$PWD:/workspace" harshauwm163/fq:0.97
```

to start using FrameQuant.

## Running FrameQuant

### Quantizing a model using FrameQuant

This repository contains the code for our core algorithm. We provide an implementation for quantizing Llama2 models (more models to come!). Our code is built on top of [GPTQ](https://github.com/IST-DASLab/gptq) and [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa). Here is the command to run FrameQuant on Llama2-7B:

```bash
# For a Llama model located at /data/hf-llama-2-7b, run:
python llama.py /data/hf-llama-2-7b c4 --eval --new-eval --tff_transform --tff_redundancy 1.0 --wbits 2 --save
```
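
Here, `--tff_redundancy` sets the frame redundancy and hence the effective bitwidth (with `--wbits 2`, redundancy `1.1` gives the 2.2-bit "FrameQuant $1.1$" rows in the tables below), and `--save` writes out the quantized checkpoint (`packed_model.ckpt`) used by the inference script below.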

### Inference with a model quantized using FrameQuant

Use the command below to run inference with the quantized model:

```bash
# /data/hf-llama-2-7b should contain the config files and the tokenizer for the original model.
# ./results/Llama_FQ should contain packed_model.ckpt, generated by the quantization script above.
python inference.py /data/hf-llama-2-7b ./results/Llama_FQ
```

## Experimental Results

Here is a comparison of FrameQuant with other PTQ methods.

**Validation accuracy of Quantized Vision Transformers on ImageNet-1K:**

*(ImageNet-1K results figure)*

**Performance of Post-Training Quantized (PTQ) LLMs from the Llama2 class on language modeling.**

*Perplexity on WikiText2:*

| Method | bits | 7B | 70B |
|---|---|---|---|
| Full-Precision | $16$ | $5.68$ | $3.32$ |
| GPTQ | $2$ | $6.4e3$ | $140.5$ |
| QuIP | $2$ | $26.02$ | $6.21$ |
| FrameQuant $1.0$ | $2$ | $14.85$ | $5.50$ |
| FrameQuant $1.1$ | $2.2$ | $8.48$ | $4.67$ |

*Perplexity on C4:*

| Method | bits | 7B | 70B |
|---|---|---|---|
| Full-Precision | $16$ | $7.26$ | $5.71$ |
| GPTQ | $2$ | $2265.09$ | $68.83$ |
| QuIP | $2$ | $26.61$ | $8.65$ |
| FrameQuant $1.0$ | $2$ | $19.62$ | $7.85$ |
| FrameQuant $1.1$ | $2.2$ | $11.23$ | $6.86$ |
**Performance of Post-Training Quantized (PTQ) LLMs from the OPT class on language modeling.**

*Perplexity on WikiText2:*

| Method | bits | 125M | 1.3B | 2.7B | 6.7B |
|---|---|---|---|---|---|
| Full-Precision | $16$ | $27.65$ | $14.62$ | $12.47$ | $10.86$ |
| GPTQ | $2$ | $5.7e3$ | $8.9e3$ | $9.1e3$ | $3.1e3$ |
| QuIP | $2$ | $913.0$ | $37.59$ | $22.86$ | $15.67$ |
| FrameQuant $1.0$ | $2$ | $345.7$ | $30.54$ | $20.67$ | $15.72$ |
| FrameQuant $1.1$ | $2.2$ | $\textbf{131.2}$ | $\textbf{22.68}$ | $\textbf{15.86}$ | $\textbf{13.53}$ |

*Perplexity on C4:*

| Method | bits | 125M | 1.3B | 2.7B | 6.7B |
|---|---|---|---|---|---|
| Full-Precision | $16$ | $26.56$ | $16.07$ | $14.34$ | $12.71$ |
| GPTQ | $2$ | $2203.89$ | $4139.91$ | $4058.91$ | $528.41$ |
| QuIP | $2$ | $543.62$ | $28.91$ | $21.49$ | $16.92$ |
| FrameQuant $1.0$ | $2$ | $226.17$ | $27.90$ | $20.74$ | $17.28$ |
| FrameQuant $1.1$ | $2.2$ | $\textbf{91.29}$ | $\textbf{22.39}$ | $\textbf{17.75}$ | $\textbf{15.33}$ |
### Downstream performance of FrameQuant

LLMs quantized with FrameQuant also perform better on downstream tasks. We use [LM-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate a Llama2-7B model quantized with each method, and FrameQuant achieves the best accuracy among the quantized models on all tasks (a sketch of such an evaluation follows the table below).

| Method | bits | ARC (challenge) | ARC (easy) | BoolQ | HellaSwag | PIQA | WinoGrande |
|---|---|---|---|---|---|---|---|
| Full-Precision | $16$ | $43.43$ | $76.35$ | $77.71$ | $57.16$ | $78.07$ | $69.06$ |
| GPTQ | $2$ | $22.44$ | $24.58$ | $41.19$ | $25.93$ | $51.85$ | $50.43$ |
| QuIP | $2$ | $22.27$ | $42.76$ | $50.31$ | $34.04$ | $61.75$ | $52.64$ |
| FrameQuant $1.0$ | $2$ | $23.98$ | $55.39$ | $63.52$ | $36.76$ | $66.65$ | $55.80$ |
| FrameQuant $1.1$ | $2.2$ | $\textbf{31.91}$ | $\textbf{65.53}$ | $\textbf{67.95}$ | $\textbf{46.46}$ | $\textbf{73.07}$ | $\textbf{63.61}$ |
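
For reference, here is a minimal sketch of such an evaluation using lm-eval-harness's Python API (v0.4-style `simple_evaluate`; the exact harness version may differ, and the `pretrained=` path below is a stand-in, since a packed FrameQuant checkpoint must first be loaded through this repo's inference code):

```python
# Hedged sketch: downstream evaluation with lm-eval-harness (v0.4 API).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # HuggingFace causal-LM backend
    model_args="pretrained=/data/hf-llama-2-7b",  # stand-in model path
    tasks=["arc_challenge", "arc_easy", "boolq",
           "hellaswag", "piqa", "winogrande"],
)
print(results["results"])                         # per-task accuracy dict
```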

## Quantized models on Huggingface

The quantized models are available on Huggingface:

| model | bitwidth | link |
|---|---|---|
| Llama2-7B | 2 bits | https://huggingface.co/uw-madison/Llama2-7B-FrameQuant-2bit |
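
To fetch a checkpoint programmatically, a minimal sketch using `huggingface_hub` (assuming the repo hosts the packed checkpoint files; pass the downloaded directory to `inference.py` as shown above):

```python
from huggingface_hub import snapshot_download

# Download the quantized checkpoint repo to a local cache directory.
local_dir = snapshot_download("uw-madison/Llama2-7B-FrameQuant-2bit")
print(local_dir)  # pass this directory to inference.py as shown above
```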

## Cite

Please cite our work if you find it interesting!

```bibtex
@InProceedings{adepuFQIcml24,
  author    = {Harshavardhan Adepu and Zhanpeng Zeng and Li Zhang and Vikas Singh},
  title     = {FrameQuant: Flexible Low-Bit Quantization for Transformers},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2024},
  month     = {July}
}
```