GPTQ

A quantization method that has been gaining popularity is GPTQ, which performs post-training quantization of language models. GPTQ was introduced in the paper “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers” (first released on arXiv in October 2022 and presented at ICLR 2023). The name GPTQ stands for Generative Post-Training Quantization.

GPTQ adopts a quantization scheme where weights are quantized to int4 while activations remain in float16. During inference, weights are dequantized on the fly and the actual compute is performed in float16. The GPTQ paper reports that a 175B-parameter GPT model can be quantized in around 4 GPU hours on a single A100 with 80 GB of memory.
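
To make the weight-only scheme concrete, here is a minimal, illustrative sketch of the W4A16 idea (this is not the GPTQ algorithm or its fused kernels, and the function names are made up for illustration): weights are stored as 4-bit integers with a per-row scale and zero point, and are dequantized on the fly to the activation dtype right before the matmul.

    import torch

    def quantize_per_channel_4bit(w):
        # Naive round-to-nearest 4-bit quantization with one scale/zero-point per output row.
        qmin, qmax = 0, 15
        w_min = w.min(dim=1, keepdim=True).values
        w_max = w.max(dim=1, keepdim=True).values
        scale = (w_max - w_min).clamp(min=1e-8) / (qmax - qmin)
        zero_point = torch.round(-w_min / scale)
        q = torch.clamp(torch.round(w / scale + zero_point), qmin, qmax).to(torch.uint8)
        return q, scale, zero_point

    def w4a16_linear(x, q, scale, zero_point):
        # Dequantize the stored int4 weights on the fly; the matmul runs in the activation dtype.
        w_deq = ((q.to(scale.dtype) - zero_point) * scale).to(x.dtype)
        return x @ w_deq.t()

    compute_dtype = torch.float16 if torch.cuda.is_available() else torch.float32  # fp16 matmul needs a GPU
    device = "cuda" if torch.cuda.is_available() else "cpu"

    w = torch.randn(256, 128, device=device)                      # full-precision weight [out, in]
    x = torch.randn(4, 128, device=device, dtype=compute_dtype)   # activations stay in float16
    q, scale, zp = quantize_per_channel_4bit(w)
    y = w4a16_linear(x, q, scale, zp)                             # output shape [4, 256]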

To perform quantization, the GPTQ paper notes that the objective for each layer with weight matrix $W$ and layer inputs $X$ is to find a matrix of quantized weights $\hat{W}$ which minimizes the squared error relative to the full-precision layer output, i.e. \(\text{argmin}_{\hat{W}}\| WX - \hat{W}X \|_{2}^{2}\)
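
This objective is easy to evaluate numerically. The sketch below (the helper name is my own) measures the reconstruction error for a simple round-to-nearest candidate $\hat{W}$; GPTQ's contribution is an efficient second-order procedure that finds a $\hat{W}$ with much lower error than this baseline, but the quantity being minimized is the same.

    import torch

    def layer_reconstruction_error(W, W_hat, X):
        # ||W X - W_hat X||_2^2, measured on a batch of calibration inputs X.
        return torch.sum((W @ X - W_hat @ X) ** 2).item()

    W = torch.randn(64, 32)                      # full-precision layer weights
    X = torch.randn(32, 16)                      # columns are calibration examples
    scale = W.abs().max() / 7                    # crude symmetric 4-bit grid, int range [-8, 7]
    W_rtn = torch.clamp(torch.round(W / scale), -8, 7) * scale
    print(layer_reconstruction_error(W, W_rtn, X))   # GPTQ aims for a W_hat with lower error than this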

When to use bitsandbytes vs GPTQ?

While GPTQ is able to quantize pretrained language models to 4-bit, note that the bitsandbytes library can also load a pretrained model in 4-bit. So a natural question is: when should I use GPTQ vs bitsandbytes? Here are some of their differences:

  • While bitsandbytes allows loading a pretrained model in 4-bit, it (currently) cannot serialize or save the quantized model to disk.
  • A recent blog post from Hugging Face shows that inference/generation with 4-bit bitsandbytes is slower than inference with a 4-bit GPTQ-quantized model.
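
For reference, loading a model in 4-bit with bitsandbytes through transformers is just a configuration flag at load time; the model id below is a placeholder, and any causal LM on the Hub works the same way.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "facebook/opt-350m"  # placeholder model id

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NormalFloat4, as used by QLoRA
        bnb_4bit_compute_dtype=torch.bfloat16  # compute dtype for the dequantized matmuls
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )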

To summarize, a recommended approach is the following (an end-to-end usage sketch follows after the list):

  • Load a base model (e.g. LLaMA) in 16-bit or 8-bit precision.
  • Use LoRA to fine-tune adapter modules in 8-bit (or QLoRA if you want to fine-tune in 4-bit), saving the resulting adapter weights to a directory, e.g. <adapter_dir>
  • Use code similar to the following to load the adapter, merge it into the base model, and then save the merged model (of course, you can also merge and save directly after running LoRA)
      import torch
      from peft import PeftConfig, PeftModel
      from transformers import AutoModelForCausalLM, AutoTokenizer

      def merge_and_save(adapter_dir, merged_model_dir):
          # Read the adapter config to find the base model it was fine-tuned from
          config = PeftConfig.from_pretrained(adapter_dir)

          # Load the base model and its tokenizer in full (bfloat16) precision
          model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, torch_dtype=torch.bfloat16, device_map="auto")
          tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

          # Attach the LoRA adapter, then fold its weights into the base model
          model = PeftModel.from_pretrained(model, adapter_dir)
          merged_model = model.merge_and_unload()

          # Save the merged model and tokenizer
          merged_model.save_pretrained(merged_model_dir)
          tokenizer.save_pretrained(merged_model_dir)
    
  • Finally, use code similar to the following to load the merged model and quantize it with GPTQ (this requires the optimum and auto-gptq packages).
      from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

      def apply_gptq(merged_model_dir, gptq_model_dir):
          tokenizer = AutoTokenizer.from_pretrained(merged_model_dir)

          # 4-bit GPTQ, calibrated on the "c4" dataset supported by the integration
          quantization_config = GPTQConfig(
              bits=4,
              dataset="c4",
              desc_act=False,
              tokenizer=tokenizer,
          )

          # Quantization runs while the model is being loaded
          quant_model = AutoModelForCausalLM.from_pretrained(
              merged_model_dir, quantization_config=quantization_config,
              device_map="auto"
          )

          # Save the quantized model
          quant_model.save_pretrained(gptq_model_dir, safe_serialization=True)
          tokenizer.save_pretrained(gptq_model_dir)
    
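
Putting the pieces together, here is the end-to-end usage sketch referenced above: it calls the two helper functions from the list and then reloads the quantized model for generation. The directory names are placeholders, and loading the saved GPTQ model again requires auto-gptq/optimum to be installed.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    adapter_dir = "out/lora-adapter"        # adapter weights saved by the LoRA/QLoRA fine-tuning step
    merged_model_dir = "out/merged-model"
    gptq_model_dir = "out/gptq-model"

    merge_and_save(adapter_dir, merged_model_dir)
    apply_gptq(merged_model_dir, gptq_model_dir)

    # Reload the GPTQ-quantized model; the saved config tells transformers it is GPTQ-quantized.
    tokenizer = AutoTokenizer.from_pretrained(gptq_model_dir)
    model = AutoModelForCausalLM.from_pretrained(gptq_model_dir, device_map="auto")

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))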

For more information, check out the following:

  • https://huggingface.co/blog/gptq-integration
  • https://huggingface.co/blog/overview-quantization-transformers
Written on August 13, 2023