Step-by-Step

This document presents step-by-step instructions for AutoRound LLM quantization.

1. Prerequisite

Install auto-round via pip, or install it from source:

pip install auto-round

2. Prepare Calibration Dataset

Default Dataset

NeelNanda/pile-10k on Hugging Face is used as the default calibration dataset and will be downloaded automatically from the Datasets Hub. To customize a dataset, please follow our dataset code. See more about loading a Hugging Face dataset.
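
For reference, the following is a minimal sketch of inspecting the default calibration split with the Hugging Face datasets library; the "text" column name is an assumption about how pile-10k stores its samples.

    from datasets import load_dataset

    # Downloaded automatically from the Hugging Face Hub on first use.
    calib = load_dataset("NeelNanda/pile-10k", split="train")
    print(calib)                    # row count and column names
    print(calib[0]["text"][:200])   # peek at the first calibration sample (assuming a "text" column)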

Customized Dataset

  • Option 1: Pass a local JSON file path to the dataset argument.

  • Option 2: Register your dataset following the code and pass the new dataset and split arguments when initializing the AutoRound object, e.g. autoround = AutoRound(dataset="NeelNanda/pile-10k:train", ...)

  • Option 3: Pass a list of strings or a list of input_ids to dataset, as in the examples below.

    def customized_data():
        ## Important notice: AutoRound will drop samples shorter than args.seqlen and truncate samples to args.seqlen
        data = ["AutoRound is an advanced weight-only quantization algorithm for low-bits LLM inference" * 240]
        data.append("AutoRound is an advanced weight-only quantization algorithm for low-bits LLM inference")
        return data
    
    
    def customized_data_with_tokenizer(tokenizer, seqlen=2048):
        ## Important notice: AutoRound will drop samples shorter than args.seqlen
        data = ["AutoRound is an advanced weight-only quantization algorithm for low-bits LLM inference" * 240]
        data.append("AutoRound is an advanced weight-only quantization algorithm for low-bits LLM inference")
        tokens = []
        for d in data:
            token = tokenizer(d, truncation=True, max_length=seqlen, return_tensors="pt").data
            tokens.append(token)
        return tokens
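
    A minimal sketch of passing the data above to the Python API is shown below; the quantize() and save_quantized() calls are assumed from the AutoRound Python API and the exact argument list may vary between versions.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from auto_round import AutoRound

    model_name = "facebook/opt-125m"
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Pass the list of strings (or the pre-tokenized list) directly as the dataset.
    autoround = AutoRound(model, tokenizer, bits=4, group_size=128,
                          dataset=customized_data())
    autoround.quantize()
    autoround.save_quantized("./tmp_autoround", format="auto_round")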

    Dataset combination: we support combining different datasets and parametrizing calibration datasets, e.g. "--dataset ./tmp.json:concat,NeelNanda/pile-10k:split=train+val:num=256,mbpp:concat=True:num=128:apply_chat_template". Both local calibration files and Hugging Face datasets are supported. Through parametrization, users can select the splits of a dataset by setting "split=split1+split2".

    Sample concatenation: the concatenation option lets users merge calibration samples, e.g. '--dataset NeelNanda/pile-10k:concat=True'.

    Apply chat template: '--dataset NeelNanda/pile-10k:apply_chat_template' applies the chat_template to the calibration data before tokenization, which is widely used by instruct models in generation. Please note that samples shorter than args.seqlen will be dropped when the concatenation option is not enabled.

    Please use ',' to separate datasets, ':' to separate the parameters of a dataset, and '+' to supply multiple values for one parameter.
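
    Assuming the Python dataset argument accepts the same parametrized string as the CLI (as Option 2 above suggests), the combined specification can also be passed directly; model and tokenizer are loaded as in the sketch above.

    # The combination string mirrors the CLI example above; adjust paths and counts to your setup.
    dataset_spec = "./tmp.json:concat,NeelNanda/pile-10k:split=train+val:num=256,mbpp:concat=True:num=128:apply_chat_template"
    autoround = AutoRound(model, tokenizer, bits=4, dataset=dataset_spec)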


3. Run Quantization

  • Default Settings (a Python API sketch covering these and the memory/speed options below follows at the end of this section):

    auto-round --model facebook/opt-125m  --bits 4 --format "auto_round,auto_gptq" --disable_eval
  • Reduced GPU Memory Usage:

    • enable low_gpu_mem_usage (higher tuning cost)

    • set "--train_bs 1 --gradient_accumulate_steps 8" (higher tuning cost)

    • reduce train_bs to 4 (potential accuracy drop)

    • reduce seqlen to 512 (potential accuracy drop)

    • or combine them

  • Reduced CPU Memory Usage:

    • set "--low_cpu_mem_mode 1" to use block-wise mode: the weights of each block are loaded from disk when that block is tuned and released afterwards (higher tuning cost)

    • set "--low_cpu_mem_mode 2" to use layer-wise mode: the weights of each layer are loaded from disk when tuning, giving the lowest memory consumption but also the slowest tuning speed

  • Speedup the tuning:

    • reduce seqlen to 512 (potentially a large accuracy drop in some scenarios)

    • reduce train_bs to 4 (slight accuracy drop)

    • or combine them

  • Enable quantized lm-head:

    This configuration is currently only supported for inference with the AutoRound format.

    auto-round --model_name facebook/opt-125m  --bits 4 --group_size 128 --quant_lm_head --format "auto_round"
  • Enable Marlin kernel:

    To leverage the auto-gptq Marlin kernel, you need to install auto-gptq from source and export the model without sharding.

    auto-round --model facebook/opt-125m  --sym --bits 4 --group_size 128  --format "auto_gptq:marlin"
  • Utilize the AdamW Optimizer:

    Include the flag --adam. Note that AdamW is less effective than sign gradient descent in many scenarios we tested.

  • Code generation LLM:

    We used mbpp for calibration, but your own training dataset is highly recommended. Please note that samples with seqlen < args.seqlen will be dropped in the current version.

     auto-round --model Salesforce/codegen25-7b-multi --bits 4 --dataset "mbpp" --seqlen 128
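
A rough sketch of how the options above map onto the Python API follows; the argument names (batch_size, gradient_accumulate_steps, low_gpu_mem_usage, seqlen) are assumed to mirror the CLI flags and may differ between versions.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from auto_round import AutoRound

    model_name = "facebook/opt-125m"
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    autoround = AutoRound(
        model, tokenizer,
        bits=4, group_size=128,
        low_gpu_mem_usage=True,        # trades tuning time for lower GPU memory
        batch_size=1,                  # mirrors --train_bs 1
        gradient_accumulate_steps=8,   # mirrors --gradient_accumulate_steps 8
        seqlen=512,                    # shorter sequences tune faster but may cost accuracy
    )
    autoround.quantize()
    autoround.save_quantized("./tmp_autoround", format="auto_round")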

4. Evaluation

4.1 Combine evaluation with tuning

  • We leverage lm-eval-harness for the evaluation.
     auto-round --model facebook/opt-125m  --bits 4 --format "auto_round,auto_gptq" --tasks mmlu
    The last format will be used in evaluation if multiple formats have been exported.

4.2 Eval the Quantized model

  • AutoRound format: for lm-eval-harness, you can simply run

    auto-round --model="your_model_path" --eval  --tasks lambada_openai --eval_bs 16

    Multi-GPU evaluation

    auto-round --model="your_model_path" --eval  --device 0,1 --tasks lambada_openai --eval_bs 16

    For other evaluation frameworks: if a framework supports Hugging Face models, it typically supports the AutoRound format as well; all you need to do is add the following import at the beginning of your code

    from auto_round import AutoRoundConfig
  • AutoGPTQ/AutoAWQ format

    Please refer to their repos to check the evaluation framework's compatibility. For lm-eval-harness, you can simply run the command below (a Python sketch of the same evaluation follows after this list).

    lm_eval --model hf --model_args pretrained="your_model_path" --device cuda:0 --tasks lambada_openai --batch_size 16

    Multi-GPU evaluation

    CUDA_VISIBLE_DEVICES=0,1 lm_eval --model hf --model_args pretrained="your_model_path",parallelize=True --tasks lambada_openai --batch_size 16
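
    If you prefer to drive lm-eval-harness from Python instead of the CLI, a minimal sketch using its public simple_evaluate entry point looks like this (paths and task names are placeholders):

    import lm_eval
    from auto_round import AutoRoundConfig  # per the note above, needed so AutoRound-format checkpoints load

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=your_model_path",
        tasks=["lambada_openai"],
        batch_size=16,
    )
    print(results["results"])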

5. Inference

CPU: requires auto_round version >0.3.1; run pip install intel-extension-for-pytorch (much higher speed on Intel CPUs) or pip install intel-extension-for-transformers. A CPU inference sketch appears at the end of this section.

HPU: docker image with Gaudi Software Stack is recommended. More details can be found in Gaudi Guide.

CUDA: no extra steps are needed for symmetric quantization; for asymmetric quantization, auto-round needs to be installed from source.

AutoRound format

  • The following code automatically detects the device; if extra libraries are required, an error message will typically remind you to install them.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from auto_round import AutoRoundConfig  ## must import so the AutoRound format can be loaded
    
    quantized_model_path = "./tmp_autoround"
    device="cuda"
    model = AutoModelForCausalLM.from_pretrained(quantized_model_path).to(device)
    tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
    text = "There is a girl who likes adventure,"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
  • To specify the backend explicitly:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from auto_round import AutoRoundConfig
    
    backend = "auto"  ## options: cpu, hpu, cuda, cuda:marlin (supported in auto_round>0.3.1 with 'pip install -v gptqmodel --no-build-isolation')
    quantization_config = AutoRoundConfig(
        backend=backend
    )
    quantized_model_path = "./tmp_autoround"
    model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
                                                 device_map=backend.split(':')[0], quantization_config=quantization_config)
    tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
    text = "There is a girl who likes adventure,"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
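
    The same pattern works on CPU (see the CPU note at the top of this section); a brief sketch assuming the cpu backend:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from auto_round import AutoRoundConfig

    quantized_model_path = "./tmp_autoround"
    quantization_config = AutoRoundConfig(backend="cpu")
    model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
                                                 device_map="cpu",
                                                 quantization_config=quantization_config)
    tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
    inputs = tokenizer("There is a girl who likes adventure,", return_tensors="pt")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))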

6. Known Issues

  • Quantization results can vary between runs when tuning some models
  • ChatGlm-V1 is not supported