awq example runs into error with llama 3.2 3b due to embedding layer #1089

tranlm · 2024-11-22T03:23:08Z

Describe the bug
When I run the example from examples/python/awq-quantized-model.md, but switching out phi-3 for llama-3.2-3b, I get an error message stating that AttributeError: 'NoneType' object has no attribute 'detach'. However, when I use the extra_option exclude_embeds=true, the onnx conversion step runs successfully.

To Reproduce
Steps to reproduce the behavior:

Follow the example from examples/python/awq-quantized-model.md, but switching out for model_name = "meta-llama/Llama-3.2-3B-Instruct"
At the onnx conversion step (after the quantization is complete), observe the error.

Expected behavior
The conversion to onnx should occur successfully, with no errors.

Screenshots

(base) C:\Users\Tranl\Documents>python -m onnxruntime_genai.models.builder -i C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-quant -o C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-onnx -p int4 -e dml -c ..\cache_dir
Valid precision + execution provider combinations are: FP32 CPU, FP32 CUDA, FP16 CUDA, FP16 DML, INT4 CPU, INT4 CUDA, INT4 DML
GroupQueryAttention (GQA) is used in this model.
Unpacking and repacking layer 0
Unpacking and repacking layer 1
Unpacking and repacking layer 2
Unpacking and repacking layer 3
Unpacking and repacking layer 4
Unpacking and repacking layer 5
Unpacking and repacking layer 6
Unpacking and repacking layer 7
Unpacking and repacking layer 8
Unpacking and repacking layer 9
Unpacking and repacking layer 10
Unpacking and repacking layer 11
Unpacking and repacking layer 12
Unpacking and repacking layer 13
Unpacking and repacking layer 14
Unpacking and repacking layer 15
Unpacking and repacking layer 16
Unpacking and repacking layer 17
Unpacking and repacking layer 18
Unpacking and repacking layer 19
Unpacking and repacking layer 20
Unpacking and repacking layer 21
Unpacking and repacking layer 22
Unpacking and repacking layer 23
Unpacking and repacking layer 24
Unpacking and repacking layer 25
Unpacking and repacking layer 26
Unpacking and repacking layer 27
Reading embedding layer
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Tranl\AppData\Roaming\Python\Python312\site-packages\onnxruntime_genai\models\builder.py", line 3267, in <module>
    create_model(args.model_name, args.input, args.output, args.precision, args.execution_provider, args.cache_dir, **extra_options)
  File "C:\Users\Tranl\AppData\Roaming\Python\Python312\site-packages\onnxruntime_genai\models\builder.py", line 3151, in create_model
    onnx_model.make_model(input_path)
  File "C:\Users\Tranl\AppData\Roaming\Python\Python312\site-packages\onnxruntime_genai\models\builder.py", line 2058, in make_model
    self.make_embedding(module.weight.detach().numpy())
                        ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'detach'

Desktop (please complete the following information):

OS: Windows
Version 11 Pro
Anaconda3-2024.10-1-Windows-x86_64
torch 2.5.1+cu124
onnxruntime-genai-directml 0.5.1

Additional context
I've manually tried loading the awq quantized model and it looks fine. I can see the embeddings and grab them by attribute as well. Here is the output when I exclude embeddings:

(base) C:\Users\Tranl\Documents>python -m onnxruntime_genai.models.builder -i C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-quant -o C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-onnx -p int4 -e dml -c ..\cache_dir --extra_options exclude_embeds=true
Valid precision + execution provider combinations are: FP32 CPU, FP32 CUDA, FP16 CUDA, FP16 DML, INT4 CPU, INT4 CUDA, INT4 DML
Extra options: {'exclude_embeds': 'true'}
GroupQueryAttention (GQA) is used in this model.
Unpacking and repacking layer 0
Unpacking and repacking layer 1
Unpacking and repacking layer 2
Unpacking and repacking layer 3
Unpacking and repacking layer 4
Unpacking and repacking layer 5
Unpacking and repacking layer 6
Unpacking and repacking layer 7
Unpacking and repacking layer 8
Unpacking and repacking layer 9
Unpacking and repacking layer 10
Unpacking and repacking layer 11
Unpacking and repacking layer 12
Unpacking and repacking layer 13
Unpacking and repacking layer 14
Unpacking and repacking layer 15
Unpacking and repacking layer 16
Unpacking and repacking layer 17
Unpacking and repacking layer 18
Unpacking and repacking layer 19
Unpacking and repacking layer 20
Unpacking and repacking layer 21
Unpacking and repacking layer 22
Unpacking and repacking layer 23
Unpacking and repacking layer 24
Unpacking and repacking layer 25
Unpacking and repacking layer 26
Unpacking and repacking layer 27
Reading decoder layer 0
Reading decoder layer 1
Reading decoder layer 2
Reading decoder layer 3
Reading decoder layer 4
Reading decoder layer 5
Reading decoder layer 6
Reading decoder layer 7
Reading decoder layer 8
Reading decoder layer 9
Reading decoder layer 10
Reading decoder layer 11
Reading decoder layer 12
Reading decoder layer 13
Reading decoder layer 14
Reading decoder layer 15
Reading decoder layer 16
Reading decoder layer 17
Reading decoder layer 18
Reading decoder layer 19
Reading decoder layer 20
Reading decoder layer 21
Reading decoder layer 22
Reading decoder layer 23
Reading decoder layer 24
Reading decoder layer 25
Reading decoder layer 26
Reading decoder layer 27
Reading final norm
Reading LM head
Saving ONNX model in C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-onnx
2024-11-21 21:16:27,418 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - skip to quantize /model/constant_nodes/TensorProto.INT64/1D/1 ...
<...etc>
<...etc>
<...etc>
2024-11-21 21:16:27,441 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - skip to quantize /model/layers.28/final_norm_layernorm/SkipLayerNorm ...
2024-11-21 21:16:27,441 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - start to quantize /lm_head/MatMul ...
2024-11-21 21:16:28,155 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - complete quantization of /lm_head/MatMul ...
Saving GenAI config in C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-onnx
Saving processing files in C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-onnx for GenAI

The text was updated successfully, but these errors were encountered:

tranlm · 2024-11-22T21:09:12Z

Hi @baijumeswani - I just want to confirm that I'm specifically running the example for dml.

kunal-vaishnavi · 2024-11-22T22:22:33Z

The weights for the embedding and language modeling head (LM head) are similar as one is the transpose of the other. Some models that have very large vocabulary sizes tie the embedding and LM head weights together by saving one copy of the weights on disk. When the weights are tied, they can be stored either in the embedding or in the LM head.

The below code snippet sets the LM head's attributes from the embedding's attributes if not already set.

onnxruntime-genai/src/python/py/models/quantized_model.py

Lines 340 to 345 in 17061e0

    
           # Set LM head weights + biases if not already set 
        
           if isinstance(self.lm_head, TensorModule) and self.lm_head.weight is None: 
        
               # Embedding and LM head share same weights + biases (lm_head.weight == embedding.weight and lm_head.bias == embedding.bias) 
        
               self.lm_head.weight = self.embedding.weight 
        
               if self.lm_head.bias is not None: 
        
                   self.lm_head.bias = self.embedding.bias

However, the reverse way to set the embedding's attributes from the LM head's attributes is not added. For LLaMA-3.2, it appears that the .safetensors files store the embedding weights in model.lm_head.weight instead of model.embed_tokens.weight.

To temporarily unblock you, can you add the following in quantized_model.py after the above code snippet?

# This is a copy of the above code snippet where references to `embedding` are replaced with `lm_head`
# and references to `lm_head` are replaced with `embedding`

# Set embedding weights + biases if not already set
if isinstance(self.embedding, TensorModule) and self.embedding.weight is None:
    # LM head and embedding share same weights + biases (embedding.weight == lm_head.weight and embedding.bias == lm_head.bias)
    self.embedding.weight = self.lm_head.weight
    if self.embedding.bias is not None:
        self.embedding.bias = self.lm_head.bias

The logic for handling the bias needs to be re-visited in both cases before merging a fix. In some models, the condition should be if bias is None. In other models, the condition should be if bias is not None. You can locally change the logic in both code snippets as needed to get the right weights and biases.

microsoft-github-policy-service bot added the ep:DML label Nov 22, 2024

baijumeswani removed the ep:DML label Nov 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

awq example runs into error with llama 3.2 3b due to embedding layer #1089

awq example runs into error with llama 3.2 3b due to embedding layer #1089

tranlm commented Nov 22, 2024 •

edited

Loading

tranlm commented Nov 22, 2024

kunal-vaishnavi commented Nov 22, 2024

awq example runs into error with llama 3.2 3b due to embedding layer #1089

awq example runs into error with llama 3.2 3b due to embedding layer #1089

Comments

tranlm commented Nov 22, 2024 • edited Loading

tranlm commented Nov 22, 2024

kunal-vaishnavi commented Nov 22, 2024

tranlm commented Nov 22, 2024 •

edited

Loading