Llama-2-70b-chat-hf-onnx-int4
Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This repository contains the INT4 weight-only quantized version of the 70B fine-tuned chat model in ONNX format.
Note: Use of this model is governed by the Meta license. Please ensure you have accepted that license and obtained access to the FP32 model before downloading models from this repository.
This INT4 model is generated with Intel® Neural Compressor's weight-only quantization method.
Model Detail | Description |
---|---|
Model Authors - Company | Intel |
Date | August 29, 2023 |
Version | 1 |
Type | Text Generation |
Paper or Other Resources | - |
License | https://ai.meta.com/resources/models-and-libraries/llama-downloads/ |
Questions or Comments | Community Tab |
Intended Use | Description |
---|---|
Primary intended uses | You can use the raw model for text generation inference |
Primary intended users | Anyone doing text generation inference |
Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |
Export to ONNX Model
The FP32 ONNX model is exported from meta-llama/Llama-2-70b-chat-hf with optimum-cli:
optimum-cli export onnx --model meta-llama/Llama-2-70b-chat-hf --task text-generation ./llama2_70b_chat
Build ONNX Runtime
Build ONNX Runtime from source to support the MatMulWithQuantWeight op. Refer to build-onnx-runtime-for-inferencing for additional prerequisites.
git clone -b sub_byte_quant_zp https://github.com/microsoft/onnxruntime.git
cd onnxruntime
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync --skip_tests --build_wheel
Run Quantization
The weight-only quantization configuration is as follows:
dtype | group_size | scheme | algorithm |
---|---|---|---|
INT4 | 32 | asym | RTN |
Run INT4 weight-only quantization with Intel® Neural Compressor. We provide the key code below. For the complete quantization script, please refer to llama weight-only example.
from neural_compressor import quantization, PostTrainingQuantConfig

# Weight-only INT4 quantization: RTN, asymmetric, group size 32, for all op types.
config = PostTrainingQuantConfig(
    approach="weight_only",
    calibration_sampling_size=[8],
    op_type_dict={".*": {"weight": {"bits": 4,
                                    "algorithm": ["RTN"],
                                    "scheme": ["asym"],
                                    "group_size": 32}}},
)

# `dataloader` is not defined in this excerpt; see the complete script referenced above.
q_model = quantization.fit(
    "/path/to/llama2_70b_chat/decoder_model.onnx",  # FP32 model path
    config,
    calib_dataloader=dataloader,
)
q_model.save("/path/to/Llama-2-70b-chat-hf-onnx-int4/decoder_model.onnx")  # INT4 model path
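To make the RTN / asym / group_size 32 settings above more concrete, here is a minimal pure-Python sketch of group-wise round-to-nearest asymmetric quantization. It is an illustration only, not Intel® Neural Compressor's implementation: the names `quantize_rtn_asym` and `dequantize` are hypothetical, the weights are treated as a flat list rather than a matrix, and the zero point is left unclamped for simplicity.

```python
import math


def quantize_rtn_asym(weights, bits=4, group_size=32):
    """Quantize a flat list of floats group by group (hypothetical helper).

    Returns a list of (quantized_values, scale, zero_point) tuples, one per group.
    """
    qmax = (1 << bits) - 1  # 15 for 4-bit unsigned codes
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        w_min, w_max = min(group), max(group)
        # Asymmetric scheme: map [w_min, w_max] onto [0, qmax].
        scale = (w_max - w_min) / qmax
        if scale == 0.0:
            scale = 1.0  # constant group; avoid division by zero
        zero_point = round(-w_min / scale)
        # Round to nearest (RTN), then clip to the representable range.
        q = [min(qmax, max(0, round(w / scale + zero_point))) for w in group]
        groups.append((q, scale, zero_point))
    return groups


def dequantize(groups):
    """Reconstruct approximate float weights from the quantized groups."""
    return [(q - zp) * scale for qs, scale, zp in groups for q in qs]


# Example: 64 synthetic weights form two groups of 32.
weights = [0.1 * math.sin(i) for i in range(64)]
groups = quantize_rtn_asym(weights)
restored = dequantize(groups)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"groups: {len(groups)}, max reconstruction error: {max_error:.6f}")
```

Because each group carries its own scale and zero point, the rounding error of a weight is bounded by roughly half of its group's scale. Smaller groups tighten that bound at the cost of storing more scales and zero points.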
Evaluation
Operator Statistics
The operator statistics of the INT4 ONNX model are shown below:
Op Type | Total | INT4 weight | FP32 |
---|---|---|---|
MatMul | 641 | 561 | 80 |
Evaluation of perplexity
Evaluate the model on the lambada_openai task with the evaluation API of Intel® Extension for Transformers.
from intel_extension_for_transformers.evaluation.lm_eval import evaluate

model_path = "/path/to/Llama-2-70b-chat-hf-onnx-int4"
tokenizer = "Intel/Llama-2-70b-chat-hf-onnx-int4"
batch_size = 64
tasks = ["lambada_openai"]

results = evaluate(
    model="hf-causal",
    model_args="pretrained=" + model_path + ",tokenizer=" + tokenizer,
    batch_size=batch_size,
    tasks=tasks,
    model_format="onnx",
)
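For readers unfamiliar with the two reported metrics, the sketch below shows how accuracy and perplexity are commonly aggregated for a LAMBADA-style task. It assumes the usual definition: each example yields the log-likelihood of its target word plus a flag for whether greedy decoding reproduces that word, and perplexity is the exponential of the negative mean log-likelihood. The function name `lambada_metrics` and the sample values are hypothetical, and the exact aggregation used by the evaluation API above is not confirmed by this document.

```python
import math


def lambada_metrics(results):
    """Aggregate per-example results into (accuracy, perplexity).

    `results` is a list of (log_likelihood, is_greedy_match) pairs, one per
    example. This input layout is an assumption made for illustration.
    """
    log_likelihoods = [ll for ll, _ in results]
    matches = [1 if ok else 0 for _, ok in results]
    acc = sum(matches) / len(matches)
    ppl = math.exp(-sum(log_likelihoods) / len(log_likelihoods))
    return acc, ppl


# Example with made-up per-example values.
sample = [(-1.0, True), (-1.0, False)]
acc, ppl = lambada_metrics(sample)
print(f"acc={acc:.4f}, ppl={ppl:.4f}")
```

Under this definition, lower perplexity means the model assigns higher probability to the correct final word, so a small rise in perplexity after quantization indicates a small loss of confidence rather than a change in the task.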
Model | Model Size (GB) | lambada_openai acc | lambada_openai ppl |
---|---|---|---|
FP32 | 257 | 0.7543 | 2.6181 |
INT4 | 43 | 0.7510 | 2.6561 |
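The size reduction in the table can be sanity-checked with a short back-of-the-envelope calculation. The sketch below compares the observed ratio (257 GB to 43 GB) with an idealized estimate for 4-bit weights in groups of 32. The storage layout is an assumption, not something stated in this card: one FP32 scale and one 4-bit zero point per group. The function name `bits_per_weight` is hypothetical.

```python
def bits_per_weight(bits=4, group_size=32, scale_bits=32, zero_point_bits=4):
    """Effective bits per weight, including per-group metadata.

    scale_bits and zero_point_bits are assumptions about the storage layout.
    """
    return bits + (scale_bits + zero_point_bits) / group_size


# Figures taken from the results table above.
fp32_gb, int4_gb = 257, 43
fp32_acc, int4_acc = 0.7543, 0.7510

observed_ratio = fp32_gb / int4_gb           # measured compression
ideal_ratio = 32 / bits_per_weight()         # if every weight were quantized
relative_acc_drop = (fp32_acc - int4_acc) / fp32_acc

print(f"observed compression: {observed_ratio:.2f}x")
print(f"idealized compression: {ideal_ratio:.2f}x")
print(f"relative accuracy drop: {relative_acc_drop:.2%}")
```

The observed ratio comes out a little under the idealized one, which is consistent with the operator statistics: 80 of the 641 MatMul ops, along with any other non-quantized tensors, remain in FP32.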