LLMs

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
##
Results
Sort By:
Title
|
Date

Large Language Model Graph Customization with OpenVINO™ Transformations API

April 15, 2024

Authors: Xiake Sun, Wenyi Zou, Fiona Zhao

Introduction

A Large Language Model (LLM) is a type of artificial intelligence algorithm that uses deep learning techniques and massively large data sets to understand, summarize, generate and predict new content.

OpenVINO™ optimizes the deployment of LLMs, enhancing their performance and integration into various applications. We already provide general guide to use LLMs with OpenVINO, from model loading and conversion to advanced use cases.

In this blog, we will introduce some useful method to customize Large Language model’s graph with OpenVINO™ transformation API.

OpenVINO™ Runtime has three main transformation types:

  • Modelpass: straightforward way to work with ov::Model directly
  • Matcherpass: pattern-based transformation approach
  • Graphrewrite pass: container for matcher passes needed for efficient execution.
Figure1. OpenVINO™ transformations API structure overview

In this blog, we mainly use ov::pass::MatcherPassto customize model subgraph via pattern-based transformation.

Here are common steps to implement graph customization using ov::pass::MatcherPass.

  1. Create a pattern
  2. Implement a callback
  3. Register the pattern and Matcher
  4. Execute MatcherPass

In this blog, we will use an open-source LLMs Qwen1.5-7B-Chat-GPTQ-Int4 from Alibaba Cloud with guide for model conversion and graph customization methods.

Qwen Pytorch to OpenVINO™ Model conversion

Here we can use openvino.genai repo to convert Qwen1.5 GPTQ INT4 Pytroch model to OpenVINO™model.

conda create -n openvino.genai python=3.10
git clone https://github.com/openvinotoolkit/openvino.genai
cd llm_bench/python
pip install -r requirements.txt
python convert.py –model_id Qwen/Qwen1.5-7B-Chat-GPTQ-Int4 --output_dir  Qwen1.5-7B-Chat-GPTQ-Int4-OV --precision FP16 

Converted model can be find in path “Qwen1.5-7B-Chat-GPTQ-Int4-OV/pytorch/dldt/GPTQ_INT4-FP16/".

Insert custom layer to OpenVINO model

Vocabularysize in the context of LLMs refers to the total number of unique words, or tokens, that the model can recognize and use. The larger the vocabulary size,the more nuanced and detailed the model’s understanding of language can be,however, it also requires more computational and memory resources for deployment.  E.g. Qwen’s vocabulary size(151936) is almost 5x that Llama2 (32000), therefore additional optimization is required for efficient deployment.

We found that the following pattern existed in the Qwen model in Figure 2:

Figure 2: Workflow to reduce MatMul computation and memory usage for logits in Qwen model

To compute the first token generation for the input prompt with shape [1, seq_length], we need to calculate a MatMul operation based on two inputs.

  • First input is a reshape node output with shape[1, seq_length, 4096]
  • Second input is a constant value that contains the model’s vocabulary with shape [4096,151936]

Then Matmul calculates two inputs [1, seq_length, 4096] * [4096,151936] to output large logits [1, seq_length,151936]. However, for the next token prediction, we only need the last element [1,4096] in 1st dimension from logits for sampling.

The main idea is to insert a slice operation between Reshape and Matmul nodes to extract only the last element in 2nd dimension of reshape node output as the first input with shape [1,4096] for computation. Therefore, Matmul computation can be reduced from [1, seq_len, 4096] * [1, 4096, 151936] = [1, seq_len, 151936] to [1, 1, 4096] *[4096, 151936] = [1, 1, 151936], which can reduce first token latency and memory consumption.

Here is a sample code to implement the workflow defined in Figure2 to reduce Qwen's last Matmul computation and memory usage:

# -*- coding: utf-8 -*-
import numpy as np
import openvino as ov
from openvino.runtime import Core, Type
from openvino.runtime.passes import Manager, MatcherPass, WrapType, Matcher
from openvino.runtime import opset10 as ops
from openvino.preprocess import PrePostProcessor

class InsertSlice(MatcherPass):
    def __init__(self):
        MatcherPass.__init__(self)
        self.model_changed = False

        param = WrapType("opset10.Result")

        def callback(matcher: Matcher) -> bool:
            root = matcher.get_match_root()
            print("root: ", root)
            if root is None:
                return False
            root_output = matcher.get_match_value()
            print("root_output", root_output)
            root_name = root.get_friendly_name()
            if (len(root.get_output_partial_shape(0)) == 3):
                print(f"Find target root node name: {root_name}")
                parent = root.input_value(0).get_node()
                print(f"Find target parent node name: {parent.get_friendly_name()}")
                grand_parent = parent.input_value(0).get_node()
                print(f"Find grandparent node name: {grand_parent.get_friendly_name()}")
                grand_parent_output = parent.input(0).get_source_output()
                print("grand_parent_output: ", grand_parent_output)
                consumers = grand_parent_output.get_target_inputs()
                
                print(f"consumers: {consumers}")
                print("Original reshape node output shape:", grand_parent_output.get_partial_shape())
                start = np.array([0, -1, 0], dtype=np.int32)
                stop = np.array([1, -2, 4096], dtype=np.int32)
                step = np.array([1, -1, 1], dtype=np.int32)
                axes = np.array([0, 1, 2], dtype=np.int32)
                slice = ops.slice(grand_parent, start, stop, step, axes, name="inserted_slice")
                print("After insert slice node, output shape:", slice.output(0).get_partial_shape())

                for consumer in consumers:
                    consumer.replace_source_output(slice.output(0))
                self.model_changed = True
                # Use new operation for additional matching
                self.register_new_node(slice)
                                
                return True

        self.register_matcher(Matcher(param,"InsertSlice"), callback)

if __name__ == "__main__":
    model_path = " Qwen1.5-7B-Chat-GPTQ-Int4-OV/pytorch/dldt/GPTQ_INT4-FP16/ openvino_model.xml"
    modified_model_path = "Qwen1.5-7B-Chat-GPTQ-Int4-OV/pytorch/dldt/GPTQ_INT4-FP16/modified_openvino_model.xml")
    core = Core()
    ov_model = core.read_model(model_path)
    manager = Manager()
    manager.register_pass(InsertSlice())
    manager.run_passes(ov_model)
    ov.save_model(ov_model, modified_model_path)

We defined a OpenVINO™ transformation "InsertSlice" to find the logits (Results) node via ov::pass::MatchPass, then search along root->parent->grandparent node to find the Reshape node. Afterward, we insert a Slice node between the Reshape and Matmul nodes to extract the last element of seq_length with shape [1,1,4096]. In the end, we apply "InsertSlice" transformation to original OpenVINO™ model and save modified model on disk for deployment.

Modify model weights of specified layer in OpenVINO model

In case you want to update certain model layer weights after model training or fine-tuning/compression.

E.g. if you have an INT4 weight-compressed model using another model compression method, e.g. AWQ, you may want to transfer model weights optimized with the quantization method.

The most general method will be to convert the original model to OpenVINO™ model if the model direct conversion works. However, if first option is not works out of box, an alternative option is to replace the model weights from OpenVINO™ models with external fine-tuning model weights.

Figure 3: Update model weights of OpenVINO™ Model with external fine-tuned model weights

Here we introduce a common method to modify layer weights of Qwen model via OpenVINO™ transformation API.

As Figure 3 shows, the goal is to replace model weights and scale of the original Constant node with external fine-tuned weights and scale data.

At first, we use ov::pass::MatchPass method to find the Convert node after the target node. Then we create a new constant node with external weight saved as a numpy array. Please note, GPTQ int4 model weight is saved asuint4 (U4) binary format, while numpy can only represent data with numpy.uint8. Therefore, we use a help function to pack 2 uint4 binary data as 1 uint8 binary data. Then we replace the Convert input port from the original Constant node to the new Constant node.  Since the old constant node has no consumers and is neither the Result nor the Sink operation whose shared pointer counter is zero, the operation will be destructed and not be accessible anymore.

Here is a sample code to implement the workflow defined in Figure3 to replace Qwen Constant node via the new Constant node with external data:

# -*- coding: utf-8 -*-
import numpy as np
import openvino as ov
import torch 
from openvino.runtime import Core, Model, Type
from openvino.runtime.passes import Manager, GraphRewrite, MatcherPass, WrapType, Matcher
from openvino.runtime import opset10 as ops
from openvino.helpers import pack_data, unpack_data
                    
pytorch_to_ov_layer_mapping = [{"__module.model.layers.0.mlp.down_proj/aten::to/Convert": }]
packed_layername_tensor_dict_list = [{"name":"__module.model.layers.0.mlp.down_proj/aten::to/Convert","value":np.ones([1376*4, 4096],dtype=np.uint8)}]

class InsertWeights(MatcherPass):
    def __init__(self,packed_layername_tensor_dict_list):
        MatcherPass.__init__(self)
        self.model_changed = False

        param = WrapType("opset10.Convert")

        def callback(matcher: Matcher) -> bool:
            root = matcher.get_match_root()
            if root is None:
                return False
            root_output = matcher.get_match_value()
            for y in packed_layername_tensor_dict_list:
                #root_name = root.get_friendly_name().replace('.','_')
                root_name = root.get_friendly_name()
                print(f"root_name: {root_name}")
                if root_name.find(y["name"]) != -1 :
                    consumers = root.input_value(0).get_target_inputs()
                    unpacked_data = unpack_data(y["value"],Type.u4,y["value"].shape)
                    print(unpacked_data.shape)
                    new_weights = ops.constant(np.zeros(root.get_output_shape(0)),Type.u4,name=y["name"]+"_new_const")
                    print("new_weights: ", new_weights)
                    new_weights.data[:] = unpacked_data.ravel()
                    print(f"new_weights.shape: {new_weights.shape}")
                    
                    for consumer in consumers:
                        consumer.replace_source_output(new_weights.output(0))

                    # For testing purpose
                    self.model_changed = True
                    # Use new operation for additional matching
                    packed_layername_tensor_dict_list.remove(y)

            return True

        self.register_matcher(Matcher(param,"InsertWeights"), callback)

if __name__ == "__main__":
    model_path = "Qwen1.5-7B-Chat-GPTQ-Int4-OV/pytorch/dldt/GPTQ_INT4-FP16/openvino_model.xml"
    modified_model_path = "Qwen1.5-7B-Chat-GPTQ-Int4-OV/pytorch/dldt/GPTQ_INT4-FP16/modified_openvino_model.xml")
    core = Core()
    ov_model = core.read_model(model_path)
    manager = Manager()
    manager.register_pass(InsertWeights(packed_layername_tensor_dict_list))
    manager.run_passes(ov_model)
    ov.save_model(ov_model, modified_model_path)

We defined a OpenVINO™ transformation "InsertWeights" to find the target constant node via ov::pass::MatchPass, then we create a new Constat node with external numpy data and pack it as uint4 OpenVINO™ Tensor to replace original constant node in graph. In the end, we apply "InsertWeights" transformation to original OpenVINO™ model and save modified model on disk for deployment.

Conclusion

In this blog, we introduce how to apply graph customization based on OpenVINO™ model with OpenVINO™ transformation API. Furthermore, we show two examples of inserting layers & modifying layer weights based on Qwen LLM model with simple Python code.

Reference

QwenLM/Qwen1.5

OpenVINO™Transformation API

IntegrateOpenVINO™ with Your Application – Model Representation

Read More...

Enable chatGLM by creating OpenVINO™ stateful model and runtime pipeline

July 3, 2023

Authors: Zhen Zhao(Fiona), Cheng Luo, Tingqian Li, Wenyi Zou

Introduction

Since the Large Language Models (LLMs) become the hot topic, a lot Chinese language models have been developed and actively deployed in optimization platforms. chatGLM is one of the popular Chinese LLMs which are widely been evaluated. However, ChatGLM model is not yet a native model in Transformers, which means there remains support gap in official optimum. In this blog, we provide a quick workaround to re-construct the model structure by OpenVINO™ opset contains custom optimized nodes for chatGLM specifically and these nodes has been highly optimized by AMX intrinsic and MHA fusion.

*Please note, this blog only introduces a workaround of optimization method by creating OpenVINO™ stateful model for chatGLM.  This workaround has limitation of platform, which requires to use Intel® 4th Xeon Sapphire Rapids with AMX optimization. We do not promise the maintenance of this workaround.

Source link: https://github.com/luo-cheng2021/openvino/tree/luocheng/chatglm_custom/tools/gpt

To support more LLMs, including llama, chatglm2, gpt-neox/dolly, gpt-j and falcon. You can refer this link which not limited on SPR platform, also can compute from Core to Xeon:

Source link: https://github.com/luo-cheng2021/ov.cpu.llm.experimental

ChatGLM model brief

If we check with original model source of chatGLM, we can find that the ChatGLM is not compatible with Optimum ModelForCasualML, it defines the new class ChatGLMForConditionalGeneration. This model has 3 main modules (embedding, GLMBlock layers and lm_logits) during the pipeline loop, the structure is like below:

Model structure of chatGLM

As you can see, the whole pipeline actually require model with two different graphs, the first-time inference with input prompt tokens do not require KV cache as inputs for GLMBlock layers. Since the second iteration, the previous results of QKV Attention should become the inputs of current round model inference. Along with the length of generated token increased, there will remain a lot of large sized memory copies between model inputs and outputs during pipeline inference.  We can use ChatGLM6b default model configurations as an example, the memory copies between input and output arrays are like below pseudocode:


while(eos_token_id || max_seq_len){
    memcpy(model_inp, model_outp, num_layer*2*sizeof(model_outp)* hidden_size)
    model_outp.push_back(gen_token)
}

Therefore, two topics is the most important:

  • How we can optimize model inference pipeline to eliminate memory copy between model inputs and outputs
  • How we can put optimization efforts on GLMBlock module by reinvent execution graph

Extremely optimization by OpenVINO™ stateful model

Firstly, we need to analyze the structure of GLMBlock layer, and try to encapsulate a class to invoke OpenVINO™ opset with below workflow. Then serialize the graph to IR model(.xml, .bin).

Build OpenVINO™ stateful model for chatGLM

To build an OpenVINO™ stateful model, you can refer to this document to learn.

https://docs.openvino.ai/2022.3/openvino_docs_OV_UG_network_state_intro.html

OpenVINO™ also provide model creation sample to show how to build a model by opset.

https://github.com/openvinotoolkit/openvino/blob/master/samples/cpp/model_creation_sample/main.cpp

It is clear to show that the emphasized optimization block is the custom op of Attention for chatGLM. The main idea is to build up a global context to store and update pastKV results internally, and then use intrinsic optimization for Rotary Embedding and Multi-Head Attentions. In this blog, we provide an optimized the attention structure of chatGLM with AMX intrinsic operators.

At the same time, we use int8 to compress the weights of the Fully Connected layer, you are not required to compress the model by Post Training Quantization (PTQ) or process with framework for Quantization Aware Training(QAT).

Create OpenVINO™ stateful model for chatGLM

Please prepare your hardware and software environment like below and follow the steps to optimize the chatGLM:

Hardware requirements

Intel® 4th Xeon platform(codename Sapphire Rapids) and above

Software Validation Environment

Ubuntu 22.04.1 LTS

python 3.10.11 for OpenVINO™ Runtime Python API

GCC 11.3.0 to build OpenVINO™ Runtime

cmake 3.26.4

Building OpenVINO™ Source

  • Install system dependency and setup environment
  • Create and enable python virtual environment

$ conda create -n ov_py310 python=3.10 -y
$ conda activate ov_py310

  • Install python dependency

$ pip install protobuf transformers==4.30.2 cpm_kernels torch>=2.0 sentencepiece pandas

  • Build OpenVINO™ with GCC 11.3.0
  • Clone OpenVINO™ and update submodule

$ git clone https://github.com/luo-cheng2021/openvino.git -b luocheng/chatglm_custom
$ cd openvino && git submodule update --init --recursive

  • Install python dependency for building python wheels

$ python -m pip install -U pip 
$ python -m pip install -r ./src/bindings/python/src/compatibility/openvino/requirements-dev.txt
$ python -m pip install -r ./src/bindings/python/wheel/requirements-dev.txt

  • Create build directory

$ mkdir build && cd build

  • Build OpenVINO™ with CMake

$ cmake .. -DENABLE_LLMDNN=ON \
    -DBUILD_PYTHON_TESTS=ON \
    -DENABLE_CPU_DEBUG_CAPS=OFF \
    -DENABLE_DEBUG_CAPS=OFF  \
    -DCMAKE_BUILD_TYPE=Release \
    -DENABLE_INTEL_MYRIAD_COMMON=OFF \
    -DENABLE_INTEL_GNA=OFF \
    -DENABLE_OPENCV=OFF \
    -DENABLE_CPPLINT=ON \
    -DENABLE_CPPLINT_REPORT=OFF \
    -DENABLE_NCC_STYLE=OFF \
    -DENABLE_TESTS=ON \
    -DENABLE_OV_CORE_UNIT_TESTS=OFF \
    -DENABLE_INTEL_CPU=ON \
    -DENABLE_INTEL_GPU=OFF \
    -DENABLE_AUTO=OFF \
    -DENABLE_AUTO_BATCH=OFF \
    -DENABLE_MULTI=OFF \
    -DENABLE_HETERO=OFF \
    -DENABLE_INTEL_GNA=OFF \
    -DENABLE_PROFILING_ITT=ON\
    -DENABLE_SAMPLES=ON \
    -DENABLE_PYTHON=ON \
    -DENABLE_TEMPLATE=OFF  \
    -DENABLE_OV_ONNX_FRONTEND=OFF \
    -DENABLE_OV_PADDLE_FRONTEND=OFF \
    -DENABLE_OV_PYTORCH_FRONTEND=OFF \
    -DENABLE_OV_TF_FRONTEND=OFF \
    -DENABLE_OPENVINO_DEBUG=OFF \
    -DENABLE_CPU_DEBUG_CAPS=ON \
    -DCMAKE_INSTALL_PREFIX=`pwd`/install \
    -DCMAKE_INSTALL_RPATH=`pwd`/install/runtime/3rdparty/tbb/lib:`pwd`/install/runtime/3rdparty/hddl/lib:`pwd`/install/runtime/lib/intel64 \
    -Dgflags_Dir=`pwd`/../thirdparty/gflags/gflags/cmake
$ make --jobs=$(nproc --all)
$ make install

  • Install built python wheel for OpenVINO™ runtime and openvino-dev tools

$ pip install ./install/tools/openvino*.whl

  • Check system gcc version and conda runtime gcc version. If the system gcc version is higher than conda gcc version like below, you should update conda gcc version for OpenVINO runtime. (Optional)

##check system (OpenVINO compiling env) gcc version
$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

##check conda python (runtime env for OpenVINO later) gcc version
$ python
Python 3.10.11 (main, May 16 2023, 00:28:57) [GCC 11.2.0] on linux

##If sys gcc ver > conda gcc ver, upgrade conda gcc ver -> sys gcc ver
$ conda install -c conda-forge gcc=11.3.0

  • convert pytorch model to OpenVINO™ IR

$ cd ..
$ python tools/gpt/gen_chatglm.py /path/to/pytorch/model /path/to/ov/IR

Use OpenVINO Runtime API to build Inference pipeline for chatGLM  

We provide a demo by using transformers and OpenVINO™ runtime API to build the inference pipeline. In test_chatglm.py, we create a new class which inherit from transformers.PreTrainedModel. And we update the forward function by build up model inference pipeline with OpenVINO™ runtime Python API. Other member functions are migrated from ChatGLMForConditionalGeneration from modeling_chatglm.py, so that, we can make sure the input preparation work, set_random_seed, tokenizer/detokenizer and left pipelined operation can be totally same as original model source.

To enable the int8 weights compress, you just need a simple environment variable USE_INT8_WEIGHT=1. That is because during the model generation, we use int8 to compress the weights of the Fully Connected layer, and then it can use int8 weights to inference on runtime, you are not required to compress the model by framework or quantization tools.

Please follow below steps to test the chatGLM with OpenVINO™ runtime pipeline:

  • Run bf16 model

$ python3  tools/gpt/test_chatglm.py /path/to/pytorch/model /path/to/ov/IR --use=ov

  • Run int8 model

$ USE_INT8_WEIGHT=1 python test_chatglm.py /path/to/pytorch/model /path/to/ov/IR --use=ov

Weights compression reduces memory bandwidth utilization to improve inference speed

We use VTune for performance comparison analysis of model weights bf16 and int8. Comparative analysis of memory bandwidth and CPI rate (Table 1). When model weight is compressed to int8, it can reduce memory bandwidth utilization and CPI rate.

Memory Bandwidth Utilization when model weight precision is bf16
Memory Bandwidth Utilization when model weight precision is int8
Table 1. CPI Rate for different model weight precision

Clockticks per Instructions Retired(CPI) event ratio, also known as Cycles per Instructions, is one of the basic performance metrics for the hardware event-based sampling collection, also known as Performance Monitoring Counter (PMC) analysis in the sampling mode. This ratio is calculated by dividing the number of unhalted processor cycles(Clockticks) by the number of instructions retired. On each processor the exact events used to count clockticks and instructions retired may be different, but VTune Profiler knows the correct ones to use.

A CPI < 1 is typical for instruction bound code, while a CPI > 1 may show up for a stall cycle bound application, also likely memory bound.

Conclusion

Along with the upgrading of OpenVINO™ main branch, the optimization work in this workaround will be generalized and integrated into official release. It will be helpful to scale more LLMs model usage. Please refer OpenVINO™ official release and Optimum-intel OpenVINO™ backend to get official and efficient support for LLMs.

Read More...

AquilaChat-7B Language Model Enabling with Hugging Face Optimum Intel

July 3, 2023

Introduction

What is AquilaChat-7B Language Model?

Figure 1: Aquila Language Model from BAAI

Aquila Language Model is a set of open-source large language models (LLMs) developed by the Beijing Academy of Artificial Intelligence (BAAI). Aquila models support both Chinese and English, commercial license agreements, and compliance with Chinese domestic data regulations.

AquilaChat-7B is a conversational language model that supports Chinese and English dialogue. It is based on the Aquila-7B foundation model and fine-tuned using supervised fine-tuning (SFT). AquilaChat-7B original Pytorch model and configurations are publicly available here.

Hugging Face Optimum Intel

Figure 2: Hugging Face Optimum Intel

Hugging Face is one of the most popular open-source data science and machine learning platforms. It acts as a hub for AI experts and enthusiasts—like a GitHub for AI. Over 200,000 models are available across Natural language processing, Multimodal models, Computer Vision, and Audio domains.

Hugging Face provides wide support for model optimization and deployment of open-sourced LLMs such as LLaMA, Bloom, GPT-Neox, Dolly 2.0, to name a few. More details please refer to Open LLM Leaderboard.

Optimum-Intel provides a simple interface between the Hugging Face and OpenVINOTM ecosystem to leverage high-performance inference capabilities for Intel architecture.  Here is a simple example to show how to run Dolly 2.0 models with OVModelForCausalLM using OpenVINOTM runtime.

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id, export=True, use_cache=True)
prompt = "Deep learning is a type of machine learning"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs)

Hola! So, for LLMs already supported by Hugging Face transformers and optimum, we can smoothly switch the model inference backend from Pytorch to OpenVINOTM by changing only two lines of code.

However, what if an LLM from an open-source community that not native supported by Hugging Face Transformers library? How can we still leverage the tools of Hugging Face and OpenVINOTM ecosystem for model optimization and deployment?

Indeed, AquilaChat-7B is a custom model for the Hugging Face Transformers. So, we use it as an example to elaborate the custom model enabling methodology step by step.

How to Enable a Custom Model on Hugging Face?

To leverage the Hugging Face ecosystem and optimization for AquilaChat-7B model, we need to convert the original Pytorch model to Hugging Face Format. Before we dive into conversion details, we need to figure out what is AquilaChat-7B’s model structure, tokenizer, and configurations.

According to Aquila’s official model description:

“The Aquila language model inherits the architectural design advantages of GPT-3 and LLaMA, replacing a batch of more efficient underlying operator implementations and redesigning the tokenizer for Chinese-English bilingual support. The Aquila language model is trained from scratch on high-quality Chinese and English corpora. “

Model Structure and Tokenizer

For model structure, Aquila Model adopts the original Meta LLaMA pytorch implementation, which combines RMSNorm (GPT-3) to improve training stability and Rotary Position Embedding (GPT-NeoX)to incorporate explicit relative position dependency in self-attention.

For tokenizer, instead of using byte-pair encoding (BPE) algorithms implemented by Sentence Piece, Aquila re-trained HuggingFace GPT-NeoX tokenizer with extended vocabulary (vocab_size =100008, including 8 special tokens, e.g. bos_token=100006, eos_token=100007, unk=0, pad=0 used for inference based on here.

Rotary Position Embedding

Rotary Position Embedding (RoPE) encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in the self-attention formulation. Compare to other position embedding methods, RoPE provides valuable properties such as flexibility of sequence length, long-term decay, and linear self-attention with relative position embedding. Based on the original paper, there are two mainstream implementations of RoPE:

As show in Figure 3, Meta LLaMA’s implementation directly use complex number to calculate rotary position embedding.

Figure 3. Meta LLaMA’s implementation of RoPE

As show in Figure 4, Google PaLM’s implementation expands the complex number operation and calculate sinusoidal functions in matrix equation of real numbers.

Figure 4.Google PaLM’s implementation of RoPE

Both RoPE implementations are valid for the Pytorch model. Hugging Face LLaMA implementation adopts PaLM’s RoPE implementation due to the limitation of complex type support for ONNX export.

Besides, Hugging Face provides a useful script convert_llama_weights_to_hf.py to convert the original Meta LLaMA Pytorch Model to Hugging Face Format as follows:

  • Extract Pytorch weights and convert Meta LlaMA RoPE implementation to Hugging Face RoPE implementation.
  • Convert tokenizer.model trained with Sentence Piece to Hugging Face LLaMA tokenizer.

Convert AquilaChat-7B Model to Hugging Face Format

Similarly, we provide a convert_aquila_weights_to_hf.py to convert AquilaChat-7B Model to Hugging Face Format.

  • Extract Pytorch weights and convert Aquila RoPE implementation to Hugging Face RoPE implementation
  • Initialize and save a Hugging Face GPT-NeoX Tokenizer with extended vocabulary based on original tokenizer configurations provided by Aquila.
  • Add a modeling_aquila.py to enable support forAutoModelForCausalLM and AutoTokenizer

Here is the converted Hugging Face version of AquilaChat-7B v0.6 model uploaded in Hugging Face.

You may convert pytorch weights to Hugging Face format in two steps:

  1. Download AquilaChat-7B Pytorch Model and configurations here
  2. Convert AquilaChat-7B Pytorch Model and configurations to Hugging Face Format
pip install transformers>=4.30.2
python convert_aquila_weights_to_hf.py --input_dir checkpoints/aquilachat-7b --model_size 7B --output_dir aquilachat-7b-hf

Hugging Face AquilaChat-7B Demo

Setup Environment

pip install optimum==1.8.8 optimum-intel[openvino]==1.9.1 onnx onnxruntime

Run inference with AutoModelForCausalLM

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('sammysun0711/aquilachat-7b-hf')
model = AutoModelForCausalLM.from_pretrained('sammysun0711/aquilachat-7b-hf', trust_remote_code=True)
model = model.eval()

question = '北京为什么是中国的首都?'
prompt = (
    '''A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.'''
    f'''###Human: {question}###Assistant:'''
)
with torch.no_grad():
    ret = model.generate(
        **tokenizer(prompt, return_tensors='pt').to('cpu'),
        do_sample=False,
        max_new_tokens=200,
        use_cache=True
    )
    print(tokenizer.decode(ret.tolist()[0]))
北京是中国的首都,是因为它在中国历史和文化中具有重要的地位,被选中作为中国的政治中心。在中国古代,北京是几个朝代的首都,如辽、金、元、明、清朝。在这些朝代,北京都是政治、经济、文化中心和军事重镇。此外,北京还是现代中国的政治中心,有着重要的国际地位。

Run inference with OVModelForCausalLM

from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('sammysun0711/aquilachat-7b-hf')
model = OVModelForCausalLM.from_pretrained('sammysun0711/aquilachat-7b-hf', export=True, use_cache=True, trust_remote_code=True)

question = '北京为什么是中国的首都?'
prompt = (
    '''A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.'''
    f'''###Human: {question}###Assistant:'''
)
ret = model.generate(
        **tokenizer(prompt, return_tensors='pt').to('cpu'),
        do_sample=False,
        max_new_tokens=200,
        use_cache=True
)
print(tokenizer.decode(ret.tolist()[0]))
北京是中国的首都,是因为它在中国历史和文化中具有重要的地位,被选中作为中国的政治中心。在中国古代,北京是几个朝代的首都,如辽、金、元、明、清朝。在这些朝代,北京都是政治、经济、文化中心和军事重镇。此外,北京还是现代中国的政治中心,有着重要的国际地位。

Conclusion

In this blog, we show how to convert a custom Large Language Model (LLM) to Hugging Face format to leverage efficient optimization and deployment with Hugging Face and OpenVINOTM Ecosystem.

Please note, this is the initial model enabling step for AquilaChat-7B model with OpenVINOTM. We will continue to optimize performance along with upgrading OpenVINOTM for LLM scaling.  Please refer to OpenVINOTM and Optimum-Intel official release to get latest efficient support for LLMs with OpenVINOTM backend.

Reference

Read More...