AquilaChat-7B Language Model Enabling with Hugging Face Optimum Intel

Introduction

What is AquilaChat-7B Language Model?

Figure 1: Aquila Language Model from BAAI

Aquila Language Model is a family of open-source large language models (LLMs) developed by the Beijing Academy of Artificial Intelligence (BAAI). Aquila models support both Chinese and English, are available under a commercial license agreement, and comply with Chinese domestic data regulations.

AquilaChat-7B is a conversational language model that supports Chinese and English dialogue. It is based on the Aquila-7B foundation model and fine-tuned with supervised fine-tuning (SFT). The original PyTorch model and configurations of AquilaChat-7B are publicly available here.

Hugging Face Optimum Intel

Figure 2: Hugging Face Optimum Intel

Hugging Face is one of the most popular open-source data science and machine learning platforms. It acts as a hub for AI experts and enthusiasts, like a GitHub for AI. Over 200,000 models are available across the natural language processing, multimodal, computer vision, and audio domains.

Hugging Face provides wide support for model optimization and deployment of open-source LLMs such as LLaMA, BLOOM, GPT-NeoX, and Dolly 2.0, to name a few. For more details, please refer to the Open LLM Leaderboard.

Optimum-Intel provides a simple interface between the Hugging Face and OpenVINO™ ecosystems to leverage high-performance inference capabilities on Intel architecture. Here is a simple example that shows how to run Dolly 2.0 models with OVModelForCausalLM using the OpenVINO™ runtime.

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly;
# use_cache=True enables the KV-cache for faster autoregressive decoding.
model = OVModelForCausalLM.from_pretrained(model_id, export=True, use_cache=True)
prompt = "Deep learning is a type of machine learning"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs)
# Decode the generated token IDs back to text.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

So, for LLMs already supported by Hugging Face Transformers and Optimum, we can smoothly switch the model inference backend from PyTorch to OpenVINO™ by changing only two lines of code.

However, what if an LLM from the open-source community is not natively supported by the Hugging Face Transformers library? How can we still leverage the Hugging Face and OpenVINO™ ecosystems for model optimization and deployment?

AquilaChat-7B is exactly such a custom model for the Hugging Face Transformers library, so we use it as an example to walk through the custom model enabling methodology step by step.

How to Enable a Custom Model on Hugging Face?

To leverage the Hugging Face ecosystem and its optimizations for the AquilaChat-7B model, we need to convert the original PyTorch model to Hugging Face format. Before we dive into conversion details, we need to figure out AquilaChat-7B's model structure, tokenizer, and configurations.

According to Aquila’s official model description:

“The Aquila language model inherits the architectural design advantages of GPT-3 and LLaMA, replacing a batch of more efficient underlying operator implementations and redesigning the tokenizer for Chinese-English bilingual support. The Aquila language model is trained from scratch on high-quality Chinese and English corpora.”

Model Structure and Tokenizer

For the model structure, Aquila adopts the original Meta LLaMA PyTorch implementation, which applies RMSNorm pre-normalization (as in GPT-3) to improve training stability and Rotary Position Embedding (as in GPT-NeoX) to incorporate explicit relative position dependency in self-attention.
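Concretely, RMSNorm normalizes hidden states by their root mean square without mean subtraction. Here is a minimal LLaMA-style sketch in PyTorch; this is an illustrative re-implementation, not the actual Aquila source:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Root-mean-square layer norm as used in LLaMA-style models:
    # scales by the RMS of the hidden states, with no mean subtraction or bias.
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)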

For the tokenizer, instead of using the byte-pair encoding (BPE) algorithm implemented by SentencePiece, Aquila re-trained the Hugging Face GPT-NeoX tokenizer with an extended vocabulary (vocab_size=100008, including 8 special tokens, e.g. bos_token=100006, eos_token=100007, unk=0, pad=0) used for inference, based on here.
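As a quick sanity check, the extended vocabulary and special token IDs can be inspected once the tokenizer is loaded. The snippet below assumes the converted checkpoint sammysun0711/aquilachat-7b-hf introduced later in this post:

from transformers import AutoTokenizer

# Load the converted GPT-NeoX-style tokenizer and inspect its configuration.
tokenizer = AutoTokenizer.from_pretrained('sammysun0711/aquilachat-7b-hf')
print(len(tokenizer))          # full vocabulary size, expected 100008
print(tokenizer.bos_token_id)  # expected 100006
print(tokenizer.eos_token_id)  # expected 100007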

Rotary Position Embedding

Rotary Position Embedding (RoPE) encodes the absolute position with a rotation matrix and meanwhile incorporates explicit relative position dependency in the self-attention formulation. Compared to other position embedding methods, RoPE provides valuable properties such as flexibility of sequence length, long-term decay, and linear self-attention with relative position embedding. Based on the original paper, there are two mainstream implementations of RoPE.
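At its core, for a two-dimensional pair of features at position m, the rotation can be written equivalently with complex numbers or with a real rotation matrix (per the original RoFormer paper):

$$
f(\mathbf{q}, m) = \mathbf{q}\, e^{i m\theta}
\;\Longleftrightarrow\;
\begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}
\begin{pmatrix} q_0 \\ q_1 \end{pmatrix}
$$

The two implementations below differ only in which of these equivalent forms they compute.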

As shown in Figure 3, Meta LLaMA's implementation directly uses complex numbers to calculate the rotary position embedding.

Figure 3. Meta LLaMA’s implementation of RoPE

As shown in Figure 4, Google PaLM's implementation expands the complex-number operation and calculates the sinusoidal functions as a matrix equation of real numbers.

Figure 4. Google PaLM’s implementation of RoPE

Both RoPE implementations are valid for the PyTorch model. The Hugging Face LLaMA implementation adopts PaLM's RoPE implementation because the complex number type is not supported for ONNX export.
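To make the second formulation concrete, here is a minimal sketch of real-valued RoPE in PyTorch, in the PaLM / Hugging Face LLaMA style. It is an illustrative re-implementation, not the actual library source:

import torch

def rotate_half(x):
    # Split the last dimension in half and rotate: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, positions, dim, base=10000.0):
    # Real-valued RoPE: the complex rotation is expanded into
    # element-wise cos/sin products over the head dimension.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    freqs = torch.outer(positions.float(), inv_freq)   # (seq_len, dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)            # (seq_len, dim)
    cos, sin = emb.cos(), emb.sin()
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot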

Besides, Hugging Face provides a useful script, convert_llama_weights_to_hf.py, to convert the original Meta LLaMA PyTorch model to Hugging Face format as follows:

  • Extract the PyTorch weights and convert the Meta LLaMA RoPE implementation to the Hugging Face RoPE implementation (see the permutation sketch after this list).
  • Convert the tokenizer.model trained with SentencePiece to the Hugging Face LLaMA tokenizer.
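The core of the RoPE conversion is a permutation of the query and key projection weights, reordering Meta's interleaved (complex-style) channel pairing into Hugging Face's half-split (real-valued) pairing. Here is a sketch in the spirit of the conversion script (variable names are illustrative):

import torch

def permute(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    # Reorder rows of a (dim1, dim2) projection weight so that the interleaved
    # RoPE pairing (x0, x1), (x2, x3), ... becomes the half-split pairing
    # (x0, x_{d/2}), (x1, x_{d/2+1}), ... expected by the real-valued form.
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)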

Convert AquilaChat-7B Model to Hugging Face Format

Similarly, we provide convert_aquila_weights_to_hf.py to convert the AquilaChat-7B model to Hugging Face format:

  • Extract the PyTorch weights and convert the Aquila RoPE implementation to the Hugging Face RoPE implementation.
  • Initialize and save a Hugging Face GPT-NeoX tokenizer with the extended vocabulary, based on the original tokenizer configurations provided by Aquila.
  • Add a modeling_aquila.py to enable support for AutoModelForCausalLM and AutoTokenizer (see the config sketch after this list).
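The custom modeling code is hooked up through the auto_map field in the converted checkpoint's config.json, which tells Transformers which class to load when trust_remote_code=True is passed. A minimal sketch, assuming the causal LM class in modeling_aquila.py is named AquilaForCausalLM (the class name is an assumption, not copied from the actual file):

{
  "architectures": ["AquilaForCausalLM"],
  "auto_map": {
    "AutoModelForCausalLM": "modeling_aquila.AquilaForCausalLM"
  }
}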

The converted Hugging Face version of the AquilaChat-7B v0.6 model is uploaded here on the Hugging Face Hub.

You can convert the PyTorch weights to Hugging Face format in two steps:

  1. Download the AquilaChat-7B PyTorch model and configurations here
  2. Convert the AquilaChat-7B PyTorch model and configurations to Hugging Face format
pip install 'transformers>=4.30.2'
python convert_aquila_weights_to_hf.py --input_dir checkpoints/aquilachat-7b --model_size 7B --output_dir aquilachat-7b-hf

Hugging Face AquilaChat-7B Demo

Setup Environment

pip install optimum==1.8.8 'optimum-intel[openvino]==1.9.1' onnx onnxruntime

Run inference with AutoModelForCausalLM

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True is required to load the custom modeling_aquila.py.
tokenizer = AutoTokenizer.from_pretrained('sammysun0711/aquilachat-7b-hf')
model = AutoModelForCausalLM.from_pretrained('sammysun0711/aquilachat-7b-hf', trust_remote_code=True)
model = model.eval()

question = '北京为什么是中国的首都?'  # "Why is Beijing the capital of China?"
prompt = (
    '''A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.'''
    f'''###Human: {question}###Assistant:'''
)
with torch.no_grad():
    ret = model.generate(
        **tokenizer(prompt, return_tensors='pt').to('cpu'),
        do_sample=False,       # greedy decoding for a deterministic answer
        max_new_tokens=200,
        use_cache=True         # enable the KV-cache to speed up decoding
    )
    print(tokenizer.decode(ret.tolist()[0]))
北京是中国的首都,是因为它在中国历史和文化中具有重要的地位,被选中作为中国的政治中心。在中国古代,北京是几个朝代的首都,如辽、金、元、明、清朝。在这些朝代,北京都是政治、经济、文化中心和军事重镇。此外,北京还是现代中国的政治中心,有着重要的国际地位。

(English translation: Beijing is the capital of China because it holds an important place in Chinese history and culture and was chosen as China's political center. In ancient China, Beijing was the capital of several dynasties, such as the Liao, Jin, Yuan, Ming, and Qing. Under these dynasties, Beijing was the political, economic, and cultural center as well as a military stronghold. In addition, Beijing is the political center of modern China and holds an important international position.)

Run inference with OVModelForCausalLM

from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('sammysun0711/aquilachat-7b-hf')
# export=True converts the custom PyTorch model to OpenVINO IR on the fly.
model = OVModelForCausalLM.from_pretrained('sammysun0711/aquilachat-7b-hf', export=True, use_cache=True, trust_remote_code=True)

question = '北京为什么是中国的首都?'  # "Why is Beijing the capital of China?"
prompt = (
    '''A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.'''
    f'''###Human: {question}###Assistant:'''
)
ret = model.generate(
    **tokenizer(prompt, return_tensors='pt').to('cpu'),
    do_sample=False,
    max_new_tokens=200,
    use_cache=True
)
print(tokenizer.decode(ret.tolist()[0]))
北京是中国的首都,是因为它在中国历史和文化中具有重要的地位,被选中作为中国的政治中心。在中国古代,北京是几个朝代的首都,如辽、金、元、明、清朝。在这些朝代,北京都是政治、经济、文化中心和军事重镇。此外,北京还是现代中国的政治中心,有着重要的国际地位。

(The OpenVINO™ backend produces the same response as the PyTorch backend above; see the English translation there.)
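To avoid re-exporting the model on every run, the converted OpenVINO™ IR can be saved once and reloaded directly. A minimal sketch, where the directory name aquilachat-7b-ov is arbitrary:

# Save the exported OpenVINO IR and tokenizer once...
model.save_pretrained('aquilachat-7b-ov')
tokenizer.save_pretrained('aquilachat-7b-ov')

# ...then reload them directly in later runs, skipping the export step.
model = OVModelForCausalLM.from_pretrained('aquilachat-7b-ov', use_cache=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('aquilachat-7b-ov')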

Conclusion

In this blog, we showed how to convert a custom large language model (LLM) to Hugging Face format to leverage efficient optimization and deployment with the Hugging Face and OpenVINO™ ecosystems.

Please note that this is only the initial model enabling step for the AquilaChat-7B model with OpenVINO™. We will continue to optimize performance along with upgrading OpenVINO™ for LLM scaling. Please refer to the official OpenVINO™ and Optimum-Intel releases to get the latest efficient support for LLMs with the OpenVINO™ backend.

Reference