SDPA Enabling for Custom Model

Authors: Su Yang, Xiake Sun, Fiona Zhao

Introduction

To enable SDPA fusion on GPU, we first need to convert the model into OpenVINO IR that contains the SDPA op.

Create a new class SdpaAttention in modeling_custom_model.py that uses torch.nn.functional.scaled_dot_product_attention. This PyTorch op can be matched and converted into the OpenVINO SDPA op.

Fig1: from Attention Is All You Need


Refer to Phi3SdpaAttention: this module inherits from Phi3Attention, as the weights of the module stay untouched. The only changes are in the forward pass, which is adapted to the SDPA API.
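
As a reference, here is a minimal, self-contained sketch of this pattern. The class and parameter names (CustomModelAttention, hidden_size, num_heads, attention_dropout_rate) are placeholders for whatever the real modeling_custom_model.py defines, not the actual custom model, and KV-cache handling is omitted:

import torch
from torch import nn

# Sketch only: CustomModelAttention stands in for the original attention class,
# and SdpaAttention overrides forward() to call the SDPA API.
class CustomModelAttention(nn.Module):
    def __init__(self, hidden_size=1024, num_heads=16, attention_dropout_rate=0.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.attention_dropout_rate = attention_dropout_rate
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.o_proj = nn.Linear(hidden_size, hidden_size)
        # The original eager forward (matmul/softmax/dropout) is omitted here.

    def _split_heads(self, x, bsz, q_len):
        # (batch, seq, hidden) -> (batch, heads, seq, head_dim)
        return x.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)


class SdpaAttention(CustomModelAttention):
    # Weights stay untouched; only the forward pass is adapted to SDPA.
    def forward(self, hidden_states, attention_mask=None):
        bsz, q_len, _ = hidden_states.size()
        query_states = self._split_heads(self.q_proj(hidden_states), bsz, q_len)
        key_states = self._split_heads(self.k_proj(hidden_states), bsz, q_len)
        value_states = self._split_heads(self.v_proj(hidden_states), bsz, q_len)
        # Single fused op that OpenVINO can match and convert to its SDPA op
        attn_output = torch.nn.functional.scaled_dot_product_attention(
            query_states, key_states, value_states,
            attn_mask=attention_mask,
            dropout_p=self.attention_dropout_rate if self.training else 0.0,
            is_causal=attention_mask is None and q_len > 1,
        )
        attn_output = attn_output.transpose(1, 2).reshape(bsz, q_len, -1)
        return self.o_proj(attn_output)

Because only forward() is overridden, existing checkpoints load into SdpaAttention without any weight changes.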

torch.scaled_dot_product_attention

Starting from the equivalent implementation, the target is to replace the related PyTorch ops (softmax, matmul, and dropout) with torch.nn.functional.scaled_dot_product_attention.

Fig2: implementation equivalent from PyTorch 2.3 documentation

For some custom model, the equivalent code is as follows:

# Eager attention: scores = Q·Kᵀ, softmax, dropout, then a weighted sum with V
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3))
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=query_states.dtype)
…
attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout_rate, training=self.training)
attn_output = torch.matmul(attn_weights, value_states)

The corresponding implementation with scaled_dot_product_attention is below. Note scale=1: since the eager code above applies no 1/sqrt(head_dim) scaling, the default SDPA scale must be overridden to match:

attn_output = torch.nn.functional.scaled_dot_product_attention(
    query_states,
    key_states,
    value_states,
    attn_mask=attention_mask,
    dropout_p=self.attention_dropout_rate if self.training else 0.0,
    is_causal=attention_mask is None and q_len > 1,
    scale=1,
)

SDPA’s scaling factor with different head_size

For a PyTorch model with a different head_size, whose eager implementation scales the attention scores by 1/sqrt(head_dim), the scaled_dot_product_attention call needs to be modified with the matching scaling factor.

attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
…
attn_output = torch.nn.functional.scaled_dot_product_attention(
    query_states,
    key_states,
    value_states,
    attn_mask=attention_mask,
    dropout_p=self.attention_dropout_rate if self.training else 0.0,
    is_causal=attention_mask is None and q_len > 1,
    scale=1 / math.sqrt(self.head_dim),
)

The SDPA operator itself has already been implemented. SDPA fusion on GPU supports head size = 128 as of the OV24.2 release.

The OV24.3 release relaxes the SDPA head size limitation for LLMs from 128 only to a range of 64 to 256, thanks to Sergey’s PR.

Usage

Replace the original modeling_custom_model.py in the PyTorch model folder with the new script (containing SdpaAttention).
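
As an illustration only, here is a hedged sketch of one possible conversion flow after the file swap, assuming the custom model loads through transformers with trust_remote_code; the folder name and example inputs are placeholders, and your actual conversion route may differ:

import openvino as ov
import torch
from transformers import AutoModelForCausalLM

# Assumption: "custom_model/" is the PyTorch model folder that now contains
# the patched modeling_custom_model.py with the SdpaAttention class.
model = AutoModelForCausalLM.from_pretrained("custom_model", trust_remote_code=True)
model.eval()
model.config.use_cache = False  # keep the traced graph simple for this sketch

example_input = {
    "input_ids": torch.randint(0, model.config.vocab_size, (1, 8)),
    "attention_mask": torch.ones(1, 8, dtype=torch.long),
}
ov_model = ov.convert_model(model, example_input=example_input)
ov.save_model(ov_model, "custom_model/openvino_model.xml")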

Notice:

  • After converting the model again, check for the SDPA layers (“aten::scaled_dot_product_attention”) in the OV IR .xml file (a minimal check sketch follows this list).
  • Double-check the OpenVINO executable graph to confirm that SDPA is enabled.
  • Don’t forget to check the accuracy of PyTorch model inference after any modification to modeling_custom_model.py.
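
For the first check, here is a minimal sketch using the OpenVINO Python API; the IR path below is a placeholder, not a file from this blog:

import openvino as ov

core = ov.Core()
ov_model = core.read_model("custom_model/openvino_model.xml")  # placeholder IR path

# SDPA should appear as ScaledDotProductAttention ops in the converted IR;
# per the note above, the .xml should also reference "aten::scaled_dot_product_attention".
sdpa_ops = [op for op in ov_model.get_ops()
            if op.get_type_name() == "ScaledDotProductAttention"]
print(f"Found {len(sdpa_ops)} SDPA ops:",
      [op.get_friendly_name() for op in sdpa_ops])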

Conclusion

In this blog, we introduced how to use torch.nn.functional.scaled_dot_product_attention to enable SDPA for a custom model.

The performance improvement from SDPA on the MTL iGPU depends on the model structure. Enabling SDPA for the custom model is the foundation for further optimizations such as Paged Attention.