OpenVINO Blog


OpenVINO Optimization-LLM Distributed

Author : Kunda,Xu / Zhai, Xiuchuan / Sun, Xiaoxia / Li, Tingqian / Shen, Wanglei

With the continuous development of deep learning, large models such as LLMs and multimodal models have become key technologies in many fields. Training large models requires substantial computing resources, and so do LLM inference and deployment, which has driven the widespread adoption of distributed parallel techniques. These techniques distribute a large model across multiple computing nodes and accelerate training and inference through parallel computing. The main strategies are Model Parallel (MP), Data Parallel (DP), Pipeline Parallel (PP), and Tensor Parallel (TP).

In this blog, we focus on how OpenVINO uses tensor parallelism to optimize distributed LLM inference.

Tensor parallel is a technique used to fit a large model across multiple devices. For example, when multiplying the input tensor with the first weight tensor, the matrix multiplication is equivalent to splitting the weight tensor column-wise, multiplying each column block with the input separately, and then concatenating the separate outputs. These partial outputs are then transferred from the CPUs/GPUs and concatenated to get the final result.
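The column-wise split described above can be checked numerically. The following is a minimal NumPy sketch, not OpenVINO code: the function name `column_parallel_matmul` and the tensor shapes are illustrative assumptions, and the shards are computed sequentially here rather than on separate devices.

```python
import numpy as np

def column_parallel_matmul(x, weight, num_parts):
    """Compute x @ weight by splitting weight column-wise into num_parts shards."""
    # Each shard holds a contiguous block of output columns (output channels).
    shards = np.array_split(weight, num_parts, axis=1)
    # In a real deployment each partial product would run on its own device or NUMA node.
    partial_outputs = [x @ shard for shard in shards]
    # Concatenating along the output axis restores the full result.
    return np.concatenate(partial_outputs, axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # batch of 4 inputs with 8 features
weight = rng.standard_normal((8, 6))   # FC weight with 6 output channels

full = x @ weight
split = column_parallel_matmul(x, weight, num_parts=2)
print(np.allclose(full, split))
```

Because each output column depends only on the matching weight column, no communication is needed until the final concatenation.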

Fig 1. Tensor Parallel Schematic

Taking tensor parallelism for the Fully Connected (FC) layer as an example, we introduce OpenVINO's distributed optimization of LLMs. When profiling the inference time of LLM tasks, we found that, because of the large parameter size of LLMs, the FC layers account for a large share of the time. Reducing the latency of the FC layers therefore effectively reduces the LLM first token/second token latency.

A fully connected layer is a neural network layer in which each neuron applies a linear transformation to the input vector through a weight matrix. As a result, all possible layer-to-layer connections are present, meaning every input of the input vector influences every output of the output vector.

In OpenVINO, SplitFC is used to split the FC layer across two device nodes to implement tensor parallelism.

SplitFC detects FC CPU operations with and without compressed weights, and then splits each FC into several smaller FCs by output channel according to the sub-stream number. The goal is that the executor can dispatch the split FCs to different NUMA nodes in the system, so that the split FCs can be executed in parallel.

Fig 2. Fully Connect Layer Split workflow

The following describes how to use the distributed features in OpenVINO to deploy the model on two CPU nodes to improve the performance of LLM.

Step 1. Set up the environment

$ python -m venv venv_ov_distributed
$ source venv_ov_distributed/bin/activate
$ pip install openvino-nightly

Step 2. Clone the openvino.genai source code and install the requirements

$ git clone
$ cd ./openvino.genai/llm_bench/python
$ pip install -r requirements.txt

Step 3. Prepare the OpenVINO IR LLM model

The conversion script prepares the benchmarking models and allows you to reproduce the IRs stored on the shared drive. Make sure the prerequisites in requirements.txt are already installed.

$ python --model_id "model_id_or_path" --output_dir "out_dir"

# ======= Usage example ==========

$ python --model_id meta-llama/Llama-2-7b-chat-hf --output_dir models/llama-2-7b-chat

Step 4. Run the OpenVINO GenAI LLM benchmark on a single socket

$ python -m "model" -d "device" -r "report_csv" -f "framework" -p "prompt text" -n "num_iters"
# =========== e.g. =============
$ python -m models/llama-2-7b-chat/pytorch/dldt/FP32 -n 2
$ python -m models/llama-2-7b-chat/pytorch/dldt/FP32 -p "What is openvino?" -n 2
$ python -m models/llama-2-7b-chat/pytorch/dldt/FP32 -pf prompts/llama-2-7b-chat_l.jsonl -n 2
Fig 3. Single socket LLM inference screenshot

Step 5. Run the OpenVINO GenAI LLM benchmark distributed across two sockets

$ echo '{"MODEL_DISTRIBUTION_POLICY":"TENSOR_PARALLEL"}' > ov_distributed_config.json

$ python llm_bench/python/ -m models/llama-2-7b-chat/pytorch/dldt/FP32  -pf  llm_bench/python/prompts/llama-2-7b-chat_l.jsonl -bs 1 -lc ov_distributed_config.json
Fig 4. Double socket LLM distributed inference screenshot

As can be seen from the figure above, with nstream=1 and nireq=1 the LLM runs on two sockets, and the printout of OV_config shows that MODEL_DISTRIBUTION_POLICY has turned on the TENSOR_PARALLEL feature.

Using OpenVINO distributed feature can significantly reduce the first token/second token latency for general LLM models.
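If you prefer to generate the configuration file from Python rather than with echo, a sketch along the following lines should produce the same JSON. The property name and value come from Step 5; the helper `write_distributed_config` is a hypothetical name, and the example writes into a temporary directory only to keep it self-contained.

```python
import json
import tempfile
from pathlib import Path

def write_distributed_config(path):
    """Write the tensor-parallel config used in Step 5 and return its path."""
    # Property name and value are taken from the echo command in Step 5.
    config = {"MODEL_DISTRIBUTION_POLICY": "TENSOR_PARALLEL"}
    path = Path(path)
    path.write_text(json.dumps(config))
    return path

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = write_distributed_config(Path(tmp_dir) / "ov_distributed_config.json")
    # Read the file back to confirm it is valid JSON.
    loaded = json.loads(config_path.read_text())

print(loaded)
```

The resulting file is what the -lc option in Step 5 points to.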


Enable 2D Lip Sync Wav2Lip Pipeline with OpenVINO Runtime

July 15, 2024

Authors: Xiake Sun, Kunda Xu

1. Introduction

Lip sync technologies are widely used for digital human use cases, which enhance the user experience in dialog scenarios.

Wav2Lip is a novel approach to generate accurate 2D lip-synced videos in the wild with only one video and an audio clip. Wav2Lip leverages an accurate lip-sync "expert" model and consecutive face frames for accurate, natural lip motion generation.

In this blog, we introduce how to enable and optimize the Wav2Lip pipeline with OpenVINO™.

Here is Wav2Lip pipeline overview:

Figure 1: Wav2Lip pipeline overview

2. Setup Environment

$ git clone
$ cd Wav2Lip
$ conda create -n wav2lip python=3.8
$ conda activate wav2lip
$ pip install -r requirements.txt
$ sudo apt-get install ffmpeg

Download the Wav2lip pytorch model from link and move it to the checkpoints folder.

3. PyTorch to OpenVINO™ Model Conversion

$ python

The exported OpenVINO™ model will be saved in the checkpoints folder.

4. Run pipeline inference with OpenVINO™ Runtime

$ python --face_detection_path checkpoints/face_detection.xml --wav2lip_path checkpoints/wav2lip.xml --inference_device CPU --face data_video_sun_5s.mp4 --audio data_audio_sun_5s.wav

Here are the parameters with descriptions:

--face_detection_path: path of the face detection OpenVINO™ IR

--wav2lip_path: path of the Wav2Lip OpenVINO™ IR

--inference_device: specify the device to run OpenVINO™ inference.

--face: input video with face information

--audio: input audio with voice information

--static: set True to use single frame for face detection for fast inference
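As an illustration of how these options fit together, here is a minimal argparse sketch. It is not the project's actual script: the flag names follow the list above, while the types, defaults, `required` settings, and the helper names `str_to_bool` and `build_parser` are assumptions.

```python
import argparse

def str_to_bool(value):
    # Accepts "True"/"False" style values, since the description says "set True".
    return str(value).lower() in ("true", "1", "yes")

def build_parser():
    # Flag names mirror the parameter list; everything else is assumed for illustration.
    parser = argparse.ArgumentParser(description="Wav2Lip inference options (illustrative)")
    parser.add_argument("--face_detection_path", required=True, help="face detection IR (.xml)")
    parser.add_argument("--wav2lip_path", required=True, help="Wav2Lip IR (.xml)")
    parser.add_argument("--inference_device", default="CPU", help="device used for inference")
    parser.add_argument("--face", required=True, help="input video with face information")
    parser.add_argument("--audio", required=True, help="input audio with voice information")
    parser.add_argument("--static", type=str_to_bool, default=False,
                        help="use a single frame for face detection")
    return parser

example_args = build_parser().parse_args([
    "--face_detection_path", "checkpoints/face_detection.xml",
    "--wav2lip_path", "checkpoints/wav2lip.xml",
    "--inference_device", "CPU",
    "--face", "data_video_sun_5s.mp4",
    "--audio", "data_audio_sun_5s.wav",
])
print(example_args.inference_device, example_args.static)
```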

The generated video will be saved as results/result_voice.mp4

Here is an example to compare original video and generated video after the Wav2Lip pipeline:

Figure 2: Original input video
Figure 3: Wav2Lip generated video

5. Conclusion

In this blog, we introduce how to deploy the Wav2Lip pipeline with OpenVINO™ as follows:

  • Support PyTorch model to OpenVINO™ model conversion.
  • Run and optimize the Wav2Lip pipeline with OpenVINO™ Runtime.

OpenVINO Enable Digital Human-TTS (GPT-SoVITs)

July 14, 2024

Authors : Kunda, Xu / Aidova, Ekaterina


GPT-SoVITS is a powerful voice cloning model that supports few-shot voice conversion and text-to-speech. It supports inference in Chinese, English, and Japanese.

You only need to provide a 5-second voice sample to experience voice cloning with 80%~95% similarity. If you provide a 1-minute voice sample, you can get close to the effect of a real person and train a high-quality TTS model!

The project was recently open-sourced as a cross-language voice cloning project by the founder of RVC Voice Changer (GitHub nickname: RVC-Boss). GPT-SoVITs attracted high praise from the industry as soon as it went online, and it received 1.4k stars on GitHub in less than two days. Its near-human voice cloning quality and its zero-shot capability have made GPT-SoVITs highly recognized in the field of digital humans.

Although there are many tutorials on how to use GPT-SoVITs online, in this article I will show you how to optimize it through OpenVINO and deploy it on the CPU. GPT-SoVITs belongs to the category of TTS tasks, but the difference is that it can add sound features to the generated audio by introducing additional reference audio (ref_wav), making the output audio effect closer to real people speaking.

The pipeline structure of GPT-SoVITs

Fig.1-GPT-SoVITs pipeline

GPT-SOVITS adds a residual quantization layer to the original SOVITS input. Referring to VALL-E, the input of this quantization layer contains the text features and timbre features of the audio. GPT-SoVITs mainly consists of four parts:  text preprocessing, audio preprocessing, VALL-E, VITS-decode

- Text preprocessing :
Convert text encoding information (text_ids) into pronunciation encoding information (phoneme_ids)
Use the Bert model to encode the phoneme_ids information, and take the third-to-last layer tensor as the output
The phoneme_ids and Bert features are phoneme-level features of the text, similar to pinyin

- Audio preprocessing :
Use the cn_hubert_encoder model to sample the audio information and extract the code_ids of ref_audio
Based on ref_audio, extract the frequency-domain frame features ref_audio_spec

- VALL-E :
Use VALL-E to predict pred_sementic and idx. The idx is used to truncate pred_sementic (remove the features of the reference audio)

- VITs-decode :
Use VITS-decode to decode and get the output

HuBert sub model

The HuBert model is designed to extract audio autoencoder features and is developed by the Facebook AI research team. The goal is to learn a universal speech representation that can capture important information in speech signals, such as speech content, speaker identity, emotional state, etc. These representations can be used in a variety of downstream tasks, such as automatic speech recognition (ASR), speaker identification, sentiment analysis, etc.

Reference paper :

Fig.2-HuBert model structure

VALL-E sub model

VALL-E is a neural codec model developed by Microsoft Research, focusing on the field of speech synthesis. The characteristic of VALL-E is that it can generate synthesized speech that meets the speech characteristics of the target speaker through a small number of target speech samples (such as a 3-second speech clip). VALL-E completes the generation of speech coding based on Encoder. The GPT module of GPT-SOVITS implements the process from text to speech coding. Referring to SOVITS, a residual quantization layer is added to the original entrance. The input of this quantization layer contains the text features and timbre features of the audio.

Reference paper :

Fig.3-VALL-E model structure

SoVITs sub model

The core module of GPT-SOVITS is not much different from SOVITS, which is an end-to-end text-to-speech(TTS) synthesis model. It combines variational inference and adversarial learning to generate high-quality, natural-sounding speech. It is still divided into two parts:

  • Generator based on VAE + FLOW
  • Discriminator based on multi-scale classifier

Reference paper :

Fig.5-GPT-SoVITs 4 sub model time cost

By counting the inference time of the entire GPT-SoVITs pipeline, we found that the two models with the largest time overhead are text2sementic (VALL-E) and vits (VITs-decode). The total overhead of the two models accounts for 87% of the entire pipeline. Therefore, optimizing these two models through OpenVINO is very necessary to improve the performance of the entire pipeline.

You can refer to the sample code snippets to achieve this. OpenVINO enables the four sub models (HuBert, Bert, VALL-E, VITs) and builds the pipeline based on OpenVINO.

Reference GitHub project : 18582088138/GPT-SoVITS-OpenVINO


Export HuBert model code snippet

HuBert sub model definition

Fig.6-HuBert model definition

HuBert sub model convert

Fig.7-HuBert model convert

Export Bert model code snippet

Bert sub model definition and convert

Fig.8-Bert model definition and convert


Export VALL-E model code snippet

Since t2s is defined as three modules in the source code (t2s_encoder, first_stage_decoder, stage_decoder), we follow the source code implementation and convert these three modules into corresponding IR models through OpenVINO.
However, if this generative model adopted KV cache optimization in the way basic Transformer models do, the performance of the t2s model would improve further. This requires some code development work, and I will add this function in subsequent work (To Do).

VALL-E sub model definition

Fig.9-VALL-E model definition

VALL-E sub model convert

Fig.10-VALL-E model convert

Export VITs sub model code snippet

In the VITs model, the spectrogram torch function has two operators "torch.hann_window" and "torch.stft". Currently, OpenVINO opset does not support these two operators. Therefore, this part temporarily needs to use the operators in torch opset and redefine the original VITs sub-model. After the subsequent operators are successfully supported, OpenVINO will re-enable the VITs model (To Do).
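To make clear what these two operators compute, here is a minimal NumPy sketch of a Hann-windowed magnitude spectrogram. It is not the project's replacement code and not a drop-in for torch.stft: it applies no centering or padding, and the function names and the test signal are illustrative assumptions.

```python
import numpy as np

def hann_window(length):
    # Periodic Hann window: 0.5 - 0.5 * cos(2*pi*n / length)
    n = np.arange(length)
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * n / length)

def magnitude_spectrogram(signal, n_fft, hop_length):
    """Window overlapping frames with a Hann window and take the FFT magnitude."""
    window = hann_window(n_fft)
    num_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(num_frames)
    ])
    # rfft keeps the n_fft // 2 + 1 non-negative frequency bins of each frame.
    return np.abs(np.fft.rfft(frames, axis=1))

n_fft, hop_length = 256, 64
t = np.arange(1024)
signal = np.sin(2.0 * np.pi * 16 * t / n_fft)   # pure tone centred on bin 16
spec = magnitude_spectrogram(signal, n_fft, hop_length)
print(spec.shape, int(spec[0].argmax()))
```

The result has one row per frame and one column per frequency bin, and the energy of the test tone concentrates in the bin that matches its frequency.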

VITs sub model definition

Fig.11-VITs model definition

VITs sub model convert

Fig.12-VITs model convert


Finally, we rebuilt the GPT-SoVITs pipeline through the four sub models enabled by OpenVINO. You can refer to the implementation of the pipeline:

Reference link : GPT-SoVITS-OpenVINO/GPT_SoVITS/ at 18582088138/GPT-SoVITS-OpenVINO


MiniCPM-V-2 model enabling with OpenVINO


MiniCPM is an End-Side LLM developed by ModelBest Inc. and TsinghuaNLP. MiniCPM-V is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models take image and text as inputs and provide high-quality text outputs. MiniCPM-V 2.0 is an efficient version with promising performance for deployment. The model is built based on SigLip-400M and MiniCPM-2.4B, connected by a perceiver resampler. In this blog, we provide the OpenVINO™ optimization for MiniCPM-V 2.0 on Intel® platforms.

You can find more information on GitHub repository:

OpenVINO™ backend on MiniCPM-V-2

Step 1: Install system dependency and setup environment

Create and enable python virtual environment

conda create -n ov_py310 python=3.10 -y
conda activate ov_py310

Clone the MiniCPM-V repository from GitHub

git clone -b ov_runtime

Change the current directory to the MiniCPM-V OpenVINO™ Runtime folder

 cd MiniCPM-V/eval_mm/openvinoruntime/

Install python dependency

pip install -r requirement.txt 

Step 2: Export to OpenVINO™ models

python -m /path/to/openbmb/MiniCPM-V-2 -o /path/to/MiniCPM-V-2_ov

Step 3: Simple inference test with OpenVINO™

python -m /path/to/MiniCPM-V-2_ov -pic /path/to/hk_OCR.jpg -p "Describe the content of the image"

Question: Describe the content of the image



The image captures the vibrant and bustling atmosphere of a busy city street in Hong Kong. The street is lined with an array of neon signs and billboards, each one advertising a different business or establishment. The signs are in a variety of languages, including English, Chinese, and Japanese, reflecting the multicultural nature of the city.

The street itself is a hive of activity with several buses and a tram making their way through the traffic. The vehicles are in motion, adding a dynamic element to the scene.

The sky above is a beautiful gradient of colors, transitioning from a deep blue at the top to a lighter shade at the bottom. This suggests that the photo was taken during either sunrise or sunset, casting a warm glow over the cityscape.

The image also contains several text elements, including the names of various establishments and the brand names of products. These texts add another layer of information to the scene, providing insights into the nature of the businesses and the products they offer.

Overall, the image provides a vivid snapshot of life in Hong Kong, capturing the city's vibrant energy and the diverse range of businesses and products that make up its bustling streets.


Enable Personalized Text-to-Speech Pipeline with SAMBERT-HifiGAN via OpenVINO Python API

July 5, 2024

Authors: Tianmeng Chen, Xiake Sun, Fiona Zhao, Su Yang


Personalized Speech Synthesis is the process of using recording devices around you to record voice clips of a particular person, and then letting Text-To-Speech (TTS) technology synthesize the voice, manner of speaking, and emotion of that person. SAMBERT-HifiGAN is a complete personalized TTS solution designed by Alibaba Damo Institute, which consists of two parts: the SAMBERT acoustic model and the HifiGAN vocoder.

Structure of SAMBERT
Structure of HifiGAN

In this blog, we will introduce how to use the OpenVINO™ Python API to enable the SAMBERT-HifiGAN pipeline. All the project code can be found here.

KAN-TTS by Ali provides a tutorial for training SAMBERT-HifiGAN. A PyTorch-based pipeline for personalized speech synthesis is provided on ModelScope; what we will do here is replace the PyTorch-based part of it with OpenVINO™. It is worth noting that, due to some of the operators in the model, some modules cannot be replaced with the OpenVINO™ Python API.


Since we need to make changes to the pipeline based on the PyTorch backend, the first thing we need to do is download the KAN-TTS source code and successfully run through the pipeline to get the inputs and outputs of the model as well as the state of the middle layers. Of course, we also need the OpenVINO™ environment.

  1. Get the KAN-TTS source code and create an Anaconda environment.
git clone -b develop
pip config set global.index-url
conda env create -f environment.yaml
conda activate maas
  2. Then install openvino in the same environment. If you want a specific version of OpenVINO™, you can install it yourself by following Install OpenVINO™.
pip install openvino
  3. Follow the official KAN-TTS practice tutorial in the readme on ModelScope.

          After you finish running the KAN-TTS pipeline, you will get the res folder and the ckpt files in speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k.

  4. Get the OpenVINO™ backend project source code and copy the res folder to the project folder.
git clone  
cp -r $KAN-TTS_PATH/res  $SambertHifigan_OV/

Convert torch model to OpenVINO™ model

Converting a torch model to OpenVINO™ requires model inputs. So we use test.txt as input of SAMBERT and use the res folder as input of HifiGAN.

python kantts/bin/ --txt test.txt --output_dir res/test_male_ptts_syn --res_zip speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/ --am_ckpt speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/pretrain_work_dir/tmp_am/ckpt/checkpoint_2400200.pth --voc_ckpt speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/pretrain_work_dir/orig_model/basemodel_16k/hifigan/ckpt/checkpoint_2400000.pth  --se_file speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/pretrain_work_dir/data/se/se.npy --is_ov_convert

After a few minutes, you will get two converted OpenVINO™ models: sambert_encoder.xml with sambert_encoder.bin, and hifigan_t.xml with hifigan_t.bin.

In the code, after we load the model and get the inputs, we add the following code to convert the loaded PyTorch backend model to an OpenVINO™ backend model and save it.

Example code of SAMBERT pipeline
Example code of HifiGAN pipeline

Run the inference with OpenVINO™ model

Before running the inference, the res folder should be renamed to allow for comparisons later.

mv res res_pytorch

Then run the command below.

python kantts/bin/ --txt test.txt --output_dir res/test_male_ptts_syn --res_zip speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/ --am_ckpt speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/pretrain_work_dir/tmp_am/ckpt/checkpoint_2400200.pth --voc_ckpt speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/pretrain_work_dir/orig_model/basemodel_16k/hifigan/ckpt/checkpoint_2400000.pth  --se_file speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/pretrain_work_dir/data/se/se.npy

After a few minutes, you will get the wav file in res/test_male_ptts_syn. For example in test.txt we write a random sentence:

After running pipeline, we will get a 7 seconds wav file under res folder:

In the code, we modified the original PyTorch backend inference code so that the pipeline uses the OpenVINO™ backend for inference.

Example code of SAMBERT pipeline
Example code of HifiGAN pipeline


This blog describes how to run the SAMBERT-HifiGAN pipeline using the OpenVINO™ Python API. Please see the source code for more details and modifications.


OpenVINO GenAI Serving (OGS)

July 4, 2024

Authors: Fiona Zhao, Xiake Sun, Wenyi Zou, Su Yang, Tianmeng Chen

Model Server reference implementation based on OpenVINO GenAI Package for Edge/Client AI PC Use Case.

openvino.genai/samples/cpp/rag_sample at openvino_genai_serving · sammysun0711/openvino.genai

figure: architecture

Use Case 1: C++ RAG Sample that supports most popular models like LLaMA 2

This example showcases Retrieval-Augmented Generation based on text-generation Large Language Models (LLMs): chatglm, LLaMA, Qwen and other models with the same signature, and a Bert model for embedding feature extraction. The sample features ov::genai::LLMPipeline and configures it for the chat scenario. There is also a Jupyter notebook which provides an example of LLM-powered RAG in Python.

Download and convert the model and tokenizers

The --upgrade-strategy eager option is needed to ensure optimum-intel is upgraded to the latest version.

python3 -m pip install --upgrade-strategy eager -r ../../requirements.txt
optimum-cli export openvino --trust-remote-code --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 TinyLlama-1.1B-Chat-v1.0

Setup of PostgreSQL, Libpqxx and Pgvector

Langchain's document Loader and Splitter

  1. Load: document_loaders is used to load document data.
  2. Split: text_splitter breaks large Documents into smaller chunks. This is useful both for indexing data and for passing it into a model, since large chunks are harder to search over and won't fit in a model's finite context window.
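To illustrate the Split step, the following is a simplified character-based chunking sketch. It is not LangChain's text_splitter implementation; the function name `split_text` and the `chunk_size` and `chunk_overlap` values are assumptions chosen for the example.

```python
def split_text(text, chunk_size, chunk_overlap):
    """Split text into chunks of at most chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

document = "OpenVINO is an open-source toolkit for optimizing and deploying AI inference. " * 5
chunks = split_text(document, chunk_size=100, chunk_overlap=20)
print(len(chunks), [len(c) for c in chunks])
```

The overlap keeps some shared context between neighbouring chunks, so a sentence cut at a chunk boundary is still partly visible in the next chunk.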


Download PostgreSQL from enterprisedb (postgresql-16.2-1-windows-x64.exe is tested).

Install PostgreSQL with postgresqltutorial.
Setup of PostgreSQL:
1. Open pgAdmin 4 from Windows Search Bar.
2. Click Browser (left side) > Servers > Postgre SQL 10.
3. Create the user postgres with password openvino (or your own setting)
4. Open SQL Shell from the Windows Search Bar to check this setup. Press Enter to accept the defaults for Server, Database, Port and Username, then type the password.

Server [localhost]: 
Database [postgres]:
Port [5432]:
Username [postgres]:
Password for user postgres:


'Official' C++ client library (language binding), built on top of C library

Update the source code from in deps\libpqxx

The pipeline connects with DB based on Libpqxx.


Open-source vector similarity search for Postgres.

By default, pgvector performs exact nearest neighbor search, which provides perfect recall. It also supports approximate nearest neighbor search (HNSW), which trades some recall for speed.
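Exact nearest neighbor search amounts to comparing the query against every stored vector. The NumPy sketch below only illustrates that idea outside the database; pgvector performs the search inside PostgreSQL. The function name `exact_nearest_neighbors`, the toy embeddings, and the choice of L2 distance as the metric are assumptions for the example.

```python
import numpy as np

def exact_nearest_neighbors(query, embeddings, top_k):
    """Return indices of the top_k embeddings closest to query by L2 distance."""
    # Exhaustive scan: one distance per stored vector, so recall is perfect.
    distances = np.linalg.norm(embeddings - query, axis=1)
    return np.argsort(distances)[:top_k]

embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
query = np.array([0.9, 0.1])
nearest = exact_nearest_neighbors(query, embeddings, top_k=2)
print(nearest)
```

The cost grows linearly with the number of stored vectors, which is the trade-off that approximate indexes such as HNSW address.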

On Windows, ensure that C++ support in Visual Studio 2022 is installed, then use nmake to build in the Command Prompt for VS 2022 (run as Administrator). Please follow with the pgvector

Enable the extension (do this once in each database where you want to use it), run SQL Shell from Windows Search Bar with "CREATE EXTENSION vector;".

Printing CREATE EXTENSION shows successful setup of Pgvector.


pgvector support for C++ (supports libpqxx). The headers (pqxx.hpp, vector.hpp, halfvec.hpp) are copied into the local folder rag_sample\include. Our pipeline does the vector similarity search for the chunks embeddings in PostgreSQL, based on pgvector-cpp.

Install OpenVINO, VS2022 and Build this pipeline

Download the 2024.2 release from the OpenVINO™ archives*. This prebuilt OV package is for the C++ OpenVINO pipeline, so there is no need to build the source code. Install the latest Visual Studio 2022 Community for the C++ dependencies and for editing the LLM C++ pipeline.

Extract the zip file to any location and set the environment variables by dragging setupvars.bat into the Command Prompt terminal. setupvars.ps1 is used for the PowerShell terminal. <INSTALL_DIR> below refers to the extraction location. Run the following commands in the Command Prompt terminal.

git submodule update --init
cd openvino.genai
cmake -S .\ -B .\build\ && cmake --build .\build\ --config Release -j8
cd .\build\samples\cpp\rag_sample\Release


  • Install on Windows: Copy all the DLL files of PostgreSQL, OpenVINO, tbb and openvino-genai into the release folder. The PostgreSQL DLL files are located in the PostgreSQL installation path, for example "C:\Program Files\PostgreSQL\16\bin".
  • If cmake is not available in the Command Prompt terminal, use the Developer Command Prompt for VS 2022 instead.
  • The openvino tokenizer in the third-party folder needs several minutes to build. The -j option specifies the number of parallel jobs; it is set to 8 here.
  • Once cmake finishes, check for rag_sample_client.exe and rag_sample_server.exe in the relative path .\build\samples\cpp\rag_sample\Release.
  • If cmake completed without errors but the exe files are not found, open .\build\OpenVINOGenAI.sln in VS2022, set the solution configuration to Release instead of Debug, then build the llm project within VS2022 again.


Launch RAG Server

rag_sample_server.exe --llm_model_path TinyLlama-1.1B-Chat-v1.0 --llm_device CPU --embedding_model_path bge-large-zh-v1.5 --embedding_device CPU  --db_connection "user=postgres host=localhost password=openvino port=5432 dbname=postgres"

Launch RAG Client


Launch Python Client

Use python client to send the message of DB init and send the document chunks to DB for embedding and storing.

python --docs


SDPA Enabling for Custom Model

July 4, 2024

Authors: Su Yang, Xiake Sun, Fiona Zhao


To enable the SDPA fusion on GPU, we first need to convert the model to IR with the SDPA op.

Create a new class SdpaAttention in the modeling_ using torch.nn.functional.scaled_dot_product_attention. This PyTorch op can be matched and converted into the OpenVINO SDPA op.

Fig1: from Attention Is All You Need

Refer to Phi3SdpaAttention, this module is inherited from Phi3Attention as the weights of the module stay untouched. The only changes are on the forward pass to adapt to SDPA API. 


From the equivalent implementation, the target is to replace the related Pytorch Ops(like softmax, matmul and dropout) with torch.nn.functional.scaled_dot_product_attention.

Fig2: implementation equivalent from PyTorch 2.3 documentation

For some custom models, the equivalent code is as follows:

attn_weights = torch.matmul(query_states, key_states.transpose(2, 3))
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=query_states.dtype)
attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout_rate, training=self.training)
attn_output = torch.matmul(attn_weights, value_states)

The corresponding implementation of scaled_dot_product_attention:

attn_output = torch.nn.functional.scaled_dot_product_attention(
    query_states, key_states, value_states, attn_mask=attention_mask,
    dropout_p=self.attention_dropout_rate if self.training else 0.0,
    is_causal=attention_mask is None and q_len > 1,
)

SDPA’s Scaling factor with different head_size

For a PyTorch model with a different head_size, the scaled_dot_product_attention call needs to be modified to use the matching scaling factor.

attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
attn_output = torch.nn.functional.scaled_dot_product_attention(
    query_states, key_states, value_states, attn_mask=attention_mask,
    dropout_p=self.attention_dropout_rate if self.training else 0.0,
    is_causal=attention_mask is None and q_len > 1,
    scale=1.0 / math.sqrt(self.head_dim),
)
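The role of the scaling factor can be checked with a small NumPy sketch. This is a reference re-implementation for illustration only, without dropout or masking, and not the PyTorch or OpenVINO kernel; `sdpa_reference` and the tensor shapes are assumptions. It relies on the documented PyTorch behaviour that scaled_dot_product_attention scales by 1 / sqrt(head_dim) of the query when no scale is given.

```python
import math
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def sdpa_reference(query, key, value, scale=None):
    """softmax(Q K^T * scale) V, with scale defaulting to 1 / sqrt(head_dim)."""
    if scale is None:
        scale = 1.0 / math.sqrt(query.shape[-1])
    scores = (query @ np.swapaxes(key, -1, -2)) * scale
    return softmax(scores, axis=-1) @ value

rng = np.random.default_rng(0)
batch, heads, q_len, head_dim = 1, 2, 3, 64
query = rng.standard_normal((batch, heads, q_len, head_dim))
key = rng.standard_normal((batch, heads, q_len, head_dim))
value = rng.standard_normal((batch, heads, q_len, head_dim))

# Explicit path: matmul, divide by sqrt(head_dim), softmax, matmul.
attn_weights = softmax((query @ np.swapaxes(key, 2, 3)) / math.sqrt(head_dim))
explicit = attn_weights @ value

default_scale = sdpa_reference(query, key, value)
unscaled = sdpa_reference(query, key, value, scale=1.0)
print(np.allclose(explicit, default_scale), np.allclose(explicit, unscaled))
```

If the original model divides by something other than sqrt(head_dim), or does not scale at all, the fused call only matches when the same factor is passed explicitly.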

The SDPA operator has already been implemented. In the OV24.2 release, SDPA fusion on GPU supports head size = 128 only.

The OV24.3 release relaxes the SDPA head size limitation for LLMs from 128 only to a range of 64 to 256, thanks to Sergey's PR.


Replace the original with the new script (with SdpaAttention) in the Pytorch model folder.  


  • After converting model again, check the SDPA layers (“aten::scaled_dot_product_attention”) in the OV IR .xml file.
  • Double check the OpenVINO executable graph for the SDPA enabling.  
  • Don’t forget to check the accuracy of the Pytorch model inference, after any modification with


In this blog, we introduce how to use torch.nn.functional.scaled_dot_product_attention to enable SDPA for a custom model.

The performance improvement with SDPA on MTL iGPU depends on the model structure. Enabling SDPA for the custom model is the base for further optimization like Page Attention.


Q2'24: Technology Update – Low Precision and Model Optimization


Alexander Kozlov, Nikita Savelyev, Vui Seng Chua, Souvikk Kundu, Nikolay Lyalyushkin,  Andrey Anufriev, Pablo Munoz, Alexander Suslov, Liubov Talamanova, Yury Gorbachev, Nilesh Jain, Maxim Proshin


This quarter we see an increasing interest in KV-cache optimization of Large Language and Vision Models. This is actually expected, as the KV-cache is becoming a bottleneck now that the weight compression problem is solved to some degree. We also believe that KV-cache optimization will remain a hot topic, as it is also involved in the video generation scenario, where we see a lot of work going on nowadays.


  • QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving by MIT, NVIDIA, UMass Amherst, MIT-IBM Watson AI Lab. A regular work from Song Han Lab which is a comprehensive study of deep LLM optimization and a reference design of a tool for LLM serving. The LLM optimization part includes: a W4A8 and 4-bit KV-cache quantization approach; progressive quantization of weights, to comply with 8-bit compute after dequantizing 4-bit weights to 8 bits; and the SmoothAttention method, which reduces the error of 4-bit quantization of the Key cache, is compatible with the RoPE operation, and can be fused into a preceding Linear layer. The inference part contains tips and tricks to design efficient inference kernels and execution pipelines on Nvidia GPUs. The method shows superior results compared to competitive solutions and demonstrates the ability to substantially reduce LLM serving costs. Some code and pre-compiled binaries are available here:
  • ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification by Houmo AI and Chinese universities. The authors present a KV cache quantization method for LLMs. First, they construct a strong baseline for quantizing the KV cache. Through the proposed channel-separable token-wise quantization scheme, the memory overhead of quantization parameters is substantially reduced compared to fine-grained group-wise quantization. To enhance the compression ratio, they propose a normalized attention score. The quantization bit-width for each token is adaptively assigned based on its saliency. The authors also develop an approximation method that decouples the saliency metric from full attention scores, making it compatible with FlashAttention. Experiments demonstrate that the method achieves good compression ratios at fast generation speed. For example, when evaluating the Mistral-7B model on the GSM8k dataset, the method is capable of compressing the KV cache by 4.98×, with only a 0.38% drop in accuracy.
  • BitsFusion: 1.99 bits Weight Quantization of Diffusion Model by Snap Inc. and Rutgers University. The paper provides a thorough analysis of UNet weight-only quantization of the Stable Diffusion 1.5 model. The authors propose an approach for mixed-precision quantization of diffusers. They quantize different layers into different bits according to their quantization error. The authors also introduce several techniques to initialize the quantized model to improve performance, including time embedding pre-computing and caching, adding balance integer, and alternating optimization for scaling factor initialization. Finally, they propose a two-stage Quantization-aware training where distillation is used at the first stage. The quantized model achieves very good results on various benchmarks. Code will be released here:
  • Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs by Cornell University and Google. The paper investigates non-uniform quantization data formats by profiling the distributions of weights and activations across 30 models, including both LLM and non-LLM models. The authors discovered that Student's t-distribution is a better fit than the Gaussian distribution due to its flexible parameterization, which can resemble Gaussian, Cauchy, or other distributions observed in different neural networks. The authors derived Student Float (SF4) using a design process similar to that of Normal Float (NF4). SF4 outperforms NF4, FP4, and Int4 in accuracy retention across most cases and model architectures, making it a strong drop-in replacement for lookup-based datatypes like NF4. The paper proposes using SF4 as a reference to extend supernormal support for existing datatypes like E2M1 (one variant of FP4) and APoT4, by reassigning negative zero to a useful value, which is otherwise wasted. Additionally, the paper examines the Pareto frontier of datatypes in terms of model accuracy and MAC chip area, concluding that APoT4 and its supernormal extension are Pareto optimal for a set of models smaller than 7B parameters.
  • ShiftAddLLM: MatMul-free LLM via Inference Time Reparameterization by Intel, Google DeepMind, Google, Georgia Tech. The authors developed an inference-time reparameterization for traditional LLM layers with MatMul ops that converts them to layers with only Shift-Add and LUT query-based operations. Specifically, the authors quantize each weight matrix into binary matrices paired with group-wise scaling factors. The associated multiplications are reparameterized into (1) shifts between activations and scaling factors and (2) queries and adds according to the binary matrices. To reduce accuracy loss, they present a multi-objective optimization method to minimize both weight and output activation reparameterization errors. Additionally, based on varying sensitivity across layers to reparameterization, they develop an automated bit allocation strategy to further reduce memory usage and latency. The code is available at:


  • LQER: Low-Rank Quantization Error Reconstruction for LLMs by Imperial College London and University of Cambridge. The paper combines quantization and low-rank approximation techniques to achieve accurate and efficient LLM optimization. The method employs the MXINT4 datatype (int4 + shared exponent for 4 elements) for weight quantization while quantizing activations into 8 or 6 bits with per-token scaling factors. The method also introduces 8-bit LoRA adapters to restore accuracy after weight quantization. It does not use any kind of fine-tuning. Instead, it decomposes the quantization error into two low-rank matrices. The method achieves very accurate results in W4A8 and W4A6 settings, especially on the Llama-2 model family.
  • LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models by Beihang University, SenseTime Research, and Nanyang Technological University. The paper focuses on identifying the most effective practices for quantizing LLMs, with the goal of balancing performance with computational efficiency. For a fair analysis, the authors develop a quantization toolkit, LLMC, and design four crucial principles considering inference efficiency, quantized accuracy, calibration cost, and modularization. By benchmarking on various models and datasets with over 500 experiments, three takeaways corresponding to calibration data, quantization algorithm, and quantization schemes are derived. Finally, a best-practice LLM PTQ pipeline is constructed. All the benchmark results and the toolkit can be found at
  • SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models by Houmo AI and Chinese universities. The paper addresses the problem of extremely low bit-width KV cache quantization. To achieve this, it proposes a method that rearranges the channels of the KV cache in order to improve the similarity of channels in quantization groups and applies clipped dynamic quantization at the group level. Additionally, the method ensures that the most recent window tokens in the KV cache are preserved with high precision. This helps maintain the accuracy of a small but important portion of the KV cache. Evaluation on LLMs demonstrates that the method surpasses previous quantization approaches, allowing for quantization of the KV cache to 2-bit keys and 1.5-bit values with minimal loss of accuracy. The code is available at
  • Integer Scale: A Free Lunch for Faster Fine-grained Quantization of LLMs by Meituan. The paper proposes a scheme to use integer scales when computing dot products of W4A8-quantized LLMs. It allows keeping group scales for weights in integer precision as well and using an INT32 buffer as the accumulator of partial dot products. An additional floating-point scale is required and applied to the super-group of dot products between weights and activations. This brings the proposed method close to the known double quantization approach. The paper provides extensive evaluation data for Llama2 and Llama3 models showing results close to those of the baseline floating-point scales.
  • Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs by Hanyang University. The paper aims at reducing the accuracy degradation of fully-quantized LLMs (both weights and activations are quantized). The authors propose two empirical methods, Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), to isolate the activation spikes during quantization that cause most of the accuracy drop. Essentially, they propose a way to identify which layers are more error-prone and keep these layers in floating-point precision. The code is available at
  • AdpQ: A Zero-shot Calibration Free Adaptive Post Training Quantization Method for LLMs by Huawei Noah Lab and McGill University. The paper presents a novel zero-shot adaptive PTQ method for LLMs that does not require any calibration data. Inspired by the Adaptive LASSO regression model, the authors propose an approach that tackles the challenge of outlier activations by separating salient weights using an adaptive soft-thresholding method. Guided by Adaptive LASSO, this method ensures that the quantized weight distribution closely follows the originally trained weights and eliminates the need for calibration data entirely. The method achieves good results with a much shorter quantization time.
  • PTQ4SAM: Post-Training Quantization for Segment Anything by Beihang University. A practical study on quantization of the Segment Anything model. The authors observe a bimodal distribution that is challenging for quantization and analyze its characteristics. To overcome it, they propose a Bimodal Integration (BIG) strategy, which automatically detects it and equivalently transforms the bimodal distribution into a normal distribution. They also present Adaptive Granularity Quantization, which represents diverse post-Softmax distributions accurately with appropriate granularity. Experiments show that the method can achieve good results even in low-bit quantization settings (6 or 4 bits). Code is available at
  • QNCD: Quantization Noise Correction for Diffusion Models by Kuaishou Technology. The authors identify two primary quantization challenges for Diffusion models: intra and inter quantization noise. Intra quantization noise, exacerbated by embeddings in the resblock module, extends activation quantization ranges, increasing disturbances in each single denoising step. In addition, inter quantization noise stems from cumulative quantization deviations across the entire denoising process, altering data distributions step by step. The authors propose embedding-derived feature smoothing for eliminating intra quantization noise and a runtime noise estimation module for dynamically filtering inter quantization noise. Experiments demonstrate that the method achieves good results in W4A8 and W8A8 quantization settings on ImageNet (LDM-4). Code is available at:
  • SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models by ETH Zürich, University of Hong Kong, and Beihang University. The paper focuses on the problem of ultra-low bit weight quantization of LLMs. Specifically, it proposes a method that relies on two novel techniques: (1) Salience-Determined Bit Allocation utilizes the clustering characteristics of the salience distribution to allocate the bit-widths of each quantization group. This increases the accuracy of quantized LLMs and keeps inference efficiency high; (2) Salience-Weighted Quantizer Calibration optimizes the parameters of the quantizer by considering the element-wise salience within the group. The method is evaluated in two setups for quantization parameter tuning: greedy search and gradient-based search. Evaluation shows good results on Llama 1/2/3 models. Code is available at
  • LCQ: Low-Rank Codebook based Quantization for Large Language Models by Nanjing University. The paper proposes a method for LLM optimization that uses customized low-rank codebooks, the rank of which can be larger than one, for quantization. A gradient-based optimization algorithm is proposed to optimize the parameters of the codebook. The method also adopts a double quantization strategy for compressing the parameters of the codebook, which can reduce the storage cost of the codebook. Experiments show that the method achieves better accuracy than existing methods with a negligible extra storage cost.
  • P2-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer by Nanjing University and Sun Yat-sen University. The paper introduces a Power-of-Two (PoT) post-training quantization and acceleration framework for ViT models. The authors analyze ViTs' properties and develop a dedicated quantization scheme. This scheme incorporates techniques such as adaptive PoT rounding and PoT-aware smoothing, allowing for the efficient quantization of ViTs with PoT scaling factors. By doing this, computationally expensive floating-point multiplications and divisions within the re-quantization process can be replaced with hardware-efficient bitwise shift operations. Furthermore, the authors introduce a coarse-to-fine automatic mixed-precision quantization methodology for better accuracy-efficiency trade-offs. Finally, the authors build a dedicated accelerator engine to better leverage the algorithmic properties for enhanced hardware efficiency. Code is available at:
  • QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead by New York University and Adobe Research. The paper studies the problem of KV-cache quantization of LLMs, specifically the Key part, as it is more error-prone when its precision is lowered. The authors propose an approach that consists of a Johnson-Lindenstrauss (JL) transform followed by sign-bit quantization for the Key cache. They introduce an asymmetric estimator for the inner product of two vectors and demonstrate that applying the method to one vector and a standard JL transform without quantization to the other provides an unbiased estimator with minimal distortion. They also developed a CUDA-based implementation for optimized computation. When applied across various LLMs and NLP tasks to quantize the KV cache to only 3 bits, the method demonstrates a more than fivefold reduction in KV cache memory usage without a significant accuracy drop. Code will be available at
  • ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation by Tsinghua University, Infinigence AI, Microsoft, and Shanghai Jiao Tong University. The paper tackles the problem of accurate quantization of diffusion vision transformer models. Essentially, the authors apply dynamic 8-bit per-token quantization to activations. They also propose to smooth activations with a SmoothQuant-like approach but with different α factors tuned to each iteration of the diffusion process. Finally, the authors propose to select a per-layer weight bit-width (e.g. W4A8, W6A6, or W8A8) depending on the sensitivity and position of the layer in the Transformer block. All these tricks lead to very good accuracy results in image and video generation tasks.
  • Instance-Aware Group Quantization for Vision Transformers by Yonsei University and Articron. This paper introduces an approach for instance-aware group quantization for ViTs (IGQ-ViT). According to the approach, channels of activation maps are dynamically split into multiple groups, where each group has its own set of quantization parameters. The authors also extend their scheme to quantize softmax attentions across tokens. IGQ-ViT demonstrates superior accuracy results across image classification, object detection, and instance segmentation tasks. The authors claim that the performance overhead induced by dynamic quantization is no more than 4% compared to layer-wise quantization.
  • Reg-PTQ: Regression-specialized Post-training Quantization for Fully Quantized Object Detector by Beihang University. In this paper, the authors explore full quantization of object detection models, contrary to most existing approaches, which quantize only detection backbones and keep the detection head in original precision. Based on the findings, the reason behind poor quantization of detector heads is that they are optimized to solve regression tasks. Specifically, the authors argue that (1) regressors are more sensitive to perturbation compared to classifiers, (2) minimizing quantization error does not necessarily result in optimal scaling factors for regressors, and (3) regressor weights follow a non-uniform distribution, contrary to classifiers. To tackle these problems, a novel Reg-PTQ method is introduced. Based on the results, it achieves 7.6x and 5.4x reductions in computation and storage consumption under INT4 precision with little performance degradation.
  • Towards Accurate Post-training Quantization for Diffusion Models. In this paper, the authors propose a method for accurate post-training quantization of diffusion models. The main idea is to split diffusion timesteps for each layer into groups, where each group corresponds to its own set of quantization parameters. Such a split is obtained by minimizing an optimization objective on a calibration dataset. Besides this, a special timestep selection method is employed for sampling timesteps for calibration. Overall, the method demonstrates superior generation quality results over such baselines as LSQ, PTQ4DM and Q-Diffusion.
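
To make the QJL entry above more concrete, here is a minimal pure-Python sketch of a sign-bit JL inner-product estimator of the kind described: the key is projected with a random Gaussian matrix and only the signs plus the key norm are kept, while the query is projected without quantization. The function names, the dimensions, and the sqrt(pi/2) scaling (taken from the standard identity for Gaussian projections) are assumptions for illustration, not the authors' CUDA implementation.

```python
import math
import random

def make_projection(m, d, seed=0):
    """Random Gaussian JL matrix with m rows and d columns (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quantize_key(S, k):
    """Keep only the sign bit of each projected coordinate, plus the key norm."""
    signs = [1 if dot(row, k) >= 0 else -1 for row in S]
    return signs, math.sqrt(dot(k, k))

def estimate_inner_product(S, q, signs, k_norm):
    """Asymmetric estimator: unquantized projection of q against the sign bits of k."""
    m = len(S)
    acc = sum(dot(row, q) * s for row, s in zip(S, signs))
    # For Gaussian rows s: E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q, k> / ||k||
    return math.sqrt(math.pi / 2) * k_norm * acc / m

# Toy data (assumed sizes): 16-dimensional vectors, 2000 projection rows
d, m = 16, 2000
rng = random.Random(1)
q = [rng.uniform(-1, 1) for _ in range(d)]
k = [rng.uniform(-1, 1) for _ in range(d)]

S = make_projection(m, d)
signs, k_norm = quantize_key(S, k)
approx = estimate_inner_product(S, q, signs, k_norm)
exact = dot(q, k)
print(f"exact={exact:.3f} approx={approx:.3f}")
```

Only one bit per projected coordinate is stored for the key, and the estimate tightens as the number of projection rows `m` grows.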


  • Effective Interplay between Sparsity and Quantization: From Theory to Practice by Google and EcoCloud. The authors provide a theoretical analysis of how sparsity and quantization interact. Mathematical proofs establish that applying sparsity before quantization (S → Q) is the optimal sequence for compression. The authors demonstrate that sparsity and quantization are not orthogonal operations: combining them introduces additional errors beyond the sum of their individual errors. They validate the theoretical findings through experiments covering a diverse range of models, including prominent LLMs (OPT, LLaMA) and ViTs. The code will be published at:
  • Prompt-prompted Mixture of Experts for Efficient LLM Generation by CMU. The authors introduce GRIFFIN, a training-free MoE that selects unique FF experts at the sequence level for efficient generation across a plethora of LLMs with different non-ReLU activation functions. This is possible due to a critical observation that many trained LLMs naturally produce highly structured FF activation patterns within a sequence, which the authors call flocking. Despite the method's simplicity, with 50% of the FF parameters, GRIFFIN maintains the original model's performance with little to no degradation on a variety of classification and generation tasks, all while improving latency (e.g. 1.25× speed-up in Llama 2 13B on an NVIDIA L40). Code is available at
  • Sparse maximal update parameterization: A holistic approach to sparse training dynamics by Cerebras Systems. The paper addresses the common issue in sparse training where hyperparameters from dense training are reused, leading to suboptimal convergence and requiring extensive tuning for different sparsity ratios. The researchers introduce a novel sparse training methodology called Sparse Maximal Update Parameterization (SμPar), which extends the maximal update parameterization (μP) to sparse training. SμPar involves reparameterizing (see Table 1 in the paper) weight initialization and learning rates relative to changes in sparsity, effectively preventing exploding or vanishing signals and maintaining stable activation, gradient, and weight update scales across varying sparsity levels and model widths. Remarkably, the SμPar reparameterization allows zero-shot hyperparameter transfer, i.e. practitioners can tune small proxy models (dense or sparse) and transfer optimal HPs directly to models at scale for any model sparsity, thus enhancing the efficiency and reducing the cost of sparse model development. Experiments demonstrate that SμPar achieves the Pareto-frontier best loss across all sparsities and widths, including a large dense model with a width equal to that of GPT-3 XL.
  • Sparse Expansion and Neuronal Disentanglement by MIT, IST Austria, Neural Magic. Sparse Expansion is an approach for converting dense LLMs into a mixture of sparse experts to attain inference efficiency. The method begins with applying dimensionality reduction (PCA) on the inputs of FFN linear layers, followed by k-means clustering. The intuition is that tokens within a cluster can share a sparse expert without significant distortion. SparseGPT is then used to create a sparse expert for each cluster group. During inference, the PCA and k-means models act as routers, directing tokens to the appropriate sparse expert based on their cluster. While this increases the overall model size, acceleration is achieved through the conditional execution of experts and the sparse execution of these experts, with minimal cost for the routers. The paper includes layer-wise speedup benchmarks and shows that Sparse Expansion outperforms other one-shot sparsification approaches in perplexity for the same inference FLOP budget per token. A significant portion of the paper is dedicated to the concept of neuron entanglement, explaining and quantifying the efficacy of sparse expansion.
  • MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning by University of Trento and Cisco Research. The authors highlight that existing techniques for pruning Vision-Language models (VLMs) are task-specific and propose a task-agnostic method for pruning VLMs. The proposed Multimodal Flow Pruning framework has the following properties: (1) the importance of a weight is computed based on the saliency of the neurons it connects; and (2) parameters are pruned with regard to the modality whose features they are used to compute, which avoids pruning too much from one modality and too little from another. Experiments show that the proposed MULTIFLOW method outperforms recent, more sophisticated competitors.

Other methods

  • Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation by Jasper Research. The paper proposes a LoRA-compatible distillation method aiming at reducing the number of sampling steps required to generate high-quality samples from a trained diffusion model. The authors emphasize the versatility of the method through an extensive experimental study across various tasks (text-to-image, image inpainting, super-resolution, face-swapping) and diffusion model architectures (SD1.5, SDXL and Pixart-α), and illustrate its compatibility with adapters. The method is relatively lightweight and can optimize the SD1.5 model on 2 Nvidia H100 80GB GPUs with 13 hours of fine-tuning. Code is available at
  • GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection by California Institute of Technology, Meta AI, University of Texas at Austin, and Carnegie Mellon University. The paper introduces Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. The idea is to use PCA after a number of training steps to obtain a gradient projection matrix and use it to get a low-rank gradient matrix that is used for the weight update. The approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training. 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. It demonstrates the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory. The code is available at:
  • MiniCache: KV Cache Compression in Depth Dimension for Large Language Models by ZIP Lab of Monash and Zhejiang University. The authors propose a training-free KV cache compression technique that merges KV tokens across every two consecutive transformer layers, based on the observation that KV tokens are highly similar across depth, especially from the middle to the last transformer layers. Specifically, a pair of K/V projections from two consecutive layers can be encoded into respective scaling factors and a shared directional vector computed via Spherical Linear Interpolation (SLERP). To address the information loss from merging dissimilar tokens, the algorithm uses an angular distance to filter KV positions for retention. The algorithm is straightforward, involving calibration of only two hyperparameters, and it has been shown to improve a KV cache compressed 4X by 4-bit quantization to over 5X compression while retaining reasonable accuracy of instruction-tuned Mistral and Llama2-7B across benchmarks.
  • Scalable MatMul-free Language Modeling by University of California, Soochow University, LuxiTech. The authors develop a MatMul-free language model by using additive operations in dense layers and element-wise Hadamard products for self-attention-like functions. Specifically, ternary weights eliminate MatMul in dense layers, similar to BNNs. To remove MatMul from self-attention, they optimize the Gated Recurrent Unit (GRU) to rely solely on element-wise products and show that this model competes with state-of-the-art Transformers while eliminating all MatMul operations. To quantify the hardware benefits of lightweight models, the authors provide an optimized GPU implementation in addition to a custom FPGA accelerator. By using fused kernels in the GPU implementation of the ternary dense layers, training is accelerated by 25.6% and memory consumption is reduced by up to 61.0% over an unoptimized baseline on GPU. Furthermore, by employing lower-bit optimized CUDA kernels, inference speed is increased by 4.57 times, and memory usage is reduced by a factor of 10 when the model is scaled up to 13B parameters. The code is available at
  • Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding by Hong Kong Polytechnic, Peking University, Microsoft Research Asia and Alibaba. While LLMs have proliferated over the past two years, Speculative Decoding (SD) has emerged as a crucial paradigm to accelerate autoregressive generation. This survey is among the first to provide a comprehensive introduction and overview of the state of the art in SD, highlighting key developments in this space. A main contribution of this work is the introduction of Spec-Bench, a unified benchmark for evaluating SD methods across standardized subtasks such as multi-turn conversation, summarization, RAG, translation, question answering, and mathematical reasoning. The code and benchmarks for various SD methods on RTX 3090 and A100 GPUs are accessible for further exploration and validation.
  • Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism by Meituan and Meta AI. The paper introduces an early-exiting framework for generating draft tokens, which allows a single LLM to fulfill both the drafting and verification stages. The model is trained using self-distillation. The authors conceptualize the generation length of draft tokens as a multi-armed bandit problem and propose a control mechanism based on Thompson Sampling, which leverages sampling to devise an optimal strategy. They conducted experiments on three benchmarks and showed that the method can significantly improve the model's inference speed.
  • LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding by Meta, University of Toronto, Carnegie Mellon University, University of Wisconsin-Madison, Dana-Farber Cancer Institute. The authors research the idea of early exit in LLMs for speculative decoding. First, during training, they apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit. Second, during inference, they show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model. Third, they present a self-speculative decoding solution where the model exits at early layers and then verifies and corrects with the remaining layers of the model. They run experiments on different Llama model sizes with different types of training: pretraining from scratch, continual pretraining, finetuning on a specific data domain, and finetuning on a specific task, and show speedups of up to 2.16× on summarization for CNN/DM documents, 1.82× on coding, and 2.0× on the TOPv2 semantic parsing task.
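
The MiniCache entry above describes encoding a pair of K/V vectors from two consecutive layers as per-layer scaling factors plus one shared direction obtained via SLERP. A minimal pure-Python sketch of that idea follows. The function names, the interpolation parameter `t = 0.5`, and the toy vectors are assumptions for illustration, and the retention step for dissimilar tokens is omitted.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def slerp(u, v, t):
    """Spherical linear interpolation between unit vectors u and v."""
    cos_omega = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(cos_omega)
    so = math.sin(omega)
    if so < 1e-8:
        # Nearly parallel or anti-parallel: SLERP is ill-conditioned, fall back to u
        return list(u)
    wu = math.sin((1 - t) * omega) / so
    wv = math.sin(t * omega) / so
    return [wu * a + wv * b for a, b in zip(u, v)]

def merge_pair(x_l, x_next, t=0.5):
    """Return one shared unit direction and the two original magnitudes."""
    n_l, n_next = norm(x_l), norm(x_next)
    u = [a / n_l for a in x_l]
    v = [b / n_next for b in x_next]
    return slerp(u, v, t), n_l, n_next

def restore(shared_dir, magnitude):
    """Approximate a layer's vector from the shared direction and its own scale."""
    return [magnitude * a for a in shared_dir]

# Toy vectors standing in for the same token's key in two adjacent layers
x_l = [1.0, 2.0, 2.0]
x_next = [1.2, 1.9, 2.3]

shared, n_l, n_next = merge_pair(x_l, x_next)
approx_l = restore(shared, n_l)
approx_next = restore(shared, n_next)
```

In this sketch the two vectors are stored as one direction and two scalars, so the saving comes from sharing the direction; the restored vectors keep their original norms and differ from the originals only in angle.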


  • INT4 Decoding GQA CUDA Optimizations for LLM Inference by Meta. The authors provide a comprehensive study and ten practical steps, including KV-cache quantization, to improve the performance of Grouped-query Attention. All these optimizations result in performance improvements of up to 1.8x on the NVIDIA A100 GPU and 1.9x on the NVIDIA H100 GPU.
  • torchao: PyTorch Architecture Optimization by Meta. A PyTorch library for quantization and sparsity. Currently available features include full-model quantization, INT8, INT4, and MXFP4/6/8 weight-only quantization, and efficient model fine-tuning with the GaLore method.
  • Introducing Apple's On-Device and Server Foundation Models by Apple. Apple has established a set of pre-trained and optimized models for its HW. The claim is that a 3B LLM can run at 30 tokens/s on iPhone 15 Pro. In terms of the optimizations used, the authors claim weight palettization to 2 and 4 bits, quantization of embeddings and activations, and an efficient Key-Value (KV) cache update. They use their own AXLearn library, built on top of JAX and XLA, for model pre-training and fine-tuning.
  • BitBLAS by Microsoft. A library to support mixed-precision BLAS operations on GPUs. BitBLAS aims to support efficient mixed-precision DNN model deployment, especially quantization in large language models (LLMs), for example W4A16 in GPTQ, W2A16 in BitDistiller, and W2A8 in BitNet-b1.58.