OpenVINO Optimization-LLM Distributed


Authors: Kunda, Xu / Zhai, Xiuchuan / Sun, Xiaoxia / Li, Tingqian / Shen, Wanglei

Introduction
With the continuous development of deep learning, large models have become key technologies in many fields, such as LLMs and multimodal models. Training a large model requires massive computing resources, and LLM inference and deployment are also resource-intensive, which has driven the widespread adoption of distributed parallel techniques. Distributed parallelism spreads a large model across multiple compute nodes and accelerates training and inference through parallel computation. The main strategies include Model Parallelism (MP), Data Parallelism (DP), Pipeline Parallelism (PP), and Tensor Parallelism (TP).

In this blog we will focus on OpenVINO's distributed optimization of LLMs via tensor parallelism.

Tensor parallelism is a technique used to fit a large model across multiple devices. For example, when multiplying an input tensor with a weight tensor, the matrix multiplication is equivalent to splitting the weight tensor column-wise, multiplying each column block with the input separately, and then concatenating the separate outputs. These partial outputs are then gathered from the CPUs/GPUs and concatenated together to get the final result.
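
The equivalence can be checked with a minimal NumPy sketch (the variable names and shapes below are illustrative, not OpenVINO internals):

import numpy as np

x = np.random.rand(4, 16)          # input activations [batch, in_features]
w = np.random.rand(16, 32)         # weight matrix [in_features, out_features]

# Full matmul on one device
y_full = x @ w

# Tensor parallel: split the weight column-wise across two "devices",
# run the two smaller matmuls independently, then concatenate the outputs
w0, w1 = np.split(w, 2, axis=1)
y0 = x @ w0                        # computed on device 0
y1 = x @ w1                        # computed on device 1
y_tp = np.concatenate([y0, y1], axis=1)

assert np.allclose(y_full, y_tp)   # both paths give the same result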

Fig 1. Tensor Parallel Schematic

Taking the tensor parallel implementation of the Fully Connected (FC) layer as an example, we introduce OpenVINO's distributed optimization of LLMs. When profiling the inference time of LLM workloads, we found that, due to the large parameter size of LLMs, the FC layers account for a large share of the total time. By optimizing the latency of the FC layer, the LLM first token/second token latency can be effectively reduced.

A fully connected layer is a neural network layer in which each neuron applies a linear transformation to the input vector through a weight matrix (in matrix form, y = x·W + b). As a result, all possible layer-to-layer connections are present, meaning every element of the input vector influences every element of the output vector.

In OpenVINO, SplitFC is used to split the FC layer across two device nodes to implement tensor parallelism.

SplitFC detects the FC CPU operation, with or without compressed weights, and then splits it into several smaller FCs along the output channel dimension according to the number of sub-streams. The goal is for the executor to dispatch the split FCs to different NUMA nodes in the system, so that the split FCs can be executed in parallel.
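
The Python sketch below mimics that idea at a high level; it is not the OpenVINO implementation, and the thread pool merely stands in for the executor dispatching sub-streams to different NUMA nodes:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def split_fc(x, w, b, num_sub_streams=2):
    """Split an FC layer (y = x @ w + b) along the output channel,
    run each sub-FC in parallel, then concatenate the partial outputs."""
    w_chunks = np.array_split(w, num_sub_streams, axis=1)   # split by output channel
    b_chunks = np.array_split(b, num_sub_streams, axis=0)

    def sub_fc(chunk):
        w_i, b_i = chunk
        return x @ w_i + b_i                                 # one sub-stream's share of the work

    # Stand-in for dispatching the split FCs to different NUMA nodes
    with ThreadPoolExecutor(max_workers=num_sub_streams) as pool:
        partial_outputs = list(pool.map(sub_fc, zip(w_chunks, b_chunks)))

    return np.concatenate(partial_outputs, axis=1)

x = np.random.rand(4, 16)
w = np.random.rand(16, 32)
b = np.random.rand(32)
assert np.allclose(split_fc(x, w, b), x @ w + b)             # same result as the full FC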

Fig 2. Fully Connected Layer Split workflow

The following steps describe how to use the distributed feature in OpenVINO to deploy the model on two CPU sockets and improve LLM performance.

Step 1. Set up the environment


$ python -m venv venv_ov_distributed
$ source venv_ov_distributed/bin/activate
$ pip install openvino-nightly

Step 2. Clone the openvino.genai source code and install requirements


$ git clone https://github.com/openvinotoolkit/openvino.genai
$ cd ./openvino.genai/llm_bench/python
$ pip install -r requirements.txt 

Step 3. Prepare the OpenVINO IR LLM model

The conversion script for preparing benchmarking models, convert.py, allows you to reproduce the IRs stored on the shared drive. Make sure the prerequisites in requirements.txt have already been installed.


$ python convert.py --model_id "model_id_or_path" --output_dir "out_dir"

# ======= Usage example ==========

$ python convert.py --model_id meta-llama/Llama-2-7b-chat-hf --output_dir models/llama-2-7b-chat

Step 4. Run the OpenVINO GenAI LLM benchmark pipeline on a single socket


$ python benchmark.py -m "model" -d "device" -r "report_csv" -f "framework" -p "prompt text" -n "num_iters"
# =========== e.g. =============
$ python benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -n 2
$ python benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -p "What is openvino?" -n 2
$ python benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -pf prompts/llama-2-7b-chat_l.jsonl -n 2

Fig 3. Single socket LLM inference screenshot

Step 5. Run the OpenVINO GenAI LLM benchmark pipeline distributed across two sockets


$ echo '{"MODEL_DISTRIBUTION_POLICY":"TENSOR_PARALLEL"}' > ov_distributed_config.json

$ python llm_bench/python/benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -pf llm_bench/python/prompts/llama-2-7b-chat_l.jsonl -bs 1 -lc ov_distributed_config.json

Fig 4. Double socket LLM distributed inference screenshot

As can be seen from the figure above, with nstream=1 and nireq=1 the LLM runs on two sockets, and the OV_config printout shows that MODEL_DISTRIBUTION_POLICY has enabled the TENSOR_PARALLEL feature.
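
For reference, the same hint can also be set when compiling a model directly with the OpenVINO Python API instead of going through benchmark.py's -lc option. The sketch below is an assumption: the model path is hypothetical, and the property is passed as the same string key/value pair used in ov_distributed_config.json above.

import openvino as ov

core = ov.Core()
# Hypothetical path to an IR produced by convert.py in Step 3
model = core.read_model("models/llama-2-7b-chat/pytorch/dldt/FP32/openvino_model.xml")

# Pass the same tensor-parallel hint that benchmark.py reads from ov_distributed_config.json
compiled_model = core.compile_model(
    model,
    "CPU",
    {"MODEL_DISTRIBUTION_POLICY": "TENSOR_PARALLEL"},
)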

Using the OpenVINO distributed feature can significantly reduce the first token/second token latency for general LLM models.