Optimize the Post-processing of GPT2 with PPP Custom Operation


Here we provide a C++ demo for GPT-2 that optimizes the post-processing through custom nodes of the OpenVINO PPP (pre/post-processing) API, that is, by inserting custom operations such as TopK into the model graph, thereby reducing the latency of the pipeline.

GPT2 introduction

We use the GPT-2 model, which is a part of the Generative Pre-trained Transformer (GPT) family. GPT-2 is pre-trained on a large corpus of English text using unsupervised training. The model is available from HuggingFace. GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we can prime the model with an input and have it generate a lengthy continuation.

The following image illustrates the complete demo pipeline used in this scenario:

[Image 1: demo pipeline]

The OpenVINO notebook provides a Python implementation.

To deploy the model, we provide a C++ implementation with the post-processing optimization.

C++ Implementation

GPT2 Tokenizer:

The C++ implementation of the GPT2 tokenizer is from the repo gpt2-cpp.  

OpenVINO Model:

The ONNX model is downloaded during the build process from the ONNX Model Zoo repo.

  • Download the ONNX model with Git LFS.
  • Use the Python script to export the ONNX model with ONNX Runtime (ORT) 1.10.
  • The model is 634 MB, so the download may take a while.
  • Use the Python openvino-dev package to convert the ONNX model to an OpenVINO IR model via "mo -m gpt2-lm-head-10.onnx -o IR_FP32".

OpenVINO PPP Custom Operation:

Post-processing steps can be added to model outputs. As with pre-processing, these steps are integrated into the execution graph and executed on the selected device. Compared to pre-processing, fewer operations are typically needed at the post-processing stage.
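As a minimal sketch of how such a step is registered with the PPP API (the model path and output index here are assumptions, not the repo's exact values):

#include <openvino/openvino.hpp>

// Minimal sketch: register a custom post-processing step with PPP.
// The callback receives the current output node, builds extra operations
// on top of it, and returns the new output; the resulting subgraph is
// fused into the model by ppp.build() and runs on the selected device.
ov::Core core;
auto model = core.read_model("IR_FP32/gpt2-lm-head-10.xml");  // assumed path

ov::preprocess::PrePostProcessor ppp(model);
ppp.output(0).postprocess().custom(
    [](const ov::Output<ov::Node>& node) -> ov::Output<ov::Node> {
        return node;  // identity placeholder; the real ops are shown below
    });
model = ppp.build();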

Here we use the following custom operations:

1. Use a Slice node to get the token size.

2. Use a Slice node to get the new token tensor without the input tokens.

3. Use a TopK (k=20) node to get the top values and indices.

4. Use Softmax on the TopK values.

5. Use a Concat node to combine the Softmax output and the TopK indices.

The following image shows the details, and a code sketch follows it:

[Image 2: code snippet]
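Since the snippet above is only an image, here is a minimal C++ sketch of the custom callback body, continuing the registration sketch earlier. It assumes the logits output is shaped [batch, seq_len, vocab_size] and that the input token count is available as input_len; variable names are illustrative, not the repo's exact code:

#include <openvino/openvino.hpp>
#include <openvino/opsets/opset10.hpp>

using namespace ov::opset10;

// input_len: number of input tokens; k = 20 as in the list above.
ppp.output(0).postprocess().custom(
    [input_len](const ov::Output<ov::Node>& node) -> ov::Output<ov::Node> {
        auto i64 = ov::element::i64;
        // Steps 1-2: slice out the logits of the newly generated position
        // (everything after the input tokens) along the sequence axis.
        auto start = Constant::create(i64, {1}, {input_len - 1});
        auto stop  = Constant::create(i64, {1}, {input_len});
        auto step  = Constant::create(i64, {1}, {1});
        auto axes  = Constant::create(i64, {1}, {1});
        auto sliced = std::make_shared<Slice>(node, start, stop, step, axes);
        // Step 3: TopK(k=20) over the vocabulary axis -> values and indices.
        auto k = Constant::create(i64, {}, {20});
        auto topk = std::make_shared<TopK>(sliced, k, /*axis=*/-1,
                                           TopK::Mode::MAX,
                                           TopK::SortType::SORT_VALUES);
        // Step 4: Softmax over the TopK values to produce probabilities.
        auto probs = std::make_shared<Softmax>(topk->output(0), /*axis=*/-1);
        // Step 5: Concat probabilities with the (float-converted) indices.
        auto idx = std::make_shared<Convert>(topk->output(1), ov::element::f32);
        auto out = std::make_shared<Concat>(ov::OutputVector{probs, idx}, -1);
        return out->output(0);
    });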

Notice:

  • The TopK sampling of GPT-2 still needs a C++ post-processing step to draw from the probability distribution; it is difficult to implement this part with the OpenVINO RandomUniform operation (a sketch follows this list).
  • The PPP custom operation is verified on CPU.
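As a minimal sketch of that remaining C++ step (the function and parameter names are illustrative assumptions, not the repo's API), the next token can be drawn from the k probabilities and indices produced by the PPP subgraph:

#include <cstdint>
#include <random>
#include <vector>

// Draw one token id from the top-k candidates, weighted by their
// (already softmaxed) probabilities.
int64_t sample_top_k(const std::vector<float>& probs,
                     const std::vector<int64_t>& indices) {
    static std::mt19937 rng{std::random_device{}()};
    std::discrete_distribution<size_t> dist(probs.begin(), probs.end());
    return indices[dist(rng)];
}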

Installation

Download the source code from the repo.

Prepare the OpenVINO environment and modify the CMake file.

Download and convert the ONNX model.

Usage:

cd build
./gpt2-generate -h
Usage:
  GPT2 [OPTION...]

  -t, --text arg    Initial text for GPT2
  -n, --number arg  Number of new words to generate from initial text
                    (default: 1)
  -p, --ppp         Optimize with PPP for topK
  -h, --help        Print usage
  

Run with PPP:

./gpt2-generate -t "If I have 100 million U.S. dollars, how can I maximize my returns in today's artificial intelligence world?" -n 128 -p
...
OV Prediction: "If I have 100 million U.S. dollars, how can I maximize my returns in today's artificial intelligence world? I don't know!"

The world has become more complex and more complex over the past few decades. There have been many changes to the nature of technology and what makes it successful. In many countries there are large-scale artificial intelligence systems, including a few in Germany, which may be able to do better than the human brain can. But as more and more people move from the use of artificial intelligence and computer-based artificial intelligence systems (AI) to more sophisticated AI and machine learning, they are also increasingly more dependent on human skills to perform their jobs. While we can all benefit from our own understanding of human intelligence and machine"  

Run without PPP:

./gpt2-generate -t "If I have 100 million U.S. dollars, how can I maximize my returns in today's artificial intelligence world?" -n 128
...
OV Prediction: "If I have 100 million U.S. dollars, how can I maximize my returns in today's artificial intelligence world? Well, what about the rest of the world? How could you optimize your returns with the same technology?

It turns out this approach isn't just limited to the big two markets. The other way to increase your return is by making it simpler. By starting with a single-purpose solution, such as a simple-to-use app, which you can integrate into any existing application, you get an additional layer of efficiency. For instance, instead of having to add new features every time you use it, there's the possibility that you'll start seeing better results every time you're using it. As a result, this approach makes"  

The PPP post-processing optimization significantly improves performance, reducing the combined time of model inference plus post-processing by about 40% on an Intel Tiger Lake (TGL) CPU.

The accuracy of the GPT-2 C++ inference is qualitatively similar to that of the Python demo.