OpenVINO™ Enables PaddlePaddle Quantized Models

OpenVINO™ is a toolkit that enables developers to deploy pre-trained deep learning models through a C++ or Python inference engine API. The latest OpenVINO™ release supports PaddlePaddle quantized models, which helps accelerate their deployment.
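As a minimal sketch of the Python API (the model path here is an assumption for illustration), a PaddlePaddle model can be read and executed directly, with no offline conversion step:

    import numpy as np
    from openvino.runtime import Core

    core = Core()
    # OpenVINO reads the PaddlePaddle model file directly.
    model = core.read_model("resnet50_vd_infer/inference.pdmodel")
    compiled_model = core.compile_model(model, "CPU")

    # Dummy NCHW input for illustration; use real preprocessed data in practice.
    image = np.random.rand(1, 3, 224, 224).astype(np.float32)
    request = compiled_model.create_infer_request()
    request.infer({0: image})
    print(request.get_output_tensor(0).data.shape)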

From floating-point model to quantized model in PaddlePaddle

Baidu provides PaddleSlim, a toolkit for PaddlePaddle model compression. Quantization is one of its techniques: it reduces redundancy by mapping full-precision data to a fixed number of discrete levels, which lowers computational complexity and improves inference performance. To quantize a model, PaddleSlim takes the following steps (a numeric sketch of the underlying math follows the list).

  1. Insert the quantize_linear and dequantize_linear nodes into the floating-point model.
  2. Calculate the scale and zero_point in each layer during the calibration process.
  3. Convert and export the floating-point model to a quantized model according to the quantization parameters.
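Steps 1 and 2 boil down to linear quantization math. The following NumPy sketch (not PaddleSlim code, just the underlying formulas) shows how a scale and zero_point quantize a tensor to INT8 and dequantize it back:

    import numpy as np

    def quantize_linear(x, scale, zero_point):
        # Map float values to INT8: q = clip(round(x / scale) + zero_point)
        q = np.round(x / scale) + zero_point
        return np.clip(q, -128, 127).astype(np.int8)

    def dequantize_linear(q, scale, zero_point):
        # Recover an approximation of the original float values.
        return (q.astype(np.float32) - zero_point) * scale

    x = np.random.randn(4).astype(np.float32)
    scale = np.abs(x).max() / 127.0   # a simple symmetric calibration choice
    q = quantize_linear(x, scale, 0)
    print(x, dequantize_linear(q, scale, 0))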

As Figure 1 shows, the size of the quantized model is reduced by about 75% compared to the floating-point model, which matches expectations: INT8 weights occupy one byte instead of the four bytes of FP32.

Figure 1. PaddlePaddle quantized model storage size

Enabling the PaddlePaddle quantized model in OpenVINO™

As Figure 2.1 shows, paired quantize_linear and dequantize_linear nodes appear at intervals throughout the model.

Figure 2.1. PaddlePaddle quantized model with quantize_linear and dequantize_linear nodes

To enable a PaddlePaddle quantized model, the quantize_linear and dequantize_linear nodes are mapped first. Then, quantize_linear/dequantize_linear patterns can be fused into FakeQuantize nodes, and the OpenVINO™ transformation mechanism simplifies and optimizes the model graph in quantization mode.
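For intuition, a FakeQuantize node performs the quantize-then-dequantize round trip in a single operation. Below is a simplified per-tensor NumPy sketch following the formula in the OpenVINO™ operation spec; the equivalence shown in the comment assumes a symmetric INT8 scale s with zero_point 0:

    import numpy as np

    def fake_quantize(x, in_low, in_high, out_low, out_high, levels=256):
        # Clamp to the input range, snap to one of `levels` steps,
        # then rescale into the output range.
        x = np.clip(x, in_low, in_high)
        q = np.round((x - in_low) / (in_high - in_low) * (levels - 1))
        return q / (levels - 1) * (out_high - out_low) + out_low

    # A quantize_linear/dequantize_linear pair with scale s and zero_point 0
    # behaves like FakeQuantize with ranges [-128*s, 127*s] on both sides.
    s = 0.02
    x = np.random.randn(4).astype(np.float32)
    print(fake_quantize(x, -128 * s, 127 * s, -128 * s, 127 * s))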

Figure 2.2. Mapping the PaddlePaddle quantization nodes in OpenVINO™

To check which kernel implementation each layer executes, profile the run and dump the results; benchmark_app is a convenient tool for this. It provides the "-pc" option, which reports performance counter information.

  • To report the performance counter information of the PaddlePaddle resnet50 floating-point model, run:
./benchmark_app -m resnet50_vd_infer/inference.pdmodel -data_shape "[1,3,224,224]" -pc -pcsort sort
Figure 2.3. CPU profiling with resnet50_vd_infer
  • To report the performance counter information of the PaddlePaddle resnet50 quantized model, run:
./benchmark_app -m resnet50_vd_ptq/inference.pdmodel -data_shape "[1,3,224,224]" -pc -pcsort sort
Figure 2.4. CPU profiling with resnet50_vd_ptq
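The same per-layer counters can also be collected from the Python API. A sketch, assuming the model path above and the PERF_COUNT configuration key:

    import numpy as np
    from openvino.runtime import Core

    core = Core()
    model = core.read_model("resnet50_vd_ptq/inference.pdmodel")
    # Enable performance counters, the API-level equivalent of "-pc".
    compiled_model = core.compile_model(model, "CPU", {"PERF_COUNT": "YES"})

    request = compiled_model.create_infer_request()
    request.infer({0: np.random.rand(1, 3, 224, 224).astype(np.float32)})

    # Each entry reports the node, the kernel type it dispatched to, and timings.
    for info in request.profiling_info:
        print(info.node_name, info.exec_type, info.real_time)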

By comparing Figure 2.3 and Figure 2.4, we can easily see that the hotspot layers of the PaddlePaddle quantized model are dispatched to integer ISA implementations, which accelerates execution.

Accuracy

We compare the accuracy of the resnet50 floating-point model and the post-training quantization (PaddleSlim PTQ) model. The accuracy of the PaddlePaddle quantized model decreases only slightly, which is expected.

Model               Top-1     Top-5
resnet50_vd_infer   0.7912    0.9445
resnet50_vd_ptq     0.7875    0.94046

Performance

Throughput Speedup

The throughput of the PaddlePaddle quantized resnet50 model improves by more than 3x over the floating-point model.
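Per the Notices below, the throughput data was collected with the throughput hint; a command of the following form (model path assumed as above) reproduces that setup:
./benchmark_app -m resnet50_vd_ptq/inference.pdmodel -data_shape "[1,3,224,224]" -hint throughput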

Figure 3.1. Throughput speedup between the PDPD resnet50 float model and the quantized model

Latency Speedup

The latency of the PaddlePaddle quantized resnet50 model is reduced by about 70%.
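Likewise, the latency data was collected with the latency hint; a command of the following form (model path assumed as above) reproduces that setup:
./benchmark_app -m resnet50_vd_ptq/inference.pdmodel -data_shape "[1,3,224,224]" -hint latency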

Figure 3.2. Latency speedup between the PDPD resnet50 float model and the quantized model

Conclusion

In this article, we described how the PaddlePaddle quantized model is enabled in OpenVINO™ and evaluated its accuracy and performance. With this support, customers can easily improve both the throughput and latency of their deployments.

Notices & Disclaimers

  1. The accuracy data was collected on the 50,000-image validation set of ILSVRC2012.
  2. The throughput performance data was collected by benchmark_app with data_shape "[1,3,224,224]" and the throughput hint.
  3. The latency performance data was collected by benchmark_app with data_shape "[1,3,224,224]" and the latency hint.
  4. The test machine is an Intel® Xeon® Gold 6346 CPU @ 3.10GHz.
  5. The PaddlePaddle quantized model can be obtained from https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/en/quantize.md.