Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Sort By:

Techniques for faster AI inference throughput with OpenVINO on Intel GPUs

February 16, 2023

Authors: Mingyu Kim, Vladimir Paramuzov, Nico Galoppo

Intel’s newest GPUs, such as Intel® Data Center GPU Flex Series, and Intel® Arc™ GPU, introduce a range of new hardware features that benefit AI workloads. Starting with the 2022.3 release, OpenVINO™ can take advantage of two newly introduced hardware features: XMX (Xe Matrix Extension) and parallel stream execution. This article explains what those features are and how you can check whether they are enabled in your environment. We also show how to benefit from them with OpenVINO, and the performance impact of doing so.

What is XMX (Xe Matrix Extension)?

XMX is a hardware acceleration for matrix multiplication on the newest Intel™ GPUs. Given the same number of Xe Cores, XMX technology provides 4-8x more multiplication capacity at the same precision [1]. OpenVINO, powered by OneDNN, can take advantage of XMX hardware by accelerating int8 and fp16 inference. It brings performance gains in compute-intensive deep learning primitives such as convolution and matrix multiplication.

Under the hood, XMX is a well-known hardware architecture called a systolic array. Systolic arrays increase computational capacity without increasing memory (or register) access. The magic happens by pipelining multiple computations with a single data access, as opposed to the traditional fetch-compute-store pipeline. It is implemented by connecting multiple computation nodes in series. Data is fed into the front, goes through several steps of multiplication-add, and finally is stored back to memory.

How to check whether you have XMX?

You can check whether your GPU hardware (and software stack) supports XMX with OpenVINO™’s hello_query_device sample. When you run the sample application, it lists all detected inference devices along with its properties. You can check for XMX support by looking at the OPTIMIZATION_CAPABILITIES property and checking for the GPU_HW_MATMUL value.

In the listing below you can see that our system has two GPU devices for inference, and only GPU.1 has XMX support.

$ ./hello_query_device
[ INFO ] GPU.0
[ INFO ]                Immutable: OPTIMIZATION_CAPABILITIES : FP32 BIN FP16 INT8      
# XMX is not supported
[ INFO ] GPU.1
# XMX is supported

As mentioned, XMX provides a way to get significantly more compute capacity on a GPU. The next feature doesn’t provide more capacity, but it allows ways to use that capacity more efficiently.

What is parallel execution of multiple streams?

Another improvement of Intel®’s discrete GPUs is to process multiple compute streams in parallel. Certain deep learning inference workloads are too small to fill all hardware compute resources of a given GPU. In such a case it is beneficial to run multiple compute streams (or inference requests) in parallel, such that the GPU hardware has more work to process at any given point in time. With parallel execution of multiple streams, Intel GPUs can increase hardware efficiency.

How to check for parallel execution support?

As of the OpenVINO 2022.3 release, there is only an indirect way to query how many streams your GPU can process in parallel. In the next release it will be possible to query the range of streams using the ov::range_for_streams property query and the hello_query_device_sample. Meanwhile, one can use the benchmark_app to report the default number of streams (NUM_STREAMS). If the GPU does not support parallel stream execution, NUM_STREAMS will be 2. If the GPU does support it, NUM_STREAMS will be larger than 2. The benchmark_app log below shows that GPU.1 supports 4-stream parallel execution.

$ ./benchmark_app -d GPU.0 -m resnet-50.xml -t 1 --hint none
[ INFO ]   NUM_STREAMS: 2      # Single-stream execution is supported$ ./benchmark_app -d GPU.1 -m resnet-50.xml -t 1 --hint none
[ INFO ]   NUM_STREAMS: 4      # 4-stream execution is supported

However, it depends on application usage

Parallel stream execution can bring significant performance benefit, but only when used appropriately by the application. It will bring good performance gain if the application can run multiple independent inference requests in parallel, whether from single process or multiple processes. On the other hand, if there is no opportunity for parallel execution of multiple inference requests, then there is no gain to be had from multi-stream hardware execution.

Demonstration of performance tuning through benchmark_app

DISCLAIMER: The performance may vary depending on the system and usage.

OpenVINO benchmark_app is a very handy tool to analyze performance in various conditions. Here we’ll show the performance trend for an Intel® discrete GPU with XMX and four parallel hardware execution streams.

The performance was measured on a pre-production version of the Intel® Arc™ A770 Limited Edition GPU with 16 GiB of memory. The host system is a 12th Gen Intel(R) Core(TM) i9-12900K with 64GiB of RAM (4 DDR4-2667 modules) running Ubuntu OS 20.04.5 LTS with Linux kernel 5.15.47.

Performance comparison with high-level performance hints

Even though all supported devices in OpenVINO™ offer low-level performance settings, utilizing them is not recommended outside of very few cases. The preferred way to configure performance in OpenVINO Runtime is using performance hints. This is a future-proof solution fully compatible with the automatic device selection inference mode and designed with portability in mind.

OpenVINO benchmark_app exposes the high-level performance hints with the performance hint option for easy configuration of best latency and throughput. In short, latency mode picks the optimal configuration for low latency with the cost of low throughput, and throughput mode picks the optimal configuration for high throughput with the cost of high latency.

The table below shows throughput for various combinations of execution configuration for resnet-50.

HTML Table Generator
Network: resnet-50 int8 fp16 fp32
 Latency mode  Latency (ms)  2.07  2.35  4.22
 Throughput (FPS)  472.06  416.81  234.73
 Throughput mode  Latency (ms)  166.23 172.36  469.46 
 Throughput (FPS)  12263.22  5908.54  1077.68

Throughput mode is achieving much higher FPS compared to latency mode because inference happens with higher batch size and parallel stream execution.  You can also see that, in throughput mode, the throughput with fp16 is 5.4x higher than with fp32 due to the use of XMX.

In the experiments below we manually explore different configurations of the performance parameters for demonstration purposes; It is generally not recommended to tune manually. Once the optimal parameters are known, they can be applied in production.

Performance gain from XMX

Performance gain from XMX can be observed by comparing int8/fp16 against fp32 performance because OpenVINO does not provide an option to turn XMX off. Since fp32 computations are not executed by the XMX hardware pipe, but rather by the less efficient fetch-compute-store pipe, you can see that the performance gap between fp32 and fp16 is much larger than the expected factor of two.

We choose a batch size of 64 to demonstrate the best case performance gain. When the batch size is small, the performance difference is not always as prominent since the workload could become too small for the GPU.

$ ./benchmark_app -d GPU.1 -m resnet-50-fp.xml -t 10 --hint none --nstreams 4 -b 64 --infer_precision f32 | grep Throughput
[ INFO ] Throughput:          1076.22 FPS 
$ ./benchmark_app -d GPU.1 -m resnet-50-fp.xml -t 10 --hint none --nstreams 4 -b 64 --infer_precision f16 | grep Throughput
[ INFO ] Throughput:          5915.62 FPS
$ ./benchmark_app -d GPU.1 -m resnet-50-int8.xml -t 10 --hint none --nstreams 4 -b 64 | grep Throughput
[ INFO ] Throughput:          12270.12 FPS

As you can see from the execution log, fp16 runs ~5.49x faster than fp32. Int8 throughput is ~2.07x higher than fp16. The difference between fp16 and fp32 is due to fp16 acceleration from XMX while fp32 is not using XMX. The performance gain of int8 over fp16 is 2.07x because both are accelerated with XMX.

Performance gain from parallel stream execution

You can see from the log below that performance goes up as we have more streams up to 4. It is because the GPU can handle 4 streams in parallel.

$./benchmark_app -d GPU.1 -m resnet-50-int8.xml -t 10 --hint none --nstreams 1 -b 64 | grep Throughput
[ INFO ] Throughput:          8593.92 FPS
$./benchmark_app -d GPU.1 -m resnet-50-int8.xml -t 10 --hint none --nstreams 2 -b 64 | grep Throughput
[ INFO ] Throughput:          10610.98 FPS
$./benchmark_app -d GPU.1 -m resnet-50-int8.xml -t 10 --hint none --nstreams 4 -b 64 | grep Throughput
[ INFO ] Throughput:          12246.29 FPS
$./benchmark_app -d GPU.1 -m resnet-50-int8.xml -t 10 --hint none --nstreams 8 -b 64 | grep Throughput
[ INFO ] Throughput:          12150.30 FPS

Note that if the inference workload is large enough, more streams might not bring much or any performance gain. For example, when increasing the batch size, throughput may saturate earlier than at 4 streams.

How to take advantage the improvements in your application

For XMX, all you need to do is run your int8 or fp16 model with the OpenVINO™ Runtime version 2022.3 or above. If the model is fp32(single precision), it will not be accelerated by XMX. To quantize a model and create an OpenVINO int8 IR, please refer to Quantizing Models Post-training. To create an OpenVINO fp16 IR from a fp32 floating-point model, please refer to Compressing a Model to FP16 page.

For parallel stream execution, you can set throughput hint as described in Optimizing for Throughput. It will automatically set the number of parallel streams with best number.


In this article, we introduced two key features of Intel®’s discrete GPUs: XMX and parallel stream execution. Most int8/fp16 deep learning networks can benefit from the XMX engine with no additional configuration. When properly configured by the application, parallel stream execution can bring significant performance gains too!

[1] In the Xe-HPG architecture, the XMX delivers 256 INT8 ops per clock (DPAS), while the (non-systolic) Xe Core vector engine delivers 64 INT8 ops per clock – a 4x throughput increase [reference]. In the Xe-HPC architecture, the XMX systolic array depth has been increased to 8 and delivers 4096 FP16 ops per clock, while the (non-systolic) Xe Core vector engine delivers 512 FP16 ops per clock – a 8x throughput increase [reference].

Notices & Disclaimers

​Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex​​.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available ​updates.  See backup for configuration details.  No product or component can be absolutely secure.​​

​​​​See backup for configuration details.  For more complete information about performance and benchmark results, visit www.intel.com/benchmarks

© Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others​.​​



Remote Tensor API Sample

January 28, 2023

This AI pipeline implements zero-copy between SYCL and OpenVINO through the Remote Tensor API of the GPU Plugin.

  1. Introduction

The development of SYCL simplifies the use of OpenCL, which can fully exploit the computing power of GPU in the pipeline. Meanwhile, SYCL has more flexibility to do customized pre- and post-processing of OpenVINO. To further optimize the pipeline, developers can use GPU Plugin to avoid the memory copy overhead between SYCL and OpenVINO. The GPU plugin provides the ov::RemoteContext and ov::RemoteTensor interfaces for video memory sharing and interoperability with existing native APIs, such as OpenCL, Microsoft DirectX, or VAAPI. For details, please refer to the online documentation of OpenVINO.

Based on the pseudocode of the online documentation, here we provide a simple pipeline sample with Remote Tensor API. Because in the rapid iteration of oneAPI, sometimes customers need quick verification so that this sample can be used for testing. OneAPI also provides a real-world, end-to-end example, which optimizes PointPillars for lidar object detection.

  1. Components

SYCL preprocessing is based on the Sepia Filter sample, which demonstrates how to convert a color image to a Sepia tone image, a monochromatic image with a distinctive Brown Gray color. The sample program works by offloading the compute-intensive conversion of each pixel to Sepia tone using SYCL*-compliant code for CPU and GPU.

OpenVINO inferencing is based on the OpenVINO classification sample, the input from SYCL filtered image in the device will be sent into OpenVINO as a remote tensor without a memory copy.

Remote Tensor API: Create RemoteContext from SYCL pre-processing’s native handle. After model compiling, do memory sharing between the application and GPU plugin with from cl::Buffer to remote tensor.

auto cl_queue = get_native<backend::opencl>(sycl_queue);
auto remote_context = ov::intel_gpu::ocl::ClContext(core, cl_queue);
ov::CompiledModel compiled_model = core.compile_model(model, remote_context);
auto cl_buffers = get_native<backend::opencl>(image_buf_out);
auto remote_tensor = remote_context.create_tensor(ov::element::u8, {batch, input_height, input_width, 3}, cl_buffers);
infer_request.set_tensor(input_tensor_name, remote_tensor);
  1. Build Sample on Linux

Download the source code from sandbox. Prepare the model and images.

To run the sample, you need to specify a model and image:

Use pre-trained models from the Open Model Zoo. The models can be downloaded using the Model Downloader. Use images from the media files collection.

 source setupvars.sh
mkdir build
cd build
cmake ..

Run on Intel NUC Core 11 iGPU with OpenVINO 2022.2 and oneAPI 2022.3.

./intel64/hello_nv12_input_classification_oneAPI../model/FP32/alexnet.xml ../image/dog512.bmp GPU 2

Sample Output:

  Loaded image with a width of 512, a height of 512 and 3 channels
[ INFO ] OpenVINO Runtime version ......... 2022.2.0
[ INFO ] Build ........... 2022.2.0-7713-af16ea1d79a-releases/2022/2
[ INFO ] 
[ INFO ] Loading model files: ../model/FP32/alexnet.xml
Running on Intel(R) Iris(R) Xe Graphics [0x9a49]
---Load model - 1126ms
---Create an infer request - 0.515ms
Use remote tensor API and set_tensor
No 1. do inf: 
---sycl buffer  - 0.005ms
---sycl filter total time - 93.342ms
---kernel time: 0.051822 milliseconds
---Set tensor - 0.138ms
---Run infer req - 4.725ms
---get tensor - 0.062ms

Top 3 results:

Image ../image/dog512.bmp

classid probability label
------- ----------- -----
176     0.4579958   "Saluki, gazelle hound"
212     0.2727944   "English setter"
169     0.0513433   "borzoi, Russian wolfhound"

---sum of inference - 98.268ms
No 2. do inf: 
---sycl buffer  - 0.001ms
---sycl filter total time - 0.582ms
---kernel time: 0.05177 milliseconds
---Set tensor - 0.086ms
---Run infer req - 4.162ms
---get tensor - 0.049ms

Top 3 results:

Image ../image/dog512.bmp

classid probability label
------- ----------- -----
176     0.4579958   "Saluki, gazelle hound"
212     0.2727944   "English setter"
169     0.0513433   "borzoi, Russian wolfhound"

---sum of inference - 4.881ms 

Warning: With the updating of OpenVINO and oneAPI, different versions may cause problems with the tools in the common directory or the new SYCL header name. Please use the same version or debug following the corresponding release instructions.


Intel® DL Streamer Optimize Media-AI pipeline on Intel® Data Center Flex dGPU by Docker

December 14, 2022


In this blog is about How to use DL-streamer to build a complete Media-AI pipeline (Including: Video Access, Media Decode, AI Inference, Media Encode and Result Export). And the pipeline will be accelerated by OpenVINO™ and optimize to run on Flex dGPU(Intel® Data Center Flex dGPU)


- DL-streamer
Intel® Deep Learning Streamer (Intel® DL Streamer)Pipeline Framework is an easy way to construct media analytics pipelines using Intel® Distribution of OpenVINO™ Toolkit. It leverages the open source media framework GStreamer to provide optimized media operations and Deep Learning Inference Engine from OpenVINO™ Toolkit to provide optimized inference.

- OpenVINO
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference which can boost deep learning performance in computer vision, automatic speech recognition, natural language processing and other common task.

- Docker (Optional)
Docker is an open-source platform that enables developers to build, deploy, run, update, and manage containers—standardized, executable components that combine application source code with the operating system (OS) libraries and dependencies required to run that code in any environment.

Install DL-Streamer and OpenVINO™ via Docker

Images for Intel® Data Center GPU Flex Series

Images 2022.2.0-ubuntu20-gpu419.40* are intended for Intel® Data Center GPU Flex Series and include

1.     Intel®DL Streamer 2022.2-release

2.    OpenVINO™ Toolkit 2022.2.0

3.    Drivers for Intel® Data Center GPU Flex Series, drivers version 419.40

Four images are listed below, images -devel additionally contain samples and development files, images with -dpcpp additionally contain Intel® oneAPI DPC++/C++ Compiler

Runtime image that includes GStreamer* Pipeline Framework elements

docker pull intel/dlstreamer:2022.2.0-ubuntu20-gpu419.40

Developer image that builds on runtime image containing samples,development files and a model downloader.

docker pull intel/dlstreamer:2022.2.0-ubuntu20-gpu419.40-devel

Runtime image including elements built with Intel® oneAPI DPC++/C++ Compiler

docker pull intel/dlstreamer:2022.2.0-ubuntu20-gpu419.40-dpcpp

Developer image for elements built with Intel® oneAPI DPC++/C++ Compiler

docker pull intel/dlstreamer:2022.2.0-ubuntu20-gpu419.40-dpcpp-devel

Taking “2022.2.0-ubuntu20-gpu419.40” docker images as a sample to show how to pull the docker image from docker hub.

docker pull intel/dlstreamer:2022.2.0-ubuntu20-gpu419.40
Flag 1. docker pull images from docker hub

DL-Streamer Media-AI pipeline quick start example

Make sure the pre-requirement had already installed, there is a very basic introduction to using object detection models(yolov4) to build a DL-streamer pipeline.

Step 1.Download video and yolov4-tf model file

Download video, Enter the following link into your browser to download the related files

Download yolov4-tf model

git clone https://github.com/dlstreamer/pipeline-zoo-models.git

Step 2.Enter Docker and copy the files into docker container

Create andenter the docker container

docker run -it --device /dev/dri/ --user root --rm intel/dlstreamer:2022.2.0-ubuntu20-gpu419.40

Open another terminal for file copy into container ,copy video and model into docker container

docker cp pexels-george-morina-5325136.mp4 <Docker CONTAINER ID>:/home/dlstreamer
docker cp pipeline-zoo-models/storage/yolo-v4-tf_INT8 <Docker CONTAINER ID>:/home/dlstreamer
Figure 2. Copy video and IR model into docker

Step 3. Run an objectdetection Media-AI pipeline

By the following script, we can run pipeline the Media-AI objection detection on the Flex dGPU in the docker container.

gst-launch-1.0 filesrc location=/path/to/pexels-george-morina-5325136.mp4 ! decodebin ! vaapipostproc ! gvadetect model=/path/to/yolo-v4-tf_INT8/yolo-v4-tf_INT8.xml model_proc=/path/to/yolo-v4-tf_INT8/yolo-v4-tf_INT8.json device=GPU batch-size=32 pre-process-backend=vaapi-surface-sharing ! queue ! gvatrack tracking-type=short-term-imageless ! gvafpscounter ! fakesink sync=false
Figure 3. DL-streamer run pipeline on the dGPU

If want to encode the detection result and save as video file, can use the follow script

gst-launch-1.0 filesrc location=/path/to/pexels-george-morina-5325136.mp4 ! decodebin ! vaapipostproc ! gvadetect model=/path/to/yolo-v4-tf_INT8/yolo-v4-tf_INT8.xml  model_proc=/path/to/yolo-v4-tf_INT8/yolo-v4-tf_INT8.json device=GPU batch-size=32 pre-process-backend=vaapi-surface-sharing ! queue ! gvatrack tracking-type=short-term-imageless !  gvafpscounter ! vaapipostproc ! vaapih265enc rate-control=cbr bitrate=4096 ! filesinklocation=./encoded_video_track.265 sync=false

The encoded video file will save in the container and can be copied out in new terminal.

docker cp <Docker CONTAINER ID>:/home/dlstreamer encoded_video_track.265 .
Figure 4. DL-streamer yolov4 pipeline result

PS. Instruction aboutDL-streamer CLI parameter

decodebin: Auto-magically constructs a decoding pipeline using available decoders and demuxers via auto-plugging.

vaapipostproc: Consists in various post processing algorithms to be applied to VA surfaces. For e.g. scaling, deinterlacing (bob, motion-adaptive, motion-compensated), noise reduction or sharpening.

gvadetect: Performs object detection on a full-frame or region of interest (ROI) using object detection models such as YOLO v3-v5, MobileNet-SSD, Faster-RCNN etc. Outputs the ROI for detected objects.

gvatrack: Performs object tracking using zero-term, zero-term-imageless, or short-term-imageless tracking algorithms. Zero-term tracking assigns unique object IDs and requires object detection to run on every frame. Short-term tracking allows to track objects between frames, there by reducing the need to run object detection on each frame. Imageless tracking forms object associations based on the movement and shape of objects, and it does not use image data.

gvafpscounter: Measures frames per second across multiple streams in a single process.

Tuning Tips

Users can refer the different platform using case which were supported by OpenVINO™ and the device profiling API to realize performance tuning of your inference program between CPU, iGPU, dGPU. It will also be helpful to developer finding out the place where has the potential space of performance improvement.