OpenVINO and oneDNN
OpenVINO™ is a framework designed to accelerate deep-learning models from DL frameworks such as TensorFlow or PyTorch. With OpenVINO, developers can deploy inference applications directly, without reconstructing the model through low-level APIs. It consists of various components, and for running inference on a GPU, a key component is the set of highly optimized deep-learning kernels, such as convolution, pooling, and matrix multiplication.
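For illustration, here is a minimal sketch of running inference on a GPU with the OpenVINO Python API. The model path and input shape are placeholders, and `import openvino as ov` assumes a recent OpenVINO release:

```python
import numpy as np
import openvino as ov  # top-level package name in recent OpenVINO releases

core = ov.Core()

# Read a model in OpenVINO IR format (placeholder path).
model = core.read_model("resnet-50.xml")

# Compile the model for the GPU device; kernel selection happens internally.
compiled_model = core.compile_model(model, "GPU")

# Run inference on dummy data shaped like the model input (placeholder shape).
input_tensor = np.random.rand(1, 3, 224, 224).astype(np.float32)
results = compiled_model([input_tensor])
print(results[compiled_model.output(0)].shape)
```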
On the other hand, Intel® oneAPI Deep Neural Network Library (oneDNN) is a library that provides basic deep-learning building blocks, mainly kernels. It differs from OpenVINO in that oneDNN provides APIs for running deep-learning primitives such as convolution, but not for running whole deep-learning models such as ResNet-50.
OpenVINO uses oneDNN GPU kernels on discrete GPUs, in addition to its own GPU kernels, in order to accelerate compute-intensive workloads as much as possible. While OpenVINO already includes highly optimized and mature deep-learning kernels for integrated GPUs, discrete GPUs add a new hardware block called a systolic array, which accelerates compute-intensive kernels. oneDNN provides kernels that take advantage of the systolic array.
If you want to learn more about the systolic array and the advancements in discrete GPUs, please refer to this article.
How does oneDNN accelerate DL workloads for OpenVINO?
When you load a deep-learning model in OpenVINO, it goes through multiple stages collectively called graph compilation. The purpose of graph compilation is to create the "execution plan" for the model on the target hardware.
During graph compilation, the OpenVINO GPU plugin checks whether the target hardware has a systolic array. If it does (which means you have a discrete GPU such as the Arc, Flex, or GPU Max series), OpenVINO compiles the model so that compute-intensive layers are processed by oneDNN kernels.
OpenVINO kernels and oneDNN kernels use a single OpenCL context and shared buffers, eliminating the overhead of buffer copying. For example, a oneDNN kernel computes a layer and fills a buffer, which can then be read directly by OpenVINO kernels because both run in the same OpenCL context.
You may wonder why only some of the layers are processed by oneDNN while others are still processed by OpenVINO kernels. This is because of the variety of kernels that models require: oneDNN includes only certain key kernels for deep learning, while OpenVINO contains many kernels to cover a wide range of models.
oneDNN is statically linked into the OpenVINO GPU plugin, which is why you cannot find a separate oneDNN library in the released OpenVINO binaries; the dynamic library of the OpenVINO GPU plugin already includes oneDNN.
The GPU plugin and the CPU plugin have separate copies of oneDNN. To reduce the compiled binary size, the OpenVINO GPU plugin contains only the GPU kernels of oneDNN, and the OpenVINO CPU plugin contains only the CPU kernels of oneDNN.
Hands-on Tips and FAQs
What should an application developer do to take advantage of oneDNN?
If the hardware has a systolic array and the model has layers that can be accelerated by oneDNN, those layers will be accelerated automatically, without any action required from application developers.
How can I determine whether oneDNN kernels are being used or not?
You can check the oneDNN verbose log or the executed kernel names.
Set `ONEDNN_VERBOSE=1` to enable the oneDNN verbose log. Each execution of a oneDNN kernel prints one line, so if you see many such lines, oneDNN kernels are being executed properly. If no layer is executed through oneDNN, you will not see any of these log lines.
$ ONEDNN_VERBOSE=1 ./benchmark_app -m resnet-50.xml -d GPU --niter 1
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
...
onednn_verbose,exec,gpu,convolution,jit:ir,forward_inference,src_s8::blocked:abcd:f0 wei_s8:p:blocked:AcdB8a4b:f0 bia_f32::blocked:a:f0 dst_u8::blocked:aBcd32b:f0,attr-post-ops:binary_mul:f32:2 ,alg:convolution_direct,mb1_ic3oc64_ih224oh112kh7sh2dh0ph3_iw224ow112kw7sw2dw0pw3,0.319092
onednn_verbose,exec,gpu,pooling,ocl:gen9,forward_inference,src_u8::blocked:aBcd32b:f0 dst_u8::blocked:aBcd32b:f0 ws_undef::undef::,,alg:pooling_max,mb1ic64_ih112oh56kh3sh2dh0ph0_iw112ow56kw3sw2dw0pw0,0.0788574
onednn_verbose,exec,gpu,convolution,jit:ir,forward_inference,src_u8::blocked:aBcd32b:f0 wei_s8::blocked:ABcd8b8a4b:f0 bia_f32::blocked:a:f0 dst_u8::blocked:aBcd32b:f0,attr-post-ops:binary_mul:f32:2 ,alg:convolution_direct,mb1_ic64oc64_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,0.199951
onednn_verbose,exec,gpu,convolution,jit:ir,forward_inference,src_u8::blocked:aBcd32b:f0 wei_s8::blocked:ABcd8b8a4b:f0 bia_f32::blocked:a:f0 dst_u8::blocked:aBcd32b:f0,attr-post-ops:binary_mul:f32:2 ,alg:convolution_direct,mb1_ic64oc64_ih56oh56kh3sh1dh0ph1_iw56ow56kw3sw1dw0pw1,0.111084
onednn_verbose,exec,gpu,convolution,jit:ir,forward_inference,src_u8::blocked:aBcd32b:f0 wei_s8::blocked:ABcd8b8a4b:f0 bia_f32::blocked:a:f0 dst_s8::blocked:aBcd32b:f0,attr-post-ops:binary_mul:f32:2+binary_add:f32:2 ,alg:convolution_direct,mb1_ic64oc256_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,0.0688477
onednn_verbose,exec,gpu,convolution,jit:ir,forward_inference,src_u8::blocked:aBcd32b:f0 wei_s8::blocked:ABcd8b8a4b:f0 bia_f32::blocked:a:f0 dst_u8::blocked:aBcd32b:f0,attr-post-ops:binary_mul:f32:2+binary_add:f32:2+eltwise_round+eltwise_linear:1.77854:-227.654+eltwise_clip:-227.654:225.875+sum:1:0:s8+eltwise_linear:1.59738 ,alg:convolution_direct,mb1_ic64oc256_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,0.0771484
onednn_verbose,exec,gpu,convolution,jit:ir,forward_inference,src_u8::blocked:aBcd32b:f0 wei_s8::blocked:ABcd8b8a4b:f0 bia_f32::blocked:a:f0 dst_u8::blocked:aBcd32b:f0,attr-post-ops:binary_mul:f32:2 ,alg:convolution_direct,mb1_ic256oc64_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,0.0678711
onednn_verbose,exec,gpu,convolution,jit:ir,forward_inference,src_u8::blocked:aBcd32b:f0 wei_s8::blocked:ABcd8b8a4b:f0 bia_f32::blocked:a:f0 dst_u8::blocked:aBcd32b:f0,attr-post-ops:binary_mul:f32:2 ,alg:convolution_direct,mb1_ic64oc64_ih56oh56kh3sh1dh0ph1_iw56ow56kw3sw1dw0pw1,0.108154
...
Alternatively, you can check the kernel names with the performance-counter option of benchmark_app (`--pc`).
oneDNN layers include a colon in the `execType` field, as shown below. In this example, the convolutions are handled by oneDNN `jit:ir` kernels. MaxPool is also handled by a oneDNN kernel, but one implemented in OpenCL (so in that case the systolic array is not used). A short sketch showing how to read the same counters programmatically follows the sample output.
$ ./benchmark_app -m resnet-50.xml -d GPU --niter 1 -nstreams 1 --nireq 1 --hint none --pc | grep -v OPTIMIZED_OUT
[Step 1/11] Parsing and validating input arguments
...
input EXECUTED layerType: Parameter execType: wait_for_events__u8 realTime (ms): 0.001 cpuTime (ms): 0.000
resnet_v1_50/po... EXECUTED layerType: MaxPool execType: ocl:gen9__u8 realTime (ms): 0.114 cpuTime (ms): 0.000
resnet_v1_50/bl... EXECUTED layerType: MaxPool execType: ocl:gen9__u8 realTime (ms): 0.070 cpuTime (ms): 0.000
resnet_v1_50/bl... EXECUTED layerType: MaxPool execType: ocl:gen9__u8 realTime (ms): 0.065 cpuTime (ms): 0.000
resnet_v1_50/bl... EXECUTED layerType: MaxPool execType: ocl:ref__u8 realTime (ms): 0.061 cpuTime (ms): 0.000
resnet_v1_50/pool5 EXECUTED layerType: ReduceMean execType: ocl:combined__u8 realTime (ms): 0.077 cpuTime (ms): 0.000
resnet_v1_50/Sp... EXECUTED layerType: Result execType: reorder_data_fast_b1__f32 realTime (ms): 0.014 cpuTime (ms): 0.003
resnet_v1_50/co... EXECUTED layerType: FakeQuantize execType: quantize_gpu_scale_shift_opt__i8 realTime (ms): 0.042 cpuTime (ms): 0.017
resnet_v1_50/co... EXECUTED layerType: Convolution execType: jit:ir__i8 realTime (ms): 0.524 cpuTime (ms): 0.000
resnet_v1_50/bl... EXECUTED layerType: Convolution execType: jit:ir__u8 realTime (ms): 0.129 cpuTime (ms): 0.000
resnet_v1_50/bl... EXECUTED layerType: Convolution execType: jit:ir__u8 realTime (ms): 0.123 cpuTime (ms): 0.000
...
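The same counters are also available programmatically through the OpenVINO runtime API. Below is a minimal sketch (the model path and input shape are placeholders, and the `PERF_COUNT` property key is assumed; recent releases also expose an `enable_profiling` property) that enables performance counters and prints the `exec_type` of each executed layer:

```python
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("resnet-50.xml")

# Enable performance counters so that per-layer profiling data is collected.
compiled_model = core.compile_model(model, "GPU", {"PERF_COUNT": "YES"})

infer_request = compiled_model.create_infer_request()
input_tensor = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_request.infer([input_tensor])

# exec_type values containing a colon (e.g. "jit:ir") indicate oneDNN kernels.
for info in infer_request.profiling_info:
    print(info.node_type, info.exec_type, info.real_time)
```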
Can we run networks without oneDNN on a discrete GPU?
It is not supported out of the box, and it is not recommended, because the systolic array will not be used and performance will be significantly lower.
If you still want to try running without oneDNN, you can follow this documentation and use `OV_GPU_DisableOnednn`.
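As a rough sketch only (this assumes, per the referenced documentation, that `OV_GPU_DisableOnednn` is read as an environment variable and that your GPU plugin build has debug capabilities enabled), it could look like this:

```python
import os

# Must be set before the GPU plugin is loaded; requires a debug-capable build.
os.environ["OV_GPU_DisableOnednn"] = "1"

import openvino as ov

core = ov.Core()
compiled_model = core.compile_model(core.read_model("resnet-50.xml"), "GPU")
```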
How to know whether my GPU will be accelerated with oneDNN (i.e., whether it has a systolic array)?
You can use the hello_query_device OpenVINO sample app to check whether `OPTIMIZATION_CAPABILITIES` includes `GPU_HW_MATMUL`, or query the property programmatically as shown after the sample output below.
$ ./hello_query_device
[ INFO ] Available devices:
[ INFO ] GPU
[ INFO ] SUPPORTED_PROPERTIES:
...
[ INFO ] Immutable: FULL_DEVICE_NAME : Intel(R) Arc(TM) A770 Graphics (dGPU)
...
[ INFO ] Immutable: OPTIMIZATION_CAPABILITIES : FP32 BIN FP16 INT8 GPU_HW_MATMUL EXPORT_IMPORT
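The same property can also be queried programmatically. A minimal sketch with the OpenVINO Python API:

```python
import openvino as ov

core = ov.Core()

# List the optimization capabilities reported by the GPU plugin.
capabilities = core.get_property("GPU", "OPTIMIZATION_CAPABILITIES")
print(capabilities)

# GPUs with a systolic array report GPU_HW_MATMUL.
if "GPU_HW_MATMUL" in capabilities:
    print("This GPU is expected to use oneDNN kernels.")
```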
How to check the version of oneDNN?
You can set `ONEDNN_VERBOSE=1` to see the verbose log. In the example below, the oneDNN version is v3.1 (oneDNN 3.1 was used for the OpenVINO 2023.0 release).
Please note that the version is printed only when oneDNN is actually used on the target hardware. If the model is not accelerated through oneDNN, the oneDNN version will not be shown.
$ ONEDNN_VERBOSE=1 ./benchmark_app -m resnet-50.xml -d GPU --niter 1
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
...
[Step 7/11] Loading the model to the device
onednn_verbose,info,oneDNN v3.1.0 (commit f27dedbfc093f51032a4580198bb80579440dc15)
onednn_verbose,info,gpu,runtime:OpenCL
onednn_verbose,info,gpu,engine,0,name:Intel(R) Arc(TM) A770 Graphics,driver_version:23.17.26241,binary_kernels:enabled
Is it possible to try a different oneDNN version?
Since oneDNN is statically linked, you cannot swap in a different oneDNN version for a given OpenVINO release. It is also not recommended to build OpenVINO with a different oneDNN version than the one it was originally built with, because we do not guarantee that it will work properly.
How to profile oneDNN execution time?
Profiling is integrated into OpenVINO, so you can use OpenVINO profiling features such as the `--pc` and `--pcsort` options of benchmark_app. However, this adds some overhead for oneDNN and may report execution times higher than the actual ones, especially for small layers. A more reliable method is to use DevicePerformanceTiming with the opencl-intercept-layer.