CPU Dispatcher Control for OpenVINO™ Inference Runtime Execution

Introduction

CPU plugin of OpenVINO™ toolkit as one of the most important part, which is powered by oneAPI Deep Neural Network Library (oneDNN) can help user achieve high performance inference of neural networks on Intel®x86-64 CPUs. The CPU plugin detects the Instruction Set Architecture (ISA) in the runtime and uses Just-in-Time (JIT) code generation to deploy the implementation optimized for the latest supported ISA.

In this blog, you will learn how layer primitives been optimized by implementation of ISA extensions and how to change the ISA extensions’ optimized kernel function at runtime for performance tuning and debugging.

After reading this blog, you will start to be proficient in AI workloads performance tuning and OpenVINO™ profiling on Intel® CPU architecture.

CPU Profiling

OpenVINO™ provide Application Program Interface (API) which is easy to turn on CPU profiling and analyze performance of each layer from the bottom level by executed kernel function. Firstly, enable performance counter profiling with executed device during device property configuration before model compiling with device. Learn detailed information from document of OpenVINO™ Configuring Devices.

core.set_property("CPU", ov::enable_profiling(true));

Then, you are allowed to get object of profiling info from inference requests which complied with the CPU device plugin.

auto perfCounts = infer_request.get_profiling_info();

Please note that performance profiling information generally can get after model inference. Refer below code implementation and add this part after model inference. You are possible to get status and performance of layer execution. Follow below code implement, you will get performance counter printing in order of the execution time from largest to smallest.


bool sort_pc_descend(const ov::ProfilingInfo& profiling1, const ov::ProfilingInfo& profiling2) {
    return profiling1.real_time > profiling2.real_time;
}
...
int tmain(int argc, tchar* argv[]) {
	...
	//objects init
	int layersize = 0;
	bool sort_pc_descend = 1;
	std::chrono::microseconds total = std::chrono::microseconds::zero();
	std::chrono::microseconds total_cpu = std::chrono::microseconds::zero();
	static const char* status_names[] = {"NOT_RUN", "OPTIMIZED_OUT", "EXECUTED"};

	//print row of title
	std::cout << "layerName\t"
	    << "execStatus\t"
	    << "layerType:"
	    << "execType\t";
	std::cout << "realTime (ms)\t"
	    << "cpuTime (ms)\t"
	    << " proportion(%)" << std::endl;

	//calculate executed layers total latency
	for (const auto& layer : perfCounts) {
	    if (std::string(status_names[(int)layer.status]).compare("EXECUTED") == 0) {
	        total += layer.real_time;
	        total_cpu += layer.cpu_time;
	    }
	}

	//print executed layer name, status, execution kernel funtion, CPU execution time and percentage of total model latency
	std::sort(perfCounts.begin(), perfCounts.end(), sort_pc_descend);
	for (const auto& layer : perfCounts) {
	    if (std::string(status_names[(int)layer.status]).compare("EXECUTED") == 0) {
	        std::cout << layer.node_name << "\t";
	        std::cout << ((int)layer.status < (sizeof(status_names) / sizeof(status_names[0]))
	                    ? status_names[(int)layer.status]
	                    : "INVALID_STATUS") << "\t";
	        std::cout << layer.node_type << ":" << layer.exec_type << "\t";
	        std::cout << std::to_string(layer.real_time.count() / 1000.0) << "\t"
	            << std::to_string(layer.cpu_time.count() / 1000.0) << "\t";
	        std::cout << (layer.real_time * 1.0 / total) * 100 << std::endl;
	        layersize += 1;
	    }
	}
}

CPU Dispatching

By enabling device profiling and printing exec_type of layers, you will get the specific kernel functions which powered by oneDNN during runtime execution. Use TensorFlow* ResNet 50 INT8 model for execution and pick the first 10 hotspot layers on 4th Gen Intel® Xeon Scalable processor (code named Sapphire Rapids) as an example:

Figure 1. OpenVINO™ CPU profiling with Intel® AMX on Sapphire Rapids

From execution type of layers, it would be helpful to check which oneDNN kernel function used, and the actual precision of layer execution and the optimization from supported ISA on this platform.

Normally, oneDNN is able to detect to certain ISA, and OpenVINO™ allow to use latest ISA with higher priority. If you want to compare optimization rate between different ISA, can use the ONEDNN_MAX_CPU_ISA environment variable to limit processor features with older instruction sets. Follow this link to check oneDNN supported ISA.

Please note, Intel® Advanced Matrix Extensions (Intel® AMX) ISA start to be supported since 4th Gen Intel® Xeon Scalable processor. You can refer Intel® Product Specifications to check the supported instruction set of your current platform.

The ISAs are partially ordered:

·      SSE41 < AVX < AVX2 < AVX2_VNNI <AVX2_VNNI_2,

·      AVX2 < AVX512_CORE < AVX512_CORE_VNNI< AVX512_CORE_BF16 < AVX512_CORE_FP16 < AVX512_CORE_AMX <AVX512_CORE_AMX_FP16,

·      AVX2_VNNI < AVX512_CORE_FP16.

To use CPU dispatcher control, just set the value of ONEDNN_MAX_CPU_ISA environment variable before executable program which contains the OpenVINO™ device profiling printing, you can use benchmark_app as an example:

ONEDNN_MAX_CPU_ISA=AVX512_CORE_VNNI benchmark_app -m ~/models/public/resnet-50-tf/FP32-INT8/resnet-50-tf.xml -d CPU -pcsort sort

The benchmark_app provides the option which named “-pcsort” can report performance counters and order analysis information by order of layers execution time when set value of the option by “sort”.

In this case, we use above code implementation can achieve similar functionality of benchmark_app “-pcsort” option. User can consider try to add the code implementation into your own OpenVINO™ program like below:

ONEDNN_MAX_CPU_ISA=AVX512_CORE_VNNI ./classification_profiling ~/models/public/resnet-50-tf/FP32-INT8/resnet-50-tf.xml ./sample_640×426.bmp CPU

After setting the CPU dispatcher, the kernel execution function has been switched from AVX512_CORE_AMX to AVX512_CORE_VNNI. Then, the performance counters information would be like below:

Figure 2. CPU profiling with AVX512_CORE_VNNI

You can easily find the hotspot layers of the same model would be changed when executed by difference kernel function which optimized by implementation of different ISA extensions. That is also the optimization differences between architecture platforms.

Tuning Tips

Users can refer the CPU dispatcher control and OpenVINO™ device profiling API to realize performance tuning of your inference program between CPU architectures. It will also be helpful to developer finding out the place where has the potential space of performance improvement.

For example, the hotspot layer generally should be compute-intensive operations like matrix-matrix multiplication; General vector operations which is not target to artificial intelligence (AI) / machine learning (ML) workloads cannot be optimized by Intel® AMX and Intel® Deep Learning Boost (Intel® DL Boost), and the memory accessing operations, like Transpose which maybe cannot parallelly optimized with instruction sets. If your inference model remains large memory accessing operations rather than compute-intensive operations, you probably need to be focusing on RAM bandwidth optimization.