OpenVINO Extension Operation with a SYCL Program on CPU

In this blog, we will introduce how OpenVINO supports extensibility on the CPU platform, with a sample that creates a custom operation implemented as a SYCL program on CPU. oneAPI has two programming modes: one is direct programming in SYCL, a C++-based language; the other is based on acceleration libraries. In this sample we use the oneAPI DPC++ compiler to compile the SYCL program in the custom extension library, so that users familiar with SYCL optimization can follow the OpenVINO extension mechanism to support and optimize their own operation kernels.


First of all, you should understand the interface and invocation scheduling of extension operations through the OpenVINO Core API. OpenVINO supports creating a custom operation that inherits from ov::op::Op and implements the member function “evaluate()”, in our case with a SYCL implementation. Then, register this custom operation via “ov::OpExtension” to generate a runtime library of OpenVINO extensions. Finally, the custom extension library can be loaded with the “add_extension()” function of the Core API at runtime.
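To make this concrete, below is a minimal sketch of how an application loads the extension library at runtime; the file names assume the library and IR model produced in the later steps of this blog.


#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Load the extension library first, so the custom "SYCL_Add"
    // operation can be resolved when the model is read.
    core.add_extension("libcustom.so");
    auto model = core.read_model("SYCL_add.xml");
    auto compiled_model = core.compile_model(model, "CPU");
    return 0;
}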


The next step is to create an IR model with this extension operation. We will introduce a method to create an OV model using the standard OpenVINO opset and then modify the layer version to “extension”, making sure the Core API invokes the operation registered in the extension library.

System requirements

Please make sure you have correctly installed the OpenVINO C++ package from:

https://storage.openvinotoolkit.org/repositories/openvino/packages/

Then set up the environment variables for OpenVINO:


source ./l_openvino_toolkit_ubuntu22_2024.0.0.14488.5e7e51dc778_x86_64/setupvars.sh

Then, install the DPC++ compiler and source its environment variables:


source /opt/intel/oneapi/setvars.sh
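You can quickly verify that the compiler is available on your path:


icpx --version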

In this blog, we create a custom “SYCL_Add” operation; the folder and file structure is as follows:


.
|-add
 | |-add.cpp
 | |-add.hpp
 |-CMakeLists.txt
 |-ov_extension.cpp

Step 1: Create the custom operation with a SYCL kernel.

For example, we create a custom operation that realizes the functionality of “Add” and name it “SYCL_Add”. We declare this operation in the header “add.hpp”:


#pragma once

//! [op:common_include]
#include <openvino/op/op.hpp>
#include <vector>
//! [op:common_include]

//! [op:header]
namespace TemplateExtension {

class Add : public ov::op::Op {
public:
    OPENVINO_OP("SYCL_Add");

    Add() = default;
    Add(const ov::Output<ov::Node>& A, const ov::Output<ov::Node>& B);
    void validate_and_infer_types() override;
    std::shared_ptr<ov::Node> clone_with_new_inputs(const ov::OutputVector& new_args) const override;
    bool visit_attributes(ov::AttributeVisitor& visitor) override;

    bool evaluate(ov::TensorVector& outputs, const ov::TensorVector& inputs) const override;
    bool has_evaluate() const override;


private:
};
//! [op:header]

}  // namespace TemplateExtension

Then, we need to override the member functions of this new operation, especially the implementation of “evaluate()”. In this blog, we show an example of a SYCL kernel. To enable SYCL programming on CPU, you are required to install the DPC++ compiler and include the header <sycl/sycl.hpp>. Below is the code implementation of “add.cpp”:


// Copyright (C) 2018-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include "add.hpp"
#include <sycl/sycl.hpp>

using namespace TemplateExtension;
using namespace sycl;

//! [op:ctor]
Add::Add(const ov::Output<ov::Node>& A, const ov::Output<ov::Node>& B): Op(ov::OutputVector{A,B}){
    constructor_validate_and_infer_types();
}
//! [op:ctor]

//! [op:validate]
void Add::validate_and_infer_types() {
    // The output shape follows input 0; the element type is fixed to i32
    // to match the IR model created in Step 3.
    auto outShape = get_input_partial_shape(0);
    set_output_type(0, ov::element::Type_t::i32, outShape);
}
//! [op:validate]

//! [op:copy]
std::shared_ptr<ov::Node> Add::clone_with_new_inputs(const ov::OutputVector& new_args) const {
    OPENVINO_ASSERT(new_args.size() == 2, "Incorrect number of new arguments");
    return std::make_shared<Add>(new_args.at(0), new_args.at(1));
}
//! [op:copy]

//! [op:visit_attributes]
bool Add::visit_attributes(ov::AttributeVisitor& visitor) {
    return true;
}
//! [op:visit_attributes]

void add_vectors(sycl::queue& queue, sycl::buffer<int32_t>& a, sycl::buffer<int32_t>& b, sycl::buffer<int32_t>& c, int N) {
   queue.submit([&](sycl::handler& cgh) {
      // Accessors declare how the kernel reads and writes each buffer.
      auto in_a_accessor = a.get_access<sycl::access::mode::read>(cgh);
      auto in_b_accessor = b.get_access<sycl::access::mode::read>(cgh);
      auto out_c_accessor = c.get_access<sycl::access::mode::write>(cgh);

      // One work-item per element: c[i] = a[i] + b[i].
      cgh.parallel_for(range<1>(N), [=](sycl::id<1> i) {
               out_c_accessor[i] = in_a_accessor[i] + in_b_accessor[i];
      });
   });
}

//! [op:evaluate]
bool Add::evaluate(ov::TensorVector& outputs, const ov::TensorVector& inputs) const {
    // The tensors hold i32 data (see validate_and_infer_types), so access
    // them through int32_t pointers.
    int32_t* src_0_ptr = reinterpret_cast<int32_t*>(inputs[0].data());
    int32_t* src_1_ptr = reinterpret_cast<int32_t*>(inputs[1].data());
    int32_t* dst_ptr = reinterpret_cast<int32_t*>(outputs[0].data());

    sycl::queue Q;

    // Flatten the input shape into the total element count.
    std::vector<size_t> in_dims = inputs[0].get_shape();
    int len = static_cast<int>(in_dims[0]);
    for (size_t i = 1; i < in_dims.size(); i++) {
        len = len * static_cast<int>(in_dims[i]);
    }

    // Wrap the host pointers in SYCL buffers. When the buffers go out of
    // scope at the end of this function, SYCL waits for the kernel to
    // finish and writes the result back to dst_ptr.
    sycl::buffer<int32_t, 1> src_0(src_0_ptr, sycl::range<1>(len));
    sycl::buffer<int32_t, 1> src_1(src_1_ptr, sycl::range<1>(len));
    sycl::buffer<int32_t, 1> dst(dst_ptr, sycl::range<1>(len));

    add_vectors(Q, src_0, src_1, dst, len);

    return true;
}

bool Add::has_evaluate() const {
    return true;
}
//! [op:evaluate]

As you can see, this SYCL kernel implementation requires creating buffer objects that the SYCL runtime can manage on the device, and creating accessors to control access to those buffers. So it involves wrapping the raw C++ pointers from the OpenVINO tensors into SYCL buffers. The idea of SYCL programming is similar to OpenCL for heterogeneous platforms like GPU/NPU, which requires buffer management and synchronization between host and device. This sample is just a CPU extension, so no device memory is involved.
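If you prefer to avoid buffers and accessors, SYCL 2020 also offers unified shared memory (USM). Below is a minimal, hypothetical sketch of the same element-wise addition using USM shared allocations; add_vectors_usm is not part of this sample.


#include <cstdint>
#include <cstring>
#include <sycl/sycl.hpp>

// Hypothetical USM variant of the kernel above (sketch only).
void add_vectors_usm(sycl::queue& q, const int32_t* a, const int32_t* b, int32_t* c, size_t n) {
    // Shared allocations are accessible from both host and device.
    int32_t* A = sycl::malloc_shared<int32_t>(n, q);
    int32_t* B = sycl::malloc_shared<int32_t>(n, q);
    int32_t* C = sycl::malloc_shared<int32_t>(n, q);
    std::memcpy(A, a, n * sizeof(int32_t));
    std::memcpy(B, b, n * sizeof(int32_t));

    // No accessors needed; wait() replaces the implicit synchronization
    // that buffer destruction provides.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        C[i] = A[i] + B[i];
    }).wait();

    std::memcpy(c, C, n * sizeof(int32_t));
    sycl::free(A, q);
    sycl::free(B, q);
    sycl::free(C, q);
}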

Step 2: Register the custom operation as an extension.

To register the custom operation via “ov::OpExtension”, refer to the code of “ov_extension.cpp” below:


// Copyright (C) 2018-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include <openvino/core/extension.hpp>
#include <openvino/core/op_extension.hpp>
#include <openvino/frontend/extension.hpp>
#include "add/add.hpp"

//! [ov_extension:entry_point]
OPENVINO_CREATE_EXTENSIONS(
    std::vector<ov::Extension::Ptr>({
        std::make_shared<ov::OpExtension<TemplateExtension::Add>>(),
        std::make_shared<ov::frontend::OpExtension<TemplateExtension::Add>>()
    })
);
//! [ov_extension:entry_point]

Then, you can create the “CMakeLists.txt” file like below. Make sure to use the DPC++ compiler with the option “-fsycl”.


cmake_minimum_required(VERSION 3.16)
# Select the DPC++ compiler before project() so that CMake picks it up.
set(CMAKE_CXX_COMPILER "icpx")
project(custom_layer)
set(CMAKE_CXX_STANDARD 17)

set(TARGET_NAME "custom")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl -O3 -mavx512f -mavx512vl -mavx512pf -mavx512er -mavx512cd")
find_package(OpenVINO REQUIRED)
add_library(${TARGET_NAME} MODULE
        ${CMAKE_SOURCE_DIR}/ov_extension.cpp
        ${CMAKE_SOURCE_DIR}/add/add.cpp
        ${CMAKE_SOURCE_DIR}/add/add.hpp
        )
target_compile_definitions(${TARGET_NAME} PRIVATE IMPLEMENT_INFERENCE_EXTENSION_API)
target_link_libraries(${TARGET_NAME} PRIVATE openvino::runtime)

Use cmake to compile the runtime library for the extension operation. If you have more operations, just add their source files to “add_library()”. You will then get a runtime library named “libcustom.so”. If you meet any problem with the icpx compiler, please make sure you have correctly installed the DPC++ compiler and sourced its environment variables.
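A typical out-of-source build looks like this, run from the folder that contains “CMakeLists.txt”:


mkdir build && cd build
cmake ..
make -j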

Step 3: Create the IR model with the OpenVINO opset

Here we introduce a hack to create the custom operation “SYCL_Add” from the existing OpenVINO opset. Because the parameters and node inputs/outputs of the custom op are the same as those of “ov::op::v1::Add”, we can use this method.


Firstly, create a Python program to build an OpenVINO IR model with “ov::op::v1::Add”. You can also use the OpenVINO C++ API to create the model; here we use Python code just for quick verification.


from openvino.runtime import Model, Type
import openvino.runtime as ov
from openvino.runtime import opset11 as opset

def model():
    data1 = opset.parameter([-1,-1,-1,-1], Type.i32, name='input_1')
    data2 = opset.parameter([-1,-1,-1,-1], Type.i32, name='input_2')
    SYCL_add = opset.add(data1, data2, auto_broadcast='numpy', name="Add")
    SYCL_add.set_friendly_name("Add")
    Result = opset.result(SYCL_add, name='output_add')
    return Model([Result], [data1, data2])

m = model()
ov.save_model(m, "SYCL_add.xml")

Now, you will get the IR model with the OpenVINO “opset.Add” operation. We can directly modify the “.xml” file as below, changing the type of this layer to “SYCL_Add” and the version of the layer to “extension”.

(Figure: manually modify the layer type and version to the extension operation)
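For example, the serialized layer entry in “SYCL_add.xml” changes roughly as follows; the exact id and other attributes may differ in your IR:


Before: <layer id="2" name="Add" type="Add" version="opset1">
After:  <layer id="2" name="Add" type="SYCL_Add" version="extension">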

Step 4: Run and profile the model execution with the SYCL extension library.

Now, you can quickly check functionality and performance with the OpenVINO benchmark_app sample:


$ ./benchmark_app -m ~/POC/sycl_custom/SYCL_add.xml -extensions ~/POC/sycl_custom/build/libcustom.so -data_shape input_1[64,64,64,64],input_2[64,64,64,64] -t 1 -pc

You can check the execution time of your SYCL kernel:

[ INFO ] Performance counts for 0-th infer request
input_1              Status.NOT_RUN       layerType: Parameter            execType: unknown_i32          realTime (ms): 0.000      cpuTime (ms): 0.000
input_2              Status.NOT_RUN       layerType: Parameter            execType: unknown_i32          realTime (ms): 0.000      cpuTime (ms): 0.000
Add                  Status.EXECUTED      layerType: Reference            execType: ref_i32              realTime (ms): 21.977     cpuTime (ms): 21.977
output_add           Status.EXECUTED      layerType: Result               execType: unknown_i32          realTime (ms): 0.001      cpuTime (ms): 0.001
Total time:     21.978 milliseconds
Total CPU time: 21.978 milliseconds

Please note, an “execType” of ref_xxx means your custom reference implementation kernel was executed with that data type (here, i32).

Summary

This blog only shows a workable way to enable a SYCL kernel as an extension of the CPU plugin; it does not focus on guiding you through implementing the SYCL kernel itself. There are many technical skills involved in kernel optimization, so this path is most useful if you already have an efficient SYCL kernel and want to enable it as a CPU extension to support some customized operations. We hope this blog will be helpful to you.