Serving OpenVINO Models using the KServe API Standard

There are many network API specifications for model serving on the market today. Two of the most popular are TensorFlow Serving (TFS) and KServe. Starting with the 2022.2 release, OpenVINO Model Server supports KServe -- meaning both of these common API standards can be used for serving OpenVINO models. This blog explains how to take advantage of either API.

OpenVINO provides an efficient and high-performance runtime for executing deep learning inference. In many situations, AI applications need to delegate inference execution to a remote device or service over a network. There are many advantages to this approach including the ability to scale.

Figure 1. Example of a client sending image input to a model server for image classification.

AI software developers expect the communication interface with a model server to remain stable. In many cases, developers want to perform pre/post-processing on the client side with minimal dependencies. They are reluctant to switch to a different serving implementation if that requires substantial code changes or new dependencies in their applications.

Since the first release in 2018, OpenVINO Model Server has supported the TFS API. And as of 2022, the KServe API is now supported as well.

KServe is a standard designed by several companies across the industry. It has been adopted by model servers like Triton Inference Server and TorchServe. Now the same client can easily switch to use OpenVINO Model Server and leverage the latest optimizations in Intel(R) CPUs and GPUs.

KServe Python Example

Below is a simple example how to use KServe using the Python-based tritonclient.

Create Model Repository

wget{xml,bin} -P models/resnet/1
tree models
	└── resnet
		└── 1        

Start OpenVINO Model Server with a ResNet-50 Model:

docker run --rm -d -v $(pwd)/models:/models -p 9000:9000 -p 8000:8000 openvino/model_server:latest --model_name resnet --model_path /models/resnet --port 9000 --rest_port 8000 --layout NHWC:NCHW --plugin_config '{"PERFORMANCE_HINT":"LATENCY"}'

Install Python Client Library

pip install tritonclient[grpc]==2.22.3

Get the Model Metadata

import tritonclient.grpc as grpcclientclient = grpcclient.InferenceServerClient("localhost:9000")print(client.get_model_metadata("resnet"))

Get a Sample Image


JPEG image of a bee

Run Inference via gRPC Interface with a NumPy File as Input Data

pip install opencv-python
import tritonclient.grpc as grpcclient
import cv2
import numpy as np
img = cv2.imread("bee.jpeg").astype(np.float32)
img = cv2.resize(img, (224, 224))
img = img.reshape(1,224,224,3)
triton_client = grpcclient.InferenceServerClient("localhost:9000")
inputs =[grpcclient.InferInput("0", img.shape, "FP32")]
results = triton_client.infer(model_name="resnet", inputs=inputs)
probabilities = results.as_numpy('1463')

Run Inference via REST Interface with a JPEG File as Input Data

pip install requests
import requests
import json
import numpy as np

image = open("bee.jpeg", 'rb')
image_data = []

http_session = requests.session()
inference_header = {"inputs":[{"name":"0","shape":[1],"datatype":"BYTES"}]}
inference_header_binary = json.dumps(inference_header).encode()
results ="http://localhost:8000/v2/models/resnet/versions/0/infer", inference_header_binary + b''.join(image_data), headers={"Inference-Header-Content-Length":str(len(inference_header_binary))})
probabilities = np.array(json.loads(results.text)['outputs'][0]['data'])


Run Inference via REST Interface with a JPEG File as Input Data using cURL

echo -n '{"inputs" : [{"name" : "0", "shape" : [1], "datatype" : "BYTES"}]}' > request.json
stat --format=%s request.json
cat ./bee.jpeg >> request.json
curl --data-binary "@./request.json" -X POST http://localhost:8000/v2/models/resnet/versions/0/infer -H "Inference-Header-Content-Length: 66"

KServe C++ Example

The inference execution is also made easy in C++ based client applications. The examples below show client application execution based on the Triton C++ client library.

Build the Samples:

git clone –b develop
cd model_server/client/cpp/kserve-api 
cmake . && make && cd samples


Get the Model Metadata

The compiled application grpc_model_metadata can make a call to gRPC endpoint and query for a server model metadata.

./grpc_model_metadata --grpc_port 9000 --grpc_address localhost --model_name resnet
model metadata:
name: "resnet"
versions: "1"
platform: "OpenVINO"
inputs {
  name: "0"
  datatype: "FP32"
  shape: 1
  shape: 224
  shape: 224
  shape: 3
outputs {
  name: "1463"
  datatype: "FP32"
  shape: 1
  shape: 1000

Run Inference via gRPC with a JPEG Encoded File as the Input Data

The sample application grpc_infer_resnet is sending the inference requests for a set of images listed inresnet_input_images.txt including their expected classification number in the ImageNet dataset.

./grpc_infer_resnet --images_list resnet_input_images.txt --labels_list resnet_labels.txt --grpc_port 9000  

airliner.jpeg classified as 404 airliner 
arctic-fox.jpeg classified as 279 Arctic fox, white fox, Alopex lagopus 
bee.jpeg classified as 309 bee 
golden_retriever.jpeg classified as 207 golden retriever 
gorilla.jpeg classified as 366 gorilla, Gorilla gorilla 
magnetic_compass.jpeg classified as 635 magnetic compass 
peacock.jpeg classified as 84 peacock 
pelican.jpeg classified as 144 pelican 
snail.jpeg classified as 113 snail 
zebra.jpeg classified as 340 zebra 
Accuracy 100%

======Client Statistics======
Completed request count 10
Cumulative total request time 110.314 ms

In addition to the KServe API, the TFS API can still be used by client applications. This gives you the option to use a range of client libraries like tensorflow-serving-api or the much lighter and simplified ovmsclient.

To help you get started, we provide samples in Python, C++, Java and Go:

In conclusion, it is now easier to connect and AI applications to OpenVINO Model Server. In existing applications, you can even use the same code to take advantage of the benefits of OpenVINO.