OpenVINO Blog

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
##
Results
Sort By:
Title
|
Date
Milosz
Zeglarski

Use Metrics to Scale Model Serving Deployments in Kubernetes

December 6, 2022

In this blog you will learn how to set up horizontal autoscaling in Kubernetes using inference performance metrics exposed by OpenVINO™ Model Server. This will enable efficient scaling of model serving pods for inference on Intel® CPUs and GPUs.

autoscaling_animation.gif
Figure 1. Visualization of autoscaling OpenVINO Model Server using Prometheus metrics in Kubernetes

Why use custom metrics?

OpenVINO™ Model Server provides high performance AI inference on Intel CPUs and GPUs that can be scaled in Kubernetes. However, when it comes to automatic scaling in Kubernetes, the Horizontal Pod Autoscaler by default, relies on CPU utilization and memory usage metrics only. Although resource consumption indicates how busy the application is, it does not clearly say whether serving provides expected quality of service to the clients or not. Since OpenVINO Model Server exposes performance metrics, we can automatically scale based on service quality rather than resource utilization.

The first metric that comes to mind when thinking about service performance is the duration of request processing, otherwise known as latency. For example, mean or median over a specified period or latency percentiles. OpenVINO Model Server provides such metrics but setting autoscaling based on latency requires specific knowledge about each model and the environment where the inference is running in order to properly set thresholds that trigger scaling.

While autoscaling based on latency works and may be a good choice when you have model-specific knowledge, we will instead focus on a more generic metric using ovms_requests_streams_ratio. Let’s dive into what this means.

In the equation above:

  • currently_processed_requests - number of inference requests to a model being processed by the service at a given time.
  • execution_streams_number – number of execution streams. (When a model is loaded on the device, its computing units are divided into streams. Each stream independently handles inference requests, meaning that the number of streams defines how many inferences can be run on the device in parallel. Note that the more streams there are, the less powerful they are, so we get more throughput at a cost of higher minimal latency / inference time.)

In this equation, for any model exceeding a value of 1 indicates that requests are starting to queue up. Setting the autoscaler threshold for the ovms_requests_streams_ratio metric is somewhat of an arbitrary decision that should be made by a cluster administrator. Setting the threshold too low will result in underutilization of nodes and setting it too high will force the system to work with insufficient resources for extended periods of time. Now that we have chosen a metric for autoscaling, let’s start setting it up.

Deploy Model Server with Autoscaling Metrics

First, we need to create a deployment of OpenVINO Model Server in Kubernetes. To do this, follow instructions to install the OpenVINO Operator in your Kubernetes cluster. Then create a configuration where we can specify the model to be served and enable metrics:

apiVersion: v1 
kind: ConfigMap 
metadata: 
  name: ovms-config 
  namespace: default 
data: 
  ovms_config.json: | 
    { 
    "model_config_list": [ 
         { 
            "config": { 
                 "name": "resnet50-int8", 
                 "base_path": "gs://ovms-public-eu/resnet50-binary" 
            } 
         } 
     ], 
     "monitoring": 
         { 
             "metrics": 
             { 
                 "enable": true 
             } 
         } 
    }

Create ConfigMap:

kubectl apply -f https://raw.githubusercontent.com/openvinotoolkit/operator/main/examples/hpa_custom_metrics/ovms_config.yaml

With the configuration in place, we can deploy OpenVINO Model Server instance:

apiVersion: intel.com/v1alpha1 
kind: ModelServer 
metadata: 
  name: demo 
spec: 
  image_name: 'openvino/model_server:2022.2' 
  service_parameters: 
    grpc_port: 8080 
    rest_port: 8081 
  models_settings: 
    single_model_mode: false 
    config_configmap_name: 'ovms-config' 
    config_path: '/config/ovms_config.json' 
  server_settings: 
    file_system_poll_wait_seconds: 0 
    log_level: INFO 
  deployment_parameters: 
    replicas: 1

Create ModelServer resource:

kubectl apply -f https://raw.githubusercontent.com/openvinotoolkit/operator/main/examples/hpa_custom_metrics/ovms.yaml

Deploy and Configure Prometheus

Next, we need to read serving metrics and expose them to the Horizontal Pod Autoscaler. To do this we will deploy Prometheus to collect serving metrics and the Prometheus Adapter to expose them to the autoscaler.  

Deploy Prometheus Monitoring Tool

Let’s start with Prometheus. In the example below we deploy a simple Prometheus instance via the Prometheus Operator. To deploy the Prometheus Operator, run the following command:

kubectl create -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.59.2/bundle.yaml

Next, we need to configure role-based access control to give Prometheus permission to access the Kubernetes API:

kubectl apply -f https://raw.githubusercontent.com/openvinotoolkit/operator/main/examples/hpa_custom_metrics/prometheus_rbac.yaml

The last step is to create a Prometheus instance by deploying Prometheus resource:

kubectl apply -f https://raw.githubusercontent.com/openvinotoolkit/operator/main/examples/hpa_custom_metrics/prometheus.yaml 

If the deployment was successful, a Prometheus service should be running on port 9090. You can set up a port forward for this service, enabling access to the web interface via localhost on your machine:

kubectl port-forward svc/prometheus-operated 9090:9090

Now, when you open http://localhost:9090 in a browser you should see the Prometheus user interface. Next, we need to expose the Model Server to Prometheus by creating a ServiceMonitor resource:

kubectl apply -f https://raw.githubusercontent.com/openvinotoolkit/operator/main/examples/hpa_custom_metrics/service_monitor.yaml

Once it’s ready, you should see a demo-ovms target in the Prometheus UI:

Figure 2. Prometheus User Interface with demo-ovms target

Now that the metrics are available via Prometheus, we need to expose them to the Horizonal Pod Autoscaler. To do this, we deploy the Prometheus Adapter.

Deploy Prometheus Adapter

Prometheus Adapter can be quickly installed via helm or step-by-step via kubectl. For the sake of simplicity, we will use helm3. Before deploying the adapter, we will prepare a configuration that tells it how to expose the ovms_requests_streams_ratio metric:  

apiVersion: v1 
kind: ConfigMap 
metadata: 
  name: adapter-config 
  namespace: default 
data: 
  config.yaml: |+ 
    rules: 
    - seriesQuery: 'ovms_current_requests' 
      resources: 
        overrides: 
          namespace: 
            resource: namespace 
          pod: 
            resource: pod 
      name: 
        matches: "ovms_current_requests" 
        as: "ovms_requests_streams_ratio" 
      metricsQuery: avg(avg_over_time(ovms_current_requests{<<.LabelMatchers>>}[1m]) 
 / avg_over_time(ovms_streams{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)

Create a ConfigMap:

kubectl apply –f https://raw.githubusercontent.com/openvinotoolkit/operator/main/examples/hpa_custom_metrics/prometheus_adapter_config.yaml

Now that we have a configuration, we can install the adapter:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

helm repo update 

helm install --set 'prometheus.url=http://prometheus-operated.default.svc' --set 'rules.existing=adapter-config' prometheus-adapter prometheus-community/prometheus-adapter

Keep checking until custom metrics are available from the API:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq 
{ 
  "kind": "APIResourceList", 
  "apiVersion": "v1", 
  "groupVersion": "custom.metrics.k8s.io/v1beta1", 
  "resources": [ 
    { 
      "name": "namespaces/ovms_requests_streams_ratio", 
      "singularName": "", 
      "namespaced": false, 
      "kind": "MetricValueList", 
      "verbs": [ 
        "get" 
      ] 
    }, 
    { 
      "name": "pods/ovms_requests_streams_ratio", 
      "singularName": "", 
      "namespaced": true, 
      "kind": "MetricValueList", 
      "verbs": [ 
        "get" 
      ] 
    } 
  ] 
}

Once you see the output above, you can configure the Horizontal Pod Autoscaler to use these metrics.

Set up Horizontal Pod Autoscaler

As mentioned previously, we will set up autoscaling based on the ovms_requests_streams_ratio metric and target an average value of 1. This will try to keep all streams busy all the time while preventing requests from queueing up. We will set minimum and maximum number of replicas to 1 and 3, respectively, and the stabilization window for both upscaling and downscaling to 120 seconds:

kind: HorizontalPodAutoscaler 
apiVersion: autoscaling/v2 
metadata: 
  name: ovms-hpa 
spec: 
  scaleTargetRef: 
    apiVersion: intel.com/v1alpha1 
    kind: ModelServer 
    name: demo 
  minReplicas: 1 
  maxReplicas: 3 
  metrics: 
  - type: Pods 
    pods: 
      metric: 
        name: ovms_requests_streams_ratio 
      target: 
        type: AverageValue 
        averageValue: 1 
  behavior: 
    scaleDown: 
      stabilizationWindowSeconds: 120 
    scaleUp: 
      stabilizationWindowSeconds: 120

Create HorizontalPodAutoscaler:

kubectl apply -f https://raw.githubusercontent.com/openvinotoolkit/operator/main/examples/hpa_custom_metrics/ovms_hpa.yaml 

Once deployed, you can generate some load for your model and see the results. Below you can see how Horizontal Pod Autoscaler scales the number of replicas by checking its status:

kubectl describe hpa ovms-hpa
Autoscaler Status

This data can also be visualized with a Grafana dashboard:

Metrics Visualized on Grafana Dashboard

As you can see, with OpenVINO Model Server metrics, you can quickly set up inferencing system with monitoring and autoscaling for any model. Moreover, with custom metrics, you can set up autoscaling for inference on any Intel CPUs and GPUs.  

See also:

Read More...
Dariusz
Trawinski

Serving OpenVINO Models using the KServe API Standard

There are many network API specifications for model serving on the market today. Two of the most popular are TensorFlow Serving (TFS) and KServe. Starting with the 2022.2 release, OpenVINO Model Server supports KServe -- meaning both of these common API standards can be used for serving OpenVINO models. This blog explains how to take advantage of either API.

OpenVINO provides an efficient and high-performance runtime for executing deep learning inference. In many situations, AI applications need to delegate inference execution to a remote device or service over a network. There are many advantages to this approach including the ability to scale.

Figure 1. Example of a client sending image input to a model server for image classification.

AI software developers expect the communication interface with a model server to remain stable. In many cases, developers want to perform pre/post-processing on the client side with minimal dependencies. They are reluctant to switch to a different serving implementation if that requires substantial code changes or new dependencies in their applications.

Since the first release in 2018, OpenVINO Model Server has supported the TFS API. And as of 2022, the KServe API is now supported as well.

KServe is a standard designed by several companies across the industry. It has been adopted by model servers like Triton Inference Server and TorchServe. Now the same client can easily switch to use OpenVINO Model Server and leverage the latest optimizations in Intel(R) CPUs and GPUs.

KServe Python Example

Below is a simple example how to use KServe using the Python-based tritonclient.

Create Model Repository

wget https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.1/models_bin/2/resnet50-binary-0001/FP32-INT1/resnet50-binary-0001.{xml,bin} -P models/resnet/1
tree models
models
	└── resnet
		└── 1        
			├──resnet50-binary-0001.bin        
			└──resnet50-binary-0001.xml

Start OpenVINO Model Server with a ResNet-50 Model:

docker run --rm -d -v $(pwd)/models:/models -p 9000:9000 -p 8000:8000 openvino/model_server:latest --model_name resnet --model_path /models/resnet --port 9000 --rest_port 8000 --layout NHWC:NCHW --plugin_config '{"PERFORMANCE_HINT":"LATENCY"}'

Install Python Client Library

pip install tritonclient[grpc]==2.22.3

Get the Model Metadata

import tritonclient.grpc as grpcclientclient = grpcclient.InferenceServerClient("localhost:9000")print(client.get_model_metadata("resnet"))

Get a Sample Image

wget https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/common/static/images/bee.jpeg

JPEG image of a bee

Run Inference via gRPC Interface with a NumPy File as Input Data

pip install opencv-python
import tritonclient.grpc as grpcclient
 
import cv2
import numpy as np
 
img = cv2.imread("bee.jpeg").astype(np.float32)
 
img = cv2.resize(img, (224, 224))
 
img = img.reshape(1,224,224,3)
 
triton_client = grpcclient.InferenceServerClient("localhost:9000")
 
inputs =[grpcclient.InferInput("0", img.shape, "FP32")]
 
inputs[0].set_data_from_numpy(img)
 
results = triton_client.infer(model_name="resnet", inputs=inputs)
probabilities = results.as_numpy('1463')
print(np.argmax(probabilities))
309

Run Inference via REST Interface with a JPEG File as Input Data

pip install requests
import requests
import json
import numpy as np

image = open("bee.jpeg", 'rb')
image_data = [image.read()]
image.close()

http_session = requests.session()
inference_header = {"inputs":[{"name":"0","shape":[1],"datatype":"BYTES"}]}
inference_header_binary = json.dumps(inference_header).encode()
results = http_session.post("http://localhost:8000/v2/models/resnet/versions/0/infer", inference_header_binary + b''.join(image_data), headers={"Inference-Header-Content-Length":str(len(inference_header_binary))})
probabilities = np.array(json.loads(results.text)['outputs'][0]['data'])
print(np.argmax(probabilities))
309

 

Run Inference via REST Interface with a JPEG File as Input Data using cURL

echo -n '{"inputs" : [{"name" : "0", "shape" : [1], "datatype" : "BYTES"}]}' > request.json
stat --format=%s request.json
66
cat ./bee.jpeg >> request.json
curl --data-binary "@./request.json" -X POST http://localhost:8000/v2/models/resnet/versions/0/infer -H "Inference-Header-Content-Length: 66"

KServe C++ Example

The inference execution is also made easy in C++ based client applications. The examples below show client application execution based on the Triton C++ client library.

Build the Samples:

git clone –b develop https://github.com/openvinotoolkit/model_server
cd model_server/client/cpp/kserve-api 
cmake . && make && cd samples

 

Get the Model Metadata

The compiled application grpc_model_metadata can make a call to gRPC endpoint and query for a server model metadata.

./grpc_model_metadata --grpc_port 9000 --grpc_address localhost --model_name resnet
model metadata:
name: "resnet"
versions: "1"
platform: "OpenVINO"
inputs {
  name: "0"
  datatype: "FP32"
  shape: 1
  shape: 224
  shape: 224
  shape: 3
}
outputs {
  name: "1463"
  datatype: "FP32"
  shape: 1
  shape: 1000
}

Run Inference via gRPC with a JPEG Encoded File as the Input Data

The sample application grpc_infer_resnet is sending the inference requests for a set of images listed inresnet_input_images.txt including their expected classification number in the ImageNet dataset.

./grpc_infer_resnet --images_list resnet_input_images.txt --labels_list resnet_labels.txt --grpc_port 9000  

airliner.jpeg classified as 404 airliner 
arctic-fox.jpeg classified as 279 Arctic fox, white fox, Alopex lagopus 
bee.jpeg classified as 309 bee 
golden_retriever.jpeg classified as 207 golden retriever 
gorilla.jpeg classified as 366 gorilla, Gorilla gorilla 
magnetic_compass.jpeg classified as 635 magnetic compass 
peacock.jpeg classified as 84 peacock 
pelican.jpeg classified as 144 pelican 
snail.jpeg classified as 113 snail 
zebra.jpeg classified as 340 zebra 
Accuracy 100%

======Client Statistics======
Completed request count 10
Cumulative total request time 110.314 ms

In addition to the KServe API, the TFS API can still be used by client applications. This gives you the option to use a range of client libraries like tensorflow-serving-api or the much lighter and simplified ovmsclient.

To help you get started, we provide samples in Python, C++, Java and Go:

In conclusion, it is now easier to connect and AI applications to OpenVINO Model Server. In existing applications, you can even use the same code to take advantage of the benefits of OpenVINO.

Read More...
Dariusz
Trawinski

Deploy AI Inference with OpenVINO™ and Kubernetes

July 19, 2022

Introduction

Model servers play a vital role in bringing AI models from development to production. Models are served via network endpoints which expose APIs to run predictions. These microservices abstract inference execution while providing scalability and efficient resource utilization.

In this blog, you will learn how to use key features of the OpenVINO™ Operator for Kubernetes. We will demonstrate how to deploy and use OpenVINO Model Server in two scenarios:

1. Serving a single model
2. Serving a pipeline of multiple models

101.png

Kubernetes provides an optimal environment for deploying model servers but managing these resources can be challenging in larger-scale deployments. Using our Operator for Kubernetes makes this easier.

Install via OperatorHub

The OpenVINO Operator can be installed in a Kubernetes cluster from the OperatorHub. Just search for OpenVINO and click the 'Install' button.

102.png

Serve a Single OpenVINO Model in Kubernetes

Create a new instance of OpenVINO Model Server by defining a custom resource called ModelServer using the provided CRD. All parameters are explained here.
In the sample below, a fully functional model server is deployed along with a ResNet-50 image classification model pulled from Google Cloud storage.

kubectl apply -f https://raw.githubusercontent.com/openvinotoolkit/operator/main/config/samples/intel_v1alpha1_ovms.y...

A successful deployment will create a service called ovms-sample.

kubectl get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ovms-sample ClusterIP 10.98.164.11 <none> 8080/TCP,8081/TCP 5m30s

Now that the model is deployed and ready for requests, we can use the ovms-sample service with our Python client known as ovmsclient.

Send Inference Requests to the Service

The example below shows how to use the ovms-sample service inside the same Kubernetes cluster where it’s running. To create a client container, launch an interactive session to a pod with Python installed:

kubectl create deployment client-test --image=python:3.8.13 -- sleep infinity
kubectl exec -it $(kubectl get pod -o jsonpath="{.items[0].metadata.name}" -l app=client-test) -- bash

From inside the client container, we will connect to the model server API endpoints. A simple curl command lists the served models with their version and status:

curl http://ovms-sample:8081/v1/config
{
"resnet" : 
{
  "model_version_status": [
   {
    "version": "1",
    "state": "AVAILABLE",
    "status": {
    "error_code": "OK",
    "error_message": "OK"
   }
  }
 ]
}

Additional REST API calls are described in the documentation.

Now let’s use the ovmsclient Python library to process an inference request. Create a virtual environment and install the client with pip:

python3 -m venv /tmp/venv
source /tmp/venv/bin/activate
pip install ovmsclient

Download a sample image of a zebra:

curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/common/static/images/zebra... -o /tmp/zebra.jpeg

The Python code below collects the model metadata using the ovmsclient library:

from ovmsclient import make_grpc_client
client = make_grpc_client("ovms-sample:8080")
model_metadata = client.get_model_metadata(model_name="resnet")
print(model_metadata)

The code above returns the following response:

{'model_version': 1, 'inputs': {'map/TensorArrayStack/TensorArrayGatherV3:0': {'shape': [-1, -1, -1, -1], 'dtype': 'DT_FLOAT'}}, 'outputs': {'softmax_tensor': {'shape': [-1, 1001], 'dtype': 'DT_FLOAT'}}}

Now create a simple Python script to classify the JPEG image of the zebra :

cat >> /tmp/predict.py <<EOL
from ovmsclient import make_grpc_client
import numpy as np
client = make_grpc_client("ovms-sample:8080")
with open("/tmp/zebra.jpeg", "rb") as f:
data = f.read()
inputs = {"map/TensorArrayStack/TensorArrayGatherV3:0": data}
results = client.predict(inputs=inputs, model_name="resnet")
print("Detected class:", np.argmax(results))
EOL

python /tmp/predict.py
Detected class: 341

The detected class from imagenet is 341, which represents `zebra`.

Serve a Multi-Model Pipeline

Now that we have run a simple example of serving a single model, let’s explore the more advanced scenario of a multi-model vehicle analysis pipeline. This pipeline leverages the Directed Acyclic Graph feature in OpenVINO Model Server.

The remaining steps in this demo require `mc` minio client binary and access to an S3-compatible bucket. See the quick start with MinIO for more information about setting up S3 storage in your cluster.

First, prepare all dependencies using the vehicle analysis pipeline example below:

git clone https://github.com/openvinotoolkit/model_server
cd model_server/demos/vehicle_analysis_pipeline/python
make

The command above downloads the required models and builds a custom library to run the pipeline, then places these files in the workspace directory. Copy these files to a shared S3-compatible storage accessible within the cluster (like MinIO). In the example below, the S3 server alias is mys3:

mc cp --recursive workspace/vehicle-detection-0202 mys3/models-repository/
mc cp --recursive workspace/vehicle-attributes-recognition-barrier-0042 mys3/models-repository/
mc ls -r mys3
43MiB models-repository/vehicle-attributes-recognition-barrier-0042/1/vehicle-attributes-recognition-barrier-0042.bin
118KiB models-repository/vehicle-attributes-recognition-barrier-0042/1/vehicle-attributes-recognition-barrier-0042.xml
7.1MiB models-repository/vehicle-detection-0202/1/vehicle-detection-0202.bin
331KiB models-repository/vehicle-detection-0202/1/vehicle-detection-0202.xml

To use the previously created model server config file in `workspace/config.json`, we need to adjust the paths to models and the custom node library. The commands below change the model paths to use our S3 bucket and the custom node library to `/config` directory which will be mounted as a Kubernetes configmap.

sed -i 's/\/workspace\/vehicle-detection-0202/s3:\/\/models-repository\/vehicle-detection-0202/g' workspace/config.json
sed -i 's/\/workspace\/vehicle-attributes-recognition-barrier-0042/s3:\/\/models-repository\/vehicle-attributes-recognition-barrier-0042/g' workspace/config.json
sed -i 's/workspace\/lib/config/g' workspace/config.json

Next, add both the config file and the custom name library to a Kubernetes config map:

kubectl create configmap ovms-pipeline --from-file=config.json=workspace/config.json \
--from-file=libcustom_node_model_zoo_intel_object_detection.so=workspace/lib/libcustom_node_model_zoo_intel_object_detection.so

Now we are ready to deploy the model server with the pipeline configuration. Use kubectl to apply the following ovms-pipeline.yaml configuration:

apiVersion: intel.com/v1alpha1
kind: ModelServer
metadata:
  name: ovms-pipeline
spec:
  image_name: openvino/model_server:latest
  deployment_parameters:
    replicas: 1
  models_settings:
    single_model_mode: false
    config_configmap_name: "ovms-pipeline"
  server_settings:
    file_system_poll_wait_seconds: 0
    log_level: "INFO"
  service_parameters:
    grpc_port: 8080
    rest_port: 8081
    service_type: ClusterIP
  models_repository:
storage_type: "s3"
    aws_access_key_id: minioadmin
    aws_secret_access_key: minioadmin
    aws_region: us-east-1
    s3_compat_api_endpoint: http://mys3.example.com:9000
kubectl apply -f ovms-pipeline.yaml

This creates the model serving service

kubectl get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ovms-pipeline ClusterIP 10.99.53.175 <none> 8080/TCP,8081/TCP 26m

To test the pipeline, we can use the same client container as the previous example with a single model. From inside the client container shell, download a sample image to analyze:

curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/common/static/images/cars/... -o /tmp/road1.jpg
104.png
cat >> /tmp/pipeline.py <<EOL
from ovmsclient import make_grpc_client
import numpy as np
client = make_grpc_client("ovms-pipeline:8080")
with open("/tmp/road1.jpg", "rb") as f:
data = f.read()
inputs = {"image": data}
results = client.predict(inputs=inputs, model_name="multiple_vehicle_recognition")
print("Returned outputs:",results.keys())
EOL

Run a prediction using the following command:

python /tmp/pipeline.py
Returned outputs: dict_keys(['colors', 'vehicle_coordinates', 'types', 'vehicle_images', 'confidence_levels'])

The sample code above returns a list of the pipeline outputs without data interpretation. More complete client code samples for vehicle analysis are available on GitHub.

Conclusion

OpenVINO Model Server makes it easy to deploy and manage inference as a service in Kubernetes environments. In this blog, we learned how to run predictions using the ovmsclient Python library with both a single model scenario and with multiple models using a DAG pipeline.

Learn more about the OpenVINO Operator: https://github.com/openvinotoolkit/operator

Check out our other model serving demos.

Read More...
Xiake
Sun

Deploy AI Workloads with OpenVINO™ Model Server across CPUs and GPUs

January 10, 2023

Authors: Xiake Sun, Kunda Xu

1. Introduction

Figure 1. OpenVINO™ Model Server Overview

OpenVINO™ Model Server (OVMS) is a high-performance system for serving models. Implemented in C++ for scalability and optimized for deployment on Intel® architectures, the model server uses the same architecture and API as TensorFlow Serving and KServe while applying OpenVINO™ for inference execution. Inference service is provided via gRPC or REST API, making deploying new algorithms and AI experiments easy.

Docker is the recommended way to deploy OpenVINO™ Model Server. Pre-built container images are available on Docker Hub and Red Hat Ecosystem Catalog.

In this blog, we will introduce how to leverage OpenVINO™ Model Server to deploy AI workload across various hardware platforms, including Intel® CPU, Intel® GPU, and Nvidia GPU.

2. OpenVINO™ Model Server Pre-built Docker Image for Intel® CPU

Pull the latest pre-built OVMS docker image hosted in Docker Hub:

docker pull openvino/model_server:latest

Verify OVMS docker image and OpenVINO™ backend version:

docker run -it openvino/model_server:latest --version

Here is an example output of the command line above:

Figure 2. Example output of OVMS and OpenVINO™ backend version

Download a model and create an appropriate directory structure. For example, a person-vehicle-bike-detection model from Intel’s Open Model Zoo:

export MODEL_DIR=$PWD
mkdir -p workspace/person-vehicle-bike-detection-2000/1
cd workspace/person-vehicle-bike-detection-2000/1
wget https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.3/models_bin/1/person-vehicle-bike-detection-2000/FP32/person-vehicle-bike-detection-2000.xml
wget https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.3/models_bin/1/person-vehicle-bike-detection-2000/FP32/person-vehicle-bike-detection-2000.bin
cd $MODEL_DIR

where a model directory structure looks like that:

Figure 3. Example of model directory structure for OVMS

After the model repository preparation, let’s start OVMS to host a person-vehicle-bike-detection-2000 model in the Model Server with Intel® CPU as target device.

docker run -p 30001:30001 -p 30002:30002 -it \
-v ${MODEL_DIR}/workspace:/workspace openvino/model_server:latest \
--model_path /workspace/person-vehicle-bike-detection-2000 \
--model_name person-vehicle-bike-detection-2000 --port 30001 \
--rest_port 30002 --target_device CPU

The parameter “--target_device CPU” specified workload to allocate on Intel® CPU. “--port 30001” set up the gRPC server port as 30001, and “--rest_port 30001” set up the REST server port as 30002. The parameter “--model_path” specified the model directory path in the docker image, while “--model_name” specified which model to host in the model server.

3. Build OpenVINO™ Model Server Benchmark Client

OpenVINO™ Model Server provides a useful tool - Benchmark Client to generate traffic and measure the performance of the model served in OpenVINO™ Model Server. In this blog, you could use Benchmark Client to verify OpenVINO™ model server functionality quickly.

To build the docker image and tag it as benchmark_client as follow:

git clone https://github.com/openvinotoolkit/model_server.git
cd model_server/demos/benchmark/python
docker build . -t benchmark_client

Here is an example to use benchmark_client to generate 8 requests and send them via gRPC API, then receive the severed model performance data:

docker run --network host benchmark_client -a localhost -r 30002 \
-m person-vehicle-bike-detection-2000 -p 30001 -n 8 --report_warmup --print_all

In the output, "window_netto_frame_rate" measures the overall performance of a service - how many frames per second the model server processed. Please note, model serving example above was set up with default parameters, see the performance tuning section for more details.

4. Build OpenVINO™ Model Server from Source Code

Download the model server source code as follows:

git clone https://github.com/openvinotoolkit/model_server.git
cd model_server

OVMS provides a “Makefile” to build the docker image with environment parameters, which you can pass via the command line for the building process.

  • BASE_OS: base OS docker image used to build OVMS docker image, current supported values are “ubuntu” (by default) and “redhat”.
  • OV_USE_BINARY:  control whether to use a pre-built OpenVINO™ binary package for building OVMS docker image. If "OV_USE_BINARY=1", OVMS use a pre-built OpenVINO™ binary package. If "OV_USE_BINARY=0", OpenVINO™ will be built from source code during OVMS building process.
  • DLDT_PACKAGE_URL: If "OV_USE_BINRAY=1", "DLDT_PACKAGE_URL" is used to set the URL path to the pre-built OpenVINO™ binary package
  • GPU: control whether to enable OVMS support for Intel® GPU. By default, “GPU=0” disables OVMS support for Intel® GPU. If "GPU=1", OVMS support for intel® GPU will be enabled.
  • NVIDIA: control whether to enable OVMS support for Nvidia GPU. By default, "NVIDIA=0" disables OVMS support for Nvidia GPU. If "NVIDIA=1", OVMS support for Nvidia GPU will be enabled, which requires building OpenVINO from the source code.
  • OV_SOURCE_BRANCH: If "OV_USE_BINARY=0", "OV_SOURCE_BRANCH" is used to set the target branch or commit hash of OpenVINO source code. The default value is “master”
  • OV_CONTRIB_BRANCH: If "NVIDIA=1", "OV_CONTRIB_BRANCH" is used to set the target branch or commit hash of OpenVINO contrib source code. The default value is “master"

Here is an example of building OVMS with the "releases/2022/3" branch of OpenVINO™ GitHub source code with target device Intel® CPU.

OV_USE_BINARY=0 OV_SOURCE_BRANCH=releases/2022/3 make docker_build

Built docker image will be available in the host as “openvino/model_server:latest”.

5. Build OpenVINO™ Model Server with Intel® GPU Support

Since OpenVINO™ 2022.3 release, OpenVINO™ added full support for Intel’s integrated GPU, Intel’s discrete graphics cards, such as Intel® Data Center GPU Flex Series, and Intel® Arc™ GPU for DL inferencing workloads in the intelligent cloud, edge, and media analytics workloads. OpenVINO™ Model Server 2022.3 also added support for Intel® GPU.  The pre-built OpenVINO™ Model Server docker image with GPU driver for Intel® GPU is available in Docker Hub:

docker pull openvino/model_server:latest-gpu

Here is an example of building OVMS with Intel® GPU support based on the OpenVINO™ source code:

GPU=1 OV_USE_BINARY=0 OV_SOURCE_BRANCH=releases/2022/3 make docker_build

The default GPU driver (version 22.8 for RedHat 8.7 or version 22.35 for Ubuntu 20.04) will be installed during the building process. Built docker image will be available in the host as “openvino/model_server:latest-gpu”.

Here is an example to launch the OVMS docker image with Intel® GPU as target device:

docker run -p 30001:30001 -p 30002:30002 -it --device=/dev/dri \
--group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
-v ${MODEL_DIR}/workspace:/workspace openvino/model_server:latest-gpu \
--model_path /workspace/person-vehicle-bike-detection-2000 \
--model_name person-vehicle-bike-detection-2000 --port 30001 \
--rest_port 30002 --target_device GPU

The parameter “--target_device GPU” specified workload to allocate on Intel® GPU. The parameter “--device /dev/dri” is used to pass the device context. The parameter “--group-add=$(stat -c"%g" /dev/dri/render\* | head -n 1) -u $(id -u):$(id -g)” is used to ensure the model server process security context account with correct permissions to run inference on Intel® GPU.

Here is an example to verify the severed model performance on Intel® GPU with benchmark_client:

docker run --network host benchmark_client -a localhost -r 30002 \
-m person-vehicle-bike-detection-2000 -p 30001 -n 8 --report_warmup --print_all

6. Build OpenVINO™ Model Server with Nvidia GPU Support

OpenVINO™ Model Server can also support Nvidia GPU cards by using NVIDIA plugin from the GitHub repo openvino_contrib. Here is an example of building OVMS with Nvidia GPU support step by step:

First, pull the Nvidia docker base image with the GPU driver, e.g.,“docker.io/nvidia/cuda:11.8.0-runtime-ubuntu20.04”, please ensure to install same GPU driver version in the local host environment.

docker pull docker.io/nvidia/cuda:11.8.0-runtime-ubuntu20.04

Install Nvidia Container Toolkit to expose the GPU driver to docker and restart docker.

# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list \
| sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Build OVMS docker image with Nvidia GPU support.“NVIDIA=1” enables to build OVMS with Nvidia GPU support, and “OV_USE_BINARY=0” enables building OpenVINO from the source code. Besides, “OV_SOURCE_BRANCH=releases/2022/3” refer to the OpenVINO™ GitHub "releases/2022/3" branch, while “OV_CONTRIB_BRANCH=releases/2022/3” refer to the OpenVINO contrib GitHub "releases/2022/3" branch.

NVIDIA=1 OV_USE_BINARY=0 OV_SOURCE_BRANCH=releases/2022/3 \
OV_CONTRIB_BRANCH=releases/2022/3 make docker_build

Built docker image will be available in the host as “openvino/model_server-cuda:latest”.

Here is an example to launch the OVMS docker image with Nvidia GPU as target device:

docker run -p 30001:30001 -p 30002:30002 -it --gpus all \
-v ${MODEL_DIR}/workspace:/workspace openvino/model_server:latest-cuda \
--model_path /workspace/person-vehicle-bike-detection-2000 \
--model_name person-vehicle-bike-detection-2000 --port 30001 \
--rest_port 30002 --target_device NVIDIA

The parameter “--target_device NVIDIA” is specified to allocate workload on NVIDIA GPU. The parameter “--gpu all” flag is used to access all GPU resources available in the host system.

Here is an example to verify the severed model performance on Nvidia GPU with benchmark_client:

docker run --network host benchmark_client -a localhost -r 30002 \
-m person-vehicle-bike-detection-2000 -p 30001 -n 8 --report_warmup --print_all

7. Migration from Triton Inference Server to OpenVINO™ Model Server

KServe, as a robust and extensible cloud-native model server for Kubernetes, is widely adopted by model servers including Triton Inference Server. Since the 2022.3 release, OpenVINO™ Model Server added KServer API that supports REST and gRPC calls. Therefore, OVMS with Nvidia GPU support is fully compatible to receive requests from Triton Inference Client and run inference on Nvidia GPU.

Here is an example to pull the Triton Inference Server docker image:

docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:22.10-py3-sdk

Then you could use perf_client tools in the docker image to send generated workload as requests to OVMS via KServe API with gRPC port, then receive measured performance data on Nvidia GPU.  

./install/bin/perf_client -m person-vehicle-bike-detection-2000 \
-i gRPC -u localhost:30001

The simple example above shows how smoothly developers can migrate their own AI service workload from Triton Inference Server to OpenVINO™ Model Server without any change from the client.  

Read More...
Devang
Aggarwal

Leverage the power of Model Caching in your AI Applications

December 27, 2022
Authors: Devang Aggarwal, Eddy Kim, Preetha Veeramalai

Choosing the right type of hardware for deep learning tasks is a critical step in the AI development workflow. Here at Intel, we provide developers, like yourself, with a variety of hardware options to meet your compute requirements. From Intel® CPUs to Intel® GPUs, there are a wide array of hardware platforms available to meet your needs. When it comes to inferencing on different hardware, the little things matter. For example, the loading of deep learning models, which can be a lengthy process and can lead to a difficult user experience on application startup.

Are there ways to achieve faster model loading time on such devices?

Short answer is, yes, there are ways; one way is to handle the model loading time. Model loading performs several time-consuming device-specific optimizations and network compilations, which can also result in developers seeing a relatively higher first inference latency. These delays can lead to a difficult user experience during application startup. This problem can be solved through a mechanism called Model Caching. Model Caching solves the issue of model loading time by caching the final optimized model directly into a file. Reusing cached networks can significantly reduce the model loading time.

Model Caching

With OpenVINO 2022.3, model caching is currently implemented as a preview feature. To accelerate first inference latency on Intel® GPU, not only should the kernel source code be compiled in a form that can be executed on the GPU, but also various optimization passes must be performed. Kernel caching reuses only the kernels, but model caching reuses even the output of the optimization passes, so the model loading time can be further reduced. Before model caching, kernel caching was used in the same manner: by setting the CACHE_DIR configuration key to a folder where the cache should be stored. Now, to use the preview feature of model caching, set the OV_GPU_CACHE_MODEL environment variable to 1. Since the extension of the cache file created by kernel caching is “cl_cache” and the extension of the cache file created by model caching is “blob”, it is possible to check whether model caching is activated through this.

Note: Currently this is a preview feature with OpenVINO 2022.3. This feature will be fully available in OpenVINO 2023.0.

Developers can now also leverage this preview feature from OpenVINO™ Toolkit in OpenVINO™ Execution Provider for ONNX Runtime, a product that accelerates inferencing of ONNX models using ONNX Runtime API’s while using the OpenVINO™ toolkit as a backend. With the OpenVINO™ Execution Provider, ONNX Runtime delivers better inferencing performance on the same hardware compared to generic acceleration on Intel® CPU, GPU, and VPU. Additionally, by using model caching, OpenVINO™ Execution Provider can speed up the first inference latency of deep learning models on Intel® GPU.

In OpenVINO™ Execution Provider for ONNX Runtime, the model caching feature can been abled by setting the ONNX Runtime config option ‘use_compiled_network’ to True while using the C++/Python API’s. This config option acts like a switch to enable and disable the model caching feature that saves the final optimized model into a .blob file during the very first inference of the model on Intel® hardware.

The blobs are loaded from a directory named ‘ov_compiled_blobs’ relative to the executable path by default. This path however can be overridden using the ONNX Runtime config option ‘blob_dump_path’ which is used to explicitly specify the path where you would like to dump and load the blobs files from when already using theuse_compiled_network (model caching) setting.

Refer to Configuration Options for more information about using these options.

Conclusion

With the Model Caching feature, the deep learning model loading time should significantly decrease. You can now utilize this feature in both the Intel® Distribution of OpenVINO™ Toolkit and OpenVINO™ Execution Provider for ONNX Runtime and experience better first inference latency for your AI models.

Notices & Disclaimers

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

©Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or itssubsidiaries.  Other names and brands maybe claimed as the property of others.  

Read More...
Kunda
Xu

OpenVINO™ optimize Fairseq S2T model

December 14, 2022

OpenVINO™ Optimize Fairseq S2T Model

Introduction

Fairseq is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.

There are 2 steps to generate model ready for OpenVINO™ acceleration:

1. Use torch.export.onnx function convert the “.pt” model to “.onnx” model;

2. Use OpenVINO™ MO toolkit convert the “.onnx” model to “IR” model.

Figure 1. OpenVINO™ optimize Fairseq model workflow

The following graph is the Fairseq framework inference workflow, it defines the model structure by “Model Config”, composes “Model Definition List” through multiple subgraph models, and dynamically loads the submodules in the model inference runtime.

Such as in the S2T task, model consists of two parts: Encoder and Decoder.
·       Encoder is for extracting feature information from audio file.
·       Decoder is for decoding the feature information to generate text information.

Fairseq Inference workflow

The length of audio information will affect the length of the feature information, and the length of the feature information will affect the Decoder submodule loop’s times. Therefore, the structure of the S2T model is dynamically defined according to the length of the input audio.

Figure 2. Fairseq S2T model inference workflow

To optimize Fairseq framework model there’re 4 challenges need to be solved:
-       Fairseq define submodules for various function, include variable in model layer define.
-       Model structure is dynamically loaded in runtime and can’t export a whole torch model graph.
-       Encoder and Decoder part models’ input shapes are dynamic, depending on input data size.
-       Decoder part loop times depends by input sequence lengths.

OpenVINO™ optimize Fairseq workflow

So that we should use some optimization tricks to solve these problems, to make sure the pipeline optimized by OpenVINO™.
-       Divide model into Encoder and Decoder two parts, and separately export to onnx model,
-       Because of the model structure define by input seq_len, should export dynamic shape onnx model.
-       Convert onnx to IR model by OpenVINO™ MO toolkit.
-       Replace the Fairseq S2T task pipeline Encoder and Decoder into IR model.
-       Loading Inference Engine to run pipeline the pipeline on OpenVINO™.

Figure 3. OpenVINO™ optimize Fairseq S2T model workflow

Requirement

-       Fairseq is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks
-       OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference which can boost deep learning performance in computer vision, automatic speech recognition, natural language processing and other common task.
-       Python version >=3.8
-       PyTorch version >=1.10.0
Reference: GitHub: Fairseq-OpenVINO

Quick Start Demo

Step 1. Install fairseq and requirement

# Install Fairseq
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
#(Optional) Install the latest stable release (0.10.x)
pip install fairseq

#Install OpenVINO™
Reference: Install OpenVINO by source code for Linux
Reference: Install OpenVINO by release package

Step 2. Download audio file and pre-train model file

In this blog we refer the “S2T Example: STon CoVoST” as sample, Preparation dataset and pre-train model can follow the Fairseq original step. Also, you can use “torch audio” to convert audio file to build customer dataset.

import torchaudio
# load tensor from file
waveform, sample_rate = torchaudio.load('foo.arm') 
# save tensor to file
torchaudio.save('foo_save.wav', waveform, sample_rate)

Step 3. Modify code to export onnx

Torch model export to onnx, We should adjust the contents in fairseq/sequence_generator.py +781 line "self.save_onnx = True" , +782 line "self.openvino_engine = False" The encoder.onnx and decoder.onnx will save in models

python fairseq_cli/interactive.py datasets/en/ --config-yaml config_st_en_de.yaml --task speech_to_text --path models/covost2_fr_en_st_transformer_s.pt --max-tokens 50000 --beam 1

Encoder part model export to dynamic onnx

Figure 4. Encoder part model export to dynamic onnx

Decoder part model export to dynamic onnx

Figure 5. Decoder part model export to dynamic onnx

Step 4. Convert Model to IR

Convert encoder.onnx and decoder.onnx to encoder.xml and decoder.xml

# Convert encoder onnx to IR
mo -m encoder.onnx --input "onnx::Transpose_0[-1,-1,-1],src_lengths[-1]"
# Convert decoder onnx to IR
mo -m decoder.onnx --input "prev_output_tokens[-1,-1],onnx::MatMul_1[-1,-1,-1]"

Step 5. OpenVINO™ Inference Engine optimize S2T pipeline

OpenVINO™ Inference S2T pipeline We should adjust the contents in fairseq/sequence_generator.py +781 line "self.save_onnx = False" , +782 line "self.openvino_engine =True" Use the converted the model to run OpenVINO™ Inference S2T pipeline.

python fairseq_cli/interactive.py datasets/en/ --config-yaml config_st_en_de.yaml --task speech_to_text --path models/covost2_fr_en_st_transformer_s.pt --max-tokens 50000 --beam 1

OpenVINO™ Inference Engine initialization

Figure 6. OpenVINO™ Inference Engine Initialization

Encoder part inference by OpenVINO™

Figure 7. Encoder part inference by OpenVINO™
Figure 8. Encoder part inference by OpenVINO™

Decoder part inference by OpenVINO™

Figure 9. Decoder part inference by OpenVINO™
Figure 10. Decoder part inference by OpenVINO™

Inference Result

Figure 11. Inference Result
Read More...
Xiake
Sun

Accelerate Inference of Hugging Face Transformer Models with Optimum Intel and OpenVINO™

Authors: Xiake Sun, Kunda Xu

1. Introduction

Figure 1. Hugging Face Optimum Intel

Hugging Face is a large open-source community that quickly became an enticing hub for pre-trained deep learning models across Natural Language Processing (NLP), Automatic Speech Recognition(ASR), and Computer Vision (CV) domains.

Optimum Intel provides a simple interface to optimize Transformer models and convert them to OpenVINO™ Intermediate Representation (IR) format to accelerate end-to-end pipelines on Intel® architectures using OpenVINO™ runtime.

Sentimental classification, as one of the popular NLP tasks, is the automated process of identifying opinions in the text and labeling them as positive or negative. In this blog, we use DistilBERT for the sentimental classification task as an example to show how Optimum Intel helps to optimize the model with Neural Network Compression Framework (NNCF) and accelerate inference with OpenVINO™ runtime.

2. Setup Environment

Install optimum-intel and its dependency in a new python virtual environment as follow:

conda create -n optimum-intel python=3.8
conda activate optimum-intel
python -m pip install torch==1.9.1 onnx py-cpuinfo
python -m pip install optimum[openvino,nncf]

3. Model Inference with OpenVINO™ Runtime

The Optimum inference models are API compatible with Hugging Face Transformers models, which means you could simply replace Hugging Face Transformer “AutoModelXXX” class with the “OVModelXXX” class to switch model inference with OpenVINO™ runtime. Besides, you could set “from_transformers=True” when loading the model with the from_pretrained() method, loaded model will be automatically converted to an OpenVINO™ IR for inference with OpenVINO™ runtime.

Here is an example of how to perform inference with OpenVINO™ runtime for a sentimental classification task, the output of the pipeline consists of classification label (positive/negative) and corresponding confidence.

from optimum.intel.openvino import OVModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
hf_model = OVModelForSequenceClassification.from_pretrained(
    model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_pipe_cls = pipeline("text-classification",
                       model=hf_model, tokenizer=tokenizer)
text = "He's a dreadful magician."
fp32_outputs = hf_pipe_cls(text)
print("FP32 model outputs: ", fp32_outputs)

4. Model Quantization with NNCF framework

Most deep learning models are built using 32 bits floating-point precision (FP32). Quantization is the process to represent the model using less memory with minimal accuracy loss. To further optimize model performance on Intel® architecture via Intel® Deep Learning Boost, model quantization as 8 bits integer precision (INT8) is required.

Optimum Intel enables you to apply quantization on Hugging Face Transformer Models using the NNCF. NNCF provides two mainstream quantization methods - Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).

  • Post-Training Quantization (PTQ) refers to quantizing a model with a representative calibration dataset without fine-tuning.
  • Quantization-Aware Training (QAT) is applied to simulate the effects of quantization during training to mitigate its effect on the model’s accuracy

4.1. Model Quantization with NNCF PTQ

NNCF Post-training static quantization introduces an additional calibration step where data is fed through the network to compute the activations quantization parameters. Here is how to apply static quantization on a pre-trained DistilBERT using General Language Understanding Evaluation (GLUE) dataset as the calibration dataset:

from functools import partial
from optimum.intel.openvino import OVQuantizer, OVConfig
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(examples, tokenizer):
    return tokenizer(
        examples["sentence"], padding=True, truncation=True, max_length=128
    )

quantizer = OVQuantizer.from_pretrained(model)
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
    preprocess_batch=True,
)

# Load the default quantization configuration
ov_config = OVConfig()

# The directory where the quantized model will be saved
save_dir = "nncf_ptq_results"
# Apply static quantization and save the resulting model in the OpenVINO IR format
quantizer.quantize(calibration_dataset=calibration_dataset,
                   save_directory=save_dir, quantization_config=ov_config)

The quantize() method applies post-training static quantization and export the resulting quantized model to the OpenVINO™ Intermediate Representation (IR), which can be deployed on any target Intel® architecture.

4.2. Model Quantization with NNCF QAT

Quantization-Aware Training (QAT) aims to mitigate model accuracy issue by simulating the effects of quantization during training. If post-training quantization results in accuracy degradation, QAT can be used instead.

NNCF provides an “OVTrainer” class to replace Hugging Face Transformer’s “Trainer” class to enable quantization during training with additional quantization configuration. Here is an example on how to fine-tune a DistilBERT with Stanford Sentiment Treebank (SST) dataset while applying quantization aware training (QAT):

import numpy as np
import evaluate
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, default_data_collator
from optimum.intel.openvino import OVConfig, OVTrainer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(
    lambda examples: tokenizer(examples["sentence"], padding=True, truncation=True, max_length=128), batched=True
)
metric = evaluate.load("accuracy")

def compute_metrics(p): return metric.compute(
    predictions=np.argmax(p.predictions, axis=1), references=p.label_ids
)

# The directory where the quantized model will be saved
save_dir = "nncf_qat_results"

# Load the default quantization configuration
ov_config = OVConfig()

trainer = OVTrainer(
    model=model,
    args=TrainingArguments(save_dir, num_train_epochs=1.0,
                           do_train=True, do_eval=True),
    train_dataset=dataset["train"].select(range(300)),
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    ov_config=ov_config,
    feature="sequence-classification",
)
train_result = trainer.train()
metrics = trainer.evaluate()
trainer.save_model()

4.3. Comparison of FP32 and INT8 model outputs

“OVModelForXXX” class provided the same API to load FP32 and quantized INT8 OpenVINO™ models by setting “from_transformers=False”. Here is an example of how to load quantized INT8 models optimized by NNCF and inference with OpenVINO™ runtime.

ov_ptq_model = OVModelForSequenceClassification.from_pretrained(“nncf_ptq_results”, from_transformers=False)
ov_ptq_pipe_cls = pipeline("text-classification", model=ov_ptq_model, tokenizer=tokenizer)
ov_ptq_outputs = ov_ptq_pipe_cls(text)
print("PTQ quantized INT8 model outputs: ", ov_ptq_outputs)

ov_qat_model = OVModelForSequenceClassification.from_pretrained("nncf_qat_results", from_transformers=False)
ov_qat_pipe_cls = pipeline("text-classification", model=ov_qat_model, tokenizer=tokenizer)
ov_qat_outputs = ov_qat_pipe_cls(text)
print("QAT quantized INT8 model outputs: ", ov_qat_outputs)

Here is an example for sentimental classification output of FP32 and INT8 models:

Figure 2. Outputs example of FP32 model and quantized INT8 models

5. Mitigation of accuracy issue cause by saturation

8-bit instructions of old CPU generations (based on SSE,AVX-2, AVX-512 instruction sets) are prone to so-called saturation(overflow) of the intermediate buffer when calculating the dot product, which is an essential part of Convolutional or MatMul operations. This saturation can lead to a drop in accuracy when running inference of 8-bit quantized models on the mentioned architectures. The problem does not occur on GPUs or CPUs with Intel® Deep Learning Boost (VNNI) technology and further generations.

In the case a significant difference in accuracy (>1%) occurs after quantization with NNCF default quantization configuration, here is an example code to check if deployed platform supports Intel® Deep Learning Boost (VNNI) and further generations:

import cpuinfo
flags = cpuinfo.get_cpu_info()['flags']
brand_raw = cpuinfo.get_cpu_info()['brand_raw']
w = "without"
overflow_fix = 'enable'
for flag in flags:
    if "vnni" in flag or "amx_int8" in flag:
        w = "with"
        overflow_fix = 'disable'
print("Detected CPU platform {0} {1} support of Intel(R) Deep Learning Boost (VNNI) technology \
    and further generations, overflow fix should be {2}d".format(brand_raw, w, overflow_fix))

While quantizing activations use the full range of 8-bit data types, there is a workaround using only 7 bits to represent weights (of Convolutional or Fully-Connected layers) to mitigate saturation issue for many models on old CPU platform.

NNCF provides three options to deal with the saturation issue. The options can be enabled in the NNCF quantization configuration using the “overflow_fix” parameter:

  • "disable": (default) option do not apply saturation fix at all
  • "enable": option to apply for all layers in the model
  • "first_layer_only": option to fix saturation issue for the first layer

Here is an example to enable overflow fix in quantization configuration to mitigate accuracy issue on old CPU platform:

from optimum.intel.openvino.configuration import DEFAULT_QUANTIZATION_CONFIG

ov_config_dict = DEFAULT_QUANTIZATION_CONFIG
ov_config_dict["overflow_fix"] = "enable"
ov_config = OVConfig(compression=ov_config_dict)

After model quantization with updated quantization configuration with NNCF PTQ/NNCF, you can repeat step 4.3 to verify if quantized INT8 model inference results are consistent with FP32 model outputs.

Read More...
Sachin
Rastogi

An end-to-end workflow with training on Habana® Gaudi® Processor and post-training quantization and Inference using OpenVINO™ toolkit

Authors: Sachin Rastogi, Maajid N Khan, Akhila Vidiyala

Background:

Brain tumors are abnormal growths of braincells and can be benign (non-cancerous) or malignant (cancerous). Accurate diagnosis and treatment of brain tumors are critical for the patient's prognosis, and one important step in this process is the segmentation of the tumor in medical images. This involves identifying the boundaries of the tumor and separating it from the surrounding healthy brain tissue.

MRI is a non-invasive imaging technique that uses a strong magnetic field and radio waves to produce detailed images of the brain. MRI scans can provide high-resolution images of the brain, including the location and size of tumors. Traditionally, trained professionals, such as radiologists or medical image analysts, perform manual segmentation of brain tumors. However, this process is time-consuming and subject to human error, leading to the development of automated methods using machine learning.

Introduction:

As demand for deep learning applications grows for medical imaging, so does the need for cost-effective cloud solutions that can efficiently train Deep Learning models. With the Amazon EC2 DL1 instances powered by Gaudi® accelerators from Habana® Labs (An Intel® company), you can train deep learning models for medical image segmentation at a reduced cost of up to 40% than the current generation GPU-based EC2 instances.

Medical Imaging AI solutions often need to be deployed on various hardware platforms, including both new and older systems. The usage of Intel® Distribution of OpenVINO™ toolkit makes it easier to deploy these solutions on systems with Intel® Architecture.

This reference implementation demonstrates how this toolkit can be used to detect and distinguish between healthy and cancerous tissue in MRI scans. It can be used on a range of Intel® Architecture platforms, including CPUs, integrated GPUs, and VPUs, with no need to modify the code when switching between platforms. This allows developers to choose the hardware that meets their needs in terms of performance, cost, and power consumption.

The Challenge:

Identify and separate cancerous tumors from healthy tissue in an MRI scan of the brain with the best price performance.

The Solution:

One approach to brain tumor segmentation using machine learning is to use supervised learning, where the algorithm is trained on a dataset of labelled brain images, with the tumor regions already identified by experts. The algorithm can then learn to identify these tumor regions in new images.

Convolutional neural networks (CNNs) are a type of machine learning model that has been successful in image classification and segmentation tasks and are often used for brain tumor segmentation. In a CNN, the input image is passed through multiple layers of filters that learn to recognize specific features in the image. The output of the CNN is a segmented image, with each pixel classified as either part of the tumor or healthy tissue.

Another approach to brain tumor segmentation is to use unsupervised learning, where the algorithm is not given any labelled examples and must learn to identify patterns in the data on its own. One unsupervised method for brain tumor segmentation is to use clustering algorithms, which can group similar pixels together and identify the tumor region as a separate cluster. However, unsupervised learning is not commonly used for brain tumor segmentation due to the complexity and variability of the data.

Regardless of the approach used, the performance of brain tumor segmentation algorithms can be evaluated using metrics such as dice coefficient, Jaccard index, and sensitivity.

Our medical imaging AI solution is designed to be used widely and in a cost-effective manner. Our approach ensures that the accuracy of the model is not compromised while still being affordable. We have used a U-Net 2D model that can be trained using the Habana® Gaudi® platform and the Medical Decathlon dataset (BraTS 2017 Brain Tumor Dataset) to achieve the best possible accuracy for image segmentation. The model can then be used for inferencing with the OpenVINO™ on Intel® Architecture.

This reference implementation provides an AWS*cloud-based generic AI workflow, which showcases U-Net-2D model-based image segmentation with the medical decathlon dataset. The reference implementation is available for use by Docker containers and Helm chart.

Architecture Diagram

Training:

Primarily, we are leveraging AWS* EC2 DL1workflows to train U-Net 2D models for the end-to-end pipeline. We are consistently seeing cost savings compared to existing GPU-based instances across model types, enabling us to achieve much better Time-to-Market for existing models or training much larger and more complex models.

AWS*DL1 instances with Gaudi® accelerators offer the best price-performance savings compared to other GPU offerings in the market. The models were trained using the Pytorch framework.

The reference training code with detailed instructions is available here.

Inference and Optimization:

Intel® OpenVINO™ is an inference solution that optimizes and accelerates the computation of AI workloads on Intel® hardware. The trained Pytorch models were converted to ONNX (Open Neural Network Exchange) model representation format and then further optimized to the OpenVINO™ format or Intermediate representation (IR) of OpenVINO™ using the Model Optimizer tool from OpenVINO™.

TheFP32-optimized IR models outperformed using OpenVINO™ runtime in terms of throughput compared to other Deep Learning framework runtimes on the same Intel® hardware.

Asa next step, the FP32 IR model was further optimized and converted to lower8-bit precision with post-training quantization using the default quantization algorithm from the Post Training Optimization Tool (POT) from the OpenVINO™ toolkit. This inherently leads to a jump in the model’s performance, in terms of its processing speed and throughput, for you get a higher FPS when dealing with video streams with very negligible loss in accuracy.

TheINT8 IR models performed extremely well for inference on Intel® CPU(Central Processing Unit) 3rd Generation Intel® Xeon.

The reference inference code with detailed instructions is available here.

GitHub: https://github.com/intel/cv-training-and-inference-openvino/tree/main/gaudi-segmentation-unet-ptq

Developer Catalog: https://www.intel.com/content/www/us/en/developer/articles/reference-implementation/gaudi-processor-training-and-inference-openvino.html

Inference Result:

We are using OpenVINO™ Model Optimizer(MO) to convert the trained ONNX FP32 model to FP32 OpenVINO™ or Intermediate Representation(IR) format. The FP32prediction shown here is from a test image from the training dataset which was never used for training. The prediction is from a trained model which was trained for 8 epochs with 8 HPU multi-card training on an AWS* EC2 DL1 Instance with 400/484 images from the training folder.

FP32 Sample Output Prediction

Quantization (Recommended to use if you need the better performance of the model)

Quantization is the process of converting a deep learning model’s weight to a lower precision requiring less computation. This inherently leads to an increase in model performance, in terms of its processing speed and throughput, you will see a higher throughput(FPS) when dealing with video streams. We are using OpenVINO™ POT for the Default Quantization Algorithm to quantize the FP32 OpenVINO™ format model into the INT8 OpenVINO™ format model.

The INT8 prediction shown here is from a testimage from a training dataset that was never used for training. The predictionis from a quantized model which we quantized using POT with a calibrationdataset of 300 samples.

INT8 Sample Output Prediction

This application is available on the Intel® Developer Catalog for the developers to use as it is or use as a base code to bootstrap their customized solution. Intel® Developer Catalog offers reference implementations and software packages to build modular applications using containerized building blocks. Using the containerized building blocks the developers can rapidly develop deployable solutions.

Conclusion:

In conclusion, brain tumor segmentation using machine learning can help improve the accuracy and efficiency of the diagnosis and treatment of brain tumors.

There are several challenges and limitations to using machine learning for brain tumor segmentation. One of the main challenges is the limited availability of annotated data, as it is time consuming and expensive to annotate large datasets of medical images. In addition, there is a high degree of variability and complexity in the data, as brain tumors can have different shapes, sizes, and intensity patterns on MRI scans. This can make it difficult for the machine learning algorithm to generalize and accurately classify tumors in new data.

Another challenge is the potential for bias in the training data, as the dataset may not be representative of the entire population. This can lead to inaccurate or biased results if the algorithm is not properly trained or validated.

While there are still challenges to be overcome, the use of machine learning in medical image analysis shows great promise for improving patient care.

Notices & Disclaimers:

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure. 

Your costs and results may vary. 

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.   

Read More...