OpenVINO Blog

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Sort By:

Optimizing Speech Emotion Recognition: SpeechBrain Meets OpenVINO™

Authors: Pradeep Sakhamoori, Ravi Panchumarthy

Want to analyze emotions in speech recordings but keep your AI application lean and mean? This blog post dives into combining the power of SpeechBrain's pre-trained emotion recognition models with OpenVINO™ for efficient inference. We'll explore how to leverage SpeechBrain's"emotion-recognition-wav2vec2-IEMOCAP" model and optimize it for blazing-fast performance using OpenVINO™.

Getting Started with SpeechBrain
SpeechBrain is a powerful open-source toolkit for developing Conversational AI technologies, including speech recognition, speaker recognition, and emotion recognition. In this blog post, we'll explore using SpeechBrain's pre-trained "emotion-recognition-wav2vec2-IEMOCAP" model to classify emotions in speech recordings and optimize this model for efficient inference using the OpenVINO™toolkit. 

SpeechBrain's Emotion Recognition Model and the IEMOCAP Dataset:

The "emotion-recognition-wav2vec2-IEMOCAP"model is fine-tuned on the IEMOCAP dataset, which contains approximately 12hours of audiovisual data with recordings of dialogues by 10 speakers portraying various emotions, including angry, excited, fear, sad, surprised, frustrated, happy, disappointed, and neutral. 

 The model is based on the wav2vec2 architecture, combining convolutional and residual blocks. The embeddings are extracted using attentive statistical pooling, and the system is trained with Additive Margin Softmax Loss. 

Loading Custom Models with SpeechBrain's foreign_class Function:

The ‘foreign_class’ function in SpeechBrain is a utility that allows you to load and use custom PyTorch models within the SpeechBrain ecosystem. It provides a convenient way to integrate external or custom-built models into SpeechBrain's inference pipeline without modifying the core SpeechBrain codebase.

Here’s how you can load and use the “emotion-recognition-wav2vec2-IEMOCAP”model with foreign_class:

classifier = foreign_class(
# Initialize wav2vec2 torch model
torch_model = classifier.mods["wav2vec2"].model

# Run Inference
out_prob, score, index, text_lab = classifier.classify_file("speechbrain/emotion-recognition-wav2vec2-IEMOCAP/anger.wav")

print(f"Emotion Recognition with SpeechBrain PyTorch model: {text_lab}")

  • source: This argument specifies the source or location of the pre-trained model checkpoint. In this case, "speechbrain/emotion-recognition-wav2vec2-IEMOCAP"refers to a pre-trained model checkpoint on the Hugging Face Hub.
  • pymodule_file: This argument is the path to a Python file containing the definition of your custom PyTorch model class. In this example, "" is the Python file name defining theCustomEncoderWav2vec2Classifier class.
  • classname: This argument specifies the name of the custom PyTorch model class defined in the pymodule_file. In this case,"CustomEncoderWav2vec2Classifier" is the name of the class that extends SpeechBrain's Pretrained class and implements the necessary methods for inference.
  • classifier.classify_file: This is the Inference function call for emotion classification on an audio file.

Optimizing with OpenVINO™

To enhance the performance of our emotion recognition model, we leverage the OpenVINO™ toolkit.  OpenVINO™ empowers developers to write code once and deploy it across diverse Intel® hardware and environments. This includes on-premises, on-device, cloud, and browser deployments. You can also configure performance optimization parameters based on the use case, hardware, and target performance (latency/throughput). For more details, refer to OpenVINO runtime optimizations.

Refer to the OpenVINO SpeechBrain notebook for full code implementation.

Following are the key steps to optimize the model using OpenVINO™:

Step 1: Convert the model to OpenVINO format:

Below is a code snippet illustrating the conversion of the SpeechBrain PyTorch model to OpenVINO IR format using openvino.convert_model Python API.

import openvino as ov
ov_model = ov.convert_model(torch_model,example_input=input_tensor)

Step 2: Run Inference with OpenVINO™ Inference Engine:

After converting the model to OpenVINO format, compile the converted model for your target device and run inference. Below is a sample inference code snippet from the OpenVINO SpeechBrain notebook. For details on Inference Devices and Modes, see optimize-inference.

# OpenVINO Compiled model
compiled_model = core.compile_model(ov_model, device.value)

# Perform model inference
output_tensor = compiled_model(wavs)[0]


Integrating SpeechBrain’s pre-trained models with custom interfaces and optimizing them using OpenVINO™ can significantly enhance the efficiency of your AI applications. This approach not only improves model performance but also ensures seamless deployment across different hardware platforms. By following the steps outlined above, you can build a robust SpeechBrain Emotion Recognition Model optimized with OpenVINO™ runtime that is both powerful and efficient.

Call to Action:

Notices and Disclaimers:

Performance varies by use, configuration, and other factors. Learn more at​.​Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available ​updates. No product or component can be absolutely secure. ​​​​Intel technologies may require enabled hardware, software or service activation.​​​​​​​​

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.​​

No items found.

Large Language Model Graph Customization with OpenVINO™ Transformations API

April 15, 2024

Authors: Xiake Sun, Wenyi Zou, Fiona Zhao


A Large Language Model (LLM) is a type of artificial intelligence algorithm that uses deep learning techniques and massively large data sets to understand, summarize, generate and predict new content.

OpenVINO™ optimizes the deployment of LLMs, enhancing their performance and integration into various applications. We already provide general guide to use LLMs with OpenVINO, from model loading and conversion to advanced use cases.

In this blog, we will introduce some useful method to customize Large Language model’s graph with OpenVINO™ transformation API.

OpenVINO™ Runtime has three main transformation types:

  • Modelpass: straightforward way to work with ov::Model directly
  • Matcherpass: pattern-based transformation approach
  • Graphrewrite pass: container for matcher passes needed for efficient execution.
Figure1. OpenVINO™ transformations API structure overview

In this blog, we mainly use ov::pass::MatcherPassto customize model subgraph via pattern-based transformation.

Here are common steps to implement graph customization using ov::pass::MatcherPass.

  1. Create a pattern
  2. Implement a callback
  3. Register the pattern and Matcher
  4. Execute MatcherPass

In this blog, we will use an open-source LLMs Qwen1.5-7B-Chat-GPTQ-Int4 from Alibaba Cloud with guide for model conversion and graph customization methods.

Qwen Pytorch to OpenVINO™ Model conversion

Here we can use openvino.genai repo to convert Qwen1.5 GPTQ INT4 Pytroch model to OpenVINO™model.

conda create -n openvino.genai python=3.10
git clone
cd llm_bench/python
pip install -r requirements.txt
python –model_id Qwen/Qwen1.5-7B-Chat-GPTQ-Int4 --output_dir  Qwen1.5-7B-Chat-GPTQ-Int4-OV --precision FP16 

Converted model can be find in path “Qwen1.5-7B-Chat-GPTQ-Int4-OV/pytorch/dldt/GPTQ_INT4-FP16/".

Insert custom layer to OpenVINO model

Vocabularysize in the context of LLMs refers to the total number of unique words, or tokens, that the model can recognize and use. The larger the vocabulary size,the more nuanced and detailed the model’s understanding of language can be,however, it also requires more computational and memory resources for deployment.  E.g. Qwen’s vocabulary size(151936) is almost 5x that Llama2 (32000), therefore additional optimization is required for efficient deployment.

We found that the following pattern existed in the Qwen model in Figure 2:

Figure 2: Workflow to reduce MatMul computation and memory usage for logits in Qwen model

To compute the first token generation for the input prompt with shape [1, seq_length], we need to calculate a MatMul operation based on two inputs.

  • First input is a reshape node output with shape[1, seq_length, 4096]
  • Second input is a constant value that contains the model’s vocabulary with shape [4096,151936]

Then Matmul calculates two inputs [1, seq_length, 4096] * [4096,151936] to output large logits [1, seq_length,151936]. However, for the next token prediction, we only need the last element [1,4096] in 1st dimension from logits for sampling.

The main idea is to insert a slice operation between Reshape and Matmul nodes to extract only the last element in 2nd dimension of reshape node output as the first input with shape [1,4096] for computation. Therefore, Matmul computation can be reduced from [1, seq_len, 4096] * [1, 4096, 151936] = [1, seq_len, 151936] to [1, 1, 4096] *[4096, 151936] = [1, 1, 151936], which can reduce first token latency and memory consumption.

Here is a sample code to implement the workflow defined in Figure2 to reduce Qwen's last Matmul computation and memory usage:

# -*- coding: utf-8 -*-
import numpy as np
import openvino as ov
from openvino.runtime import Core, Type
from openvino.runtime.passes import Manager, MatcherPass, WrapType, Matcher
from openvino.runtime import opset10 as ops
from openvino.preprocess import PrePostProcessor

class InsertSlice(MatcherPass):
    def __init__(self):
        self.model_changed = False

        param = WrapType("opset10.Result")

        def callback(matcher: Matcher) -> bool:
            root = matcher.get_match_root()
            print("root: ", root)
            if root is None:
                return False
            root_output = matcher.get_match_value()
            print("root_output", root_output)
            root_name = root.get_friendly_name()
            if (len(root.get_output_partial_shape(0)) == 3):
                print(f"Find target root node name: {root_name}")
                parent = root.input_value(0).get_node()
                print(f"Find target parent node name: {parent.get_friendly_name()}")
                grand_parent = parent.input_value(0).get_node()
                print(f"Find grandparent node name: {grand_parent.get_friendly_name()}")
                grand_parent_output = parent.input(0).get_source_output()
                print("grand_parent_output: ", grand_parent_output)
                consumers = grand_parent_output.get_target_inputs()
                print(f"consumers: {consumers}")
                print("Original reshape node output shape:", grand_parent_output.get_partial_shape())
                start = np.array([0, -1, 0], dtype=np.int32)
                stop = np.array([1, -2, 4096], dtype=np.int32)
                step = np.array([1, -1, 1], dtype=np.int32)
                axes = np.array([0, 1, 2], dtype=np.int32)
                slice = ops.slice(grand_parent, start, stop, step, axes, name="inserted_slice")
                print("After insert slice node, output shape:", slice.output(0).get_partial_shape())

                for consumer in consumers:
                self.model_changed = True
                # Use new operation for additional matching
                return True

        self.register_matcher(Matcher(param,"InsertSlice"), callback)

if __name__ == "__main__":
    model_path = " Qwen1.5-7B-Chat-GPTQ-Int4-OV/pytorch/dldt/GPTQ_INT4-FP16/ openvino_model.xml"
    modified_model_path = "Qwen1.5-7B-Chat-GPTQ-Int4-OV/pytorch/dldt/GPTQ_INT4-FP16/modified_openvino_model.xml")
    core = Core()
    ov_model = core.read_model(model_path)
    manager = Manager()
    ov.save_model(ov_model, modified_model_path)

We defined a OpenVINO™ transformation "InsertSlice" to find the logits (Results) node via ov::pass::MatchPass, then search along root->parent->grandparent node to find the Reshape node. Afterward, we insert a Slice node between the Reshape and Matmul nodes to extract the last element of seq_length with shape [1,1,4096]. In the end, we apply "InsertSlice" transformation to original OpenVINO™ model and save modified model on disk for deployment.

Modify model weights of specified layer in OpenVINO model

In case you want to update certain model layer weights after model training or fine-tuning/compression.

E.g. if you have an INT4 weight-compressed model using another model compression method, e.g. AWQ, you may want to transfer model weights optimized with the quantization method.

The most general method will be to convert the original model to OpenVINO™ model if the model direct conversion works. However, if first option is not works out of box, an alternative option is to replace the model weights from OpenVINO™ models with external fine-tuning model weights.

Figure 3: Update model weights of OpenVINO™ Model with external fine-tuned model weights

Here we introduce a common method to modify layer weights of Qwen model via OpenVINO™ transformation API.

As Figure 3 shows, the goal is to replace model weights and scale of the original Constant node with external fine-tuned weights and scale data.

At first, we use ov::pass::MatchPass method to find the Convert node after the target node. Then we create a new constant node with external weight saved as a numpy array. Please note, GPTQ int4 model weight is saved asuint4 (U4) binary format, while numpy can only represent data with numpy.uint8. Therefore, we use a help function to pack 2 uint4 binary data as 1 uint8 binary data. Then we replace the Convert input port from the original Constant node to the new Constant node.  Since the old constant node has no consumers and is neither the Result nor the Sink operation whose shared pointer counter is zero, the operation will be destructed and not be accessible anymore.

Here is a sample code to implement the workflow defined in Figure3 to replace Qwen Constant node via the new Constant node with external data:

# -*- coding: utf-8 -*-
import numpy as np
import openvino as ov
import torch 
from openvino.runtime import Core, Model, Type
from openvino.runtime.passes import Manager, GraphRewrite, MatcherPass, WrapType, Matcher
from openvino.runtime import opset10 as ops
from openvino.helpers import pack_data, unpack_data
pytorch_to_ov_layer_mapping = [{"__module.model.layers.0.mlp.down_proj/aten::to/Convert": }]
packed_layername_tensor_dict_list = [{"name":"__module.model.layers.0.mlp.down_proj/aten::to/Convert","value":np.ones([1376*4, 4096],dtype=np.uint8)}]

class InsertWeights(MatcherPass):
    def __init__(self,packed_layername_tensor_dict_list):
        self.model_changed = False

        param = WrapType("opset10.Convert")

        def callback(matcher: Matcher) -> bool:
            root = matcher.get_match_root()
            if root is None:
                return False
            root_output = matcher.get_match_value()
            for y in packed_layername_tensor_dict_list:
                #root_name = root.get_friendly_name().replace('.','_')
                root_name = root.get_friendly_name()
                print(f"root_name: {root_name}")
                if root_name.find(y["name"]) != -1 :
                    consumers = root.input_value(0).get_target_inputs()
                    unpacked_data = unpack_data(y["value"],Type.u4,y["value"].shape)
                    new_weights = ops.constant(np.zeros(root.get_output_shape(0)),Type.u4,name=y["name"]+"_new_const")
                    print("new_weights: ", new_weights)
          [:] = unpacked_data.ravel()
                    print(f"new_weights.shape: {new_weights.shape}")
                    for consumer in consumers:

                    # For testing purpose
                    self.model_changed = True
                    # Use new operation for additional matching

            return True

        self.register_matcher(Matcher(param,"InsertWeights"), callback)

if __name__ == "__main__":
    model_path = "Qwen1.5-7B-Chat-GPTQ-Int4-OV/pytorch/dldt/GPTQ_INT4-FP16/openvino_model.xml"
    modified_model_path = "Qwen1.5-7B-Chat-GPTQ-Int4-OV/pytorch/dldt/GPTQ_INT4-FP16/modified_openvino_model.xml")
    core = Core()
    ov_model = core.read_model(model_path)
    manager = Manager()
    ov.save_model(ov_model, modified_model_path)

We defined a OpenVINO™ transformation "InsertWeights" to find the target constant node via ov::pass::MatchPass, then we create a new Constat node with external numpy data and pack it as uint4 OpenVINO™ Tensor to replace original constant node in graph. In the end, we apply "InsertWeights" transformation to original OpenVINO™ model and save modified model on disk for deployment.


In this blog, we introduce how to apply graph customization based on OpenVINO™ model with OpenVINO™ transformation API. Furthermore, we show two examples of inserting layers & modifying layer weights based on Qwen LLM model with simple Python code.



OpenVINO™Transformation API

IntegrateOpenVINO™ with Your Application – Model Representation


Optimizing Whisper and Distil-Whisper for Speech Recognition with OpenVINO and NNCF

January 29, 2024

Authors: Nikita Savelyev, Alexander Kozlov, Ekaterina Aidova, Maxim Proshin


Whisper is a general-purpose speech recognition model from OpenAI. The model can transcribe speech across dozens of languages and even handle poor audio quality or excessive background noise. You can find more information about this model in the research paper, OpenAI blog, model card and GitHub repository.

Recently, a distilled variant of the model called Distil-Whisper has been proposed in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling. Compared to Whisper, Distil-Whisper runs several times faster with 50% fewer parameters, while performing to within 1% word error rate (WER) on out-of-distribution evaluation data.

Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It maps a sequence of audio spectrogram features to a sequence of text tokens. First, the raw audio inputs are converted to a log-Mel spectrogram by action of the feature extractor. Then, the Transformer encoder encodes the spectrogram to form a sequence of encoder hidden states. Finally, the decoder autoregressively predicts text tokens, conditional on both the previous tokens and the encoder's hidden states.

You can see the model architecture in the diagram below:

In this article, we would like to demonstrate how to improve Whisper and Distil-Whisper inference speed with OpenVINO for Intel hardware. Additionally, we show how to make models even faster by applying 8-bit Post-training Quantization with Neural Network Compression Framework (NNCF). In the end we present evaluation results from accuracy and performance standpoints on a large-scale dataset.

All code snippets presented in this article are from the Automatic speech recognition using Distil-Whisper and OpenVINO Jupyter notebook, so you can follow along.

Converting Model to OpenVINO format

We are going to load models from Hugging Face Hub with the help of Optimum Intel library which makes it easier to load and run OpenVINO-optimized models. For more details, pleaes refer to the Hugging Face Optimum documentation.

For example, the following code loads the Distil-Whisper large-v2 model ready for inference with OpenVINO.

from import OVModelForSpeechSeq2Seq

model_id = "distil-whisper/distil-large-v2"
model_path = Path(model_id)
if not model_path.exists():
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_id, export=True, compile=False, load_in_8bit=False)
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_path, compile=False)

Models from the Distil-Whisper family are available at Distil-Whisper Models collection and Whisper models are available at OpenAI Hugging Face page.

To transcribe an input audio with the loaded model, we first compile the model to the device of choice and then call generate() method on input features prepared by corresponding processor.

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_id)"AUTO")

# ... load input audio and reference text
input_features = processor(input_audio).input_features
predicted_ids = ov_model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Reference: {reference_text}")
print(f"Result: {transcription}")

The output is the following. As you can see the transcription equals the reference text.

Result:  Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.

Running Post-Training Quantization with NNCF

NNCF enables post-training quantization by adding quantization layers into the model graph and then using a subset of the training dataset to initialize parameters of these additional quantization layers. During quantization, some layers (e.g., MatMuls, Convolutions) are transformed to be executed in INT8 instead of FP16/FP32. If a quantized operation is parameterized then its corresponding weight variable is also converted to INT8.

In general, the optimization process contains the following steps:

  1. Create a calibration dataset for quantization.
  2. Run nncf.quantize() to obtain quantized encoder and decoder models.
  3. Serialize the INT8 models using openvino.save_model() function.

Whisper model consists of an encoder and decoder submodels. Furthermore, for the decoder model its forward() signature is different for the first call compared to all subsequent calls. During the first call, key-value cache is empty and is not needed for decoder inference. Starting from the second call, key-value cache is fed to the decoder. Because of this, these two cases are represented by two separate OpenVINO models: openvino_decoder_model.xml and openvino_decoder_with_past_model.xml. Since the first decoder model is inferred only once it does not make much sense to quantize it. So, we apply quantization to the encoder and the decoder with past models.

The first step towards quantization is collecting calibration data. For that, we need to collect some number of model inputs for both models. To do that, we patch OpenVINO model request objects with an InferRequestWrapper class instance that will intercept model inputs during inference and store them in a list. We infer the model on about 50 samples from validation split of librispeech_asr dataset.

def collect_calibration_dataset(ov_model: OVModelForSpeechSeq2Seq, calibration_dataset_size: int):
    # Overwrite model request properties, saving the original ones for restoring later
    original_encoder_request = ov_model.encoder.request
    original_decoder_with_past_request = ov_model.decoder_with_past.request
    encoder_calibration_data = []
    decoder_calibration_data = []
    ov_model.encoder.request = InferRequestWrapper(original_encoder_request, encoder_calibration_data)
    ov_model.decoder_with_past.request = InferRequestWrapper(original_decoder_with_past_request,

    calibration_dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
    for sample in islice(calibration_dataset, calibration_dataset_size):
        input_features = extract_input_features(sample)

    ov_model.encoder.request = original_encoder_request
    ov_model.decoder_with_past.request = original_decoder_with_past_request

    return encoder_calibration_data, decoder_calibration_data

With the collected calibration data for encoder and decoder models we can proceed to quantization itself. Let's examine the quantization call for the encoder model. For the decoder model, it is similar.

quantized_encoder = nncf.quantize(
    ov_model.encoder.model,                     # ov.Model object of the encoder model
    nncf.Dataset(encoder_calibration_data),     # calibration data wrapped in a nncf.Dataset object
    subset_size=len(encoder_calibration_data),  # number of samples to calibrate on (all are chosen)
    model_type=nncf.ModelType.TRANSFORMER,      # providing the information that Whisper encoder is of
    # a Transformer architecture
    advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.50)    # Smooth Quant 
    # algorithm reduces activation quantization error; optimal alpha was obtained through grid search
ov.save_model(quantized_encoder, quantized_model_path / "openvino_encoder_model.xml")

After both models are quantized and saved, the quantized Whisper model can be loaded and run the same way as shown previously. Comparing the transcriptions produced by original and quantized models results in the following.

Original :  Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.
Quantized:  Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.

As you can see for the quantized distil-whisper-large-v2 transcription is the same.

Evaluating on Common Voice Dataset

We evaluate Whisper and Distil-Whisper large-v2 model variants on a Common Voice 13.0 speech-to-text dataset. We use en/test split containing 16372 audio samples amounting to about 27 hours of recordings.

The evaluation is done across three model types: original PyTorch model, original OpenVINO model and quantized OpenVINO model. Additionally, we run tests on three Intel CPUs: Cascade Lake Intel(R) Core(TM) i9-10980XE, Ice Lake Intel(R) Xeon(R) Gold 6338 and Sapphire Rapids Intel(R) Xeon(R) Gold 6430L.

For all combinations above we measure transcription time and accuracy. When measuring time for a model we sum up generate() call durations for all audio samples. Transcription accuracy is represented as Accuracy = (100 - WER), WER stands for Word Error Rate. We compute accuracy for each audio sample and then take the average value across the dataset. The results are given in the table below.

Please note that we report transcription time in relative terms such that the values for each CPU are normalized over its corresponding column. The duration of audio data in the dataset is 27.06 hours and the absolute transcription time values for Whisper large-v2 PyTorch on each CPU are:

  • 20.35 hours for Core i9-10980XE
  • 14.09 hours for Xeon Gold 6338
  • 15.03 hours for Xeon Gold 6430L

Based on the results we can conclude that:

  1. OpenVINO models execute 1.4x - 5.1x faster than PyTorch models with pretty much the same accuracy across all cases.
  2. When compared to original PyTorch models, quantized OpenVINO models provide 2.1x - 6.1x performance boost with 1-2% accuracy drop.

NOTE: in terms of this article we focus on presenting performance values. Accuracy of quantized models can be improved with a more careful selection of calibration data.

Notices and Disclaimers:

Performance varies by use, configuration, and other factors. Learn more at Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure. Intel technologies may require enabled hardware, software or service activation.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Test Configuration: Intel® Core™ i9-10980XE CPU Processor at 3.00GHz with DDR4 128 GB at 3000MHz, OS: Ubuntu 20.04.3 LTS; Intel® Xeon® Gold 6338 CPU Processor at 2.00GHz with DDR4 256 GB at 3200MHz, OS: Ubuntu 20.04.3 LTS; Intel® Xeon® Gold 6430L CPU Processor at 1.90GHz with DDR5 1024 GB at 4800MHz, OS: Ubuntu 20.04.6 LTS. Testing was performed using distil-whisper-asr notebook for model export and whisper evaluation notebook for model evaluation.

The test was conducted by Intel in December 2023.


We demonstrated how to load and run Whisper and Distil-Whisper models for audio transcription task with OpenVINO and Optimum Intel, and how to perform INT8 post-training quantization of these models with NNCF. Further we evaluated these models on a large scale speech-to-text dataset across multiple CPU devices. The evaluation results show a significant performance boost of OpenVINO vs PyTorch models without loss of transcription quality, and even a larger boost with a tolerable accuracy drop when we apply INT8 quantization.


Use Encrypted Model with OpenVINO

November 9, 2023

Deploying deep-learning capabilities to edge devices can present security challenges like ensuring inference integrity, or providing copyright protection of your deep-learning models. OpenVINO provide a simple method with crypto algorithm to protect model in disk. Model encryption, decryption and authentication are not provided by OpenVINO but can be implemented with third-party tools (i.e., OpenSSL). In this example, we use AES-128-cbc algorithm in OpenSSL to demonstrate the model cryptography.

As you can see the mechanism in below image, there are two part to process:

  1. First is to encrypt your plain IR model into encrypted model.
  2. The second part is to use the same password key and IV which used for encryption before to decrypt model at model loading runtime.
The schema of model encryption and decryption by OpenVINO

Step 1: Encrypt model

Make sure you install the OpenSSL and boost, for example in Ubuntu:

$ sudo apt install openssl libboost-dev

Then use command line to do model encryption by OpenSSL AES-128-CBC algorithm. In this simply example, I use same password for Key and IV, it is hexadecimal of string "openvino encrypt". You can use some online str2hex tool to generate hex representation of your string password.

$ openssl enc -aes-128-cbc -in openvino_model.xml -out openvino_model_enc.xml -K 6f70656e76696e6f20656e6372797074 -iv 6f70656e76696e6f20656e6372797074
$ openssl enc -aes-128-cbc -in openvino_model.bin -out openvino_model_enc.bin -K 6f70656e76696e6f20656e6372797074 -iv 6f70656e76696e6f20656e6372797074

Step 2: Decrypt model

Here provide the sample code to read encrypted model into buffer and decrypt to plain model binary. Then read and compile model.

#include <fstream>
#include <iostream>
#include <vector>
#include <cmath>
#include <cctype>
#include <string>
#include <openvino/runtime/core.hpp>
#include <openssl/aes.h>
#include <boost/algorithm/hex.hpp>

using namespace std;

vector<unsigned char> aes_128_cbc_decrypt(
    vector<unsigned char> &cipher,
    std::vector<unsigned char> &key,
    std::vector<unsigned char> iv) {

    AES_KEY ctx;
    AES_set_decrypt_key(, 128, &ctx);
    std::vector<uint8_t> plain;
    //cipherLen = clearLen + 16 - (clearLen mod 16)
    int plain_size = ceil(cipher.size()/16)*16; //make sure alloc buffer is enough to plain_size
    std::cout << "AES_cbc_encrypt start:" << std::endl;
    AES_cbc_encrypt(,, plain.size(), &ctx,, AES_DECRYPT);
    std::cout << "AES_cbc_encrypt done" << std::endl;
    return plain;

void decrypt_file(std::ifstream & stream,
                  std::vector<unsigned char> & key,
                  std::vector<unsigned char> & iv,
                  std::vector<uint8_t> & result) {
    std::vector<unsigned char> cipher((std::istreambuf_iterator<char>(stream)),  std::istreambuf_iterator<char>());
    std::cout << "aes_128_cbc_decrypt" << std::endl;
    std::vector<unsigned char> decrypt_model = aes_128_cbc_decrypt(cipher, key, iv);
    result = decrypt_model;


int main() {
    std::string key_hex = "6f70656e76696e6f20656e6372797074";
    std::string iv_hex = "6f70656e76696e6f20656e6372797074";
    std::vector<unsigned char> key_bytes;
    std::vector<unsigned char> iv_bytes;
    boost::algorithm::unhex(key_hex, std::back_inserter(key_bytes));
    boost::algorithm::unhex(iv_hex, std::back_inserter(iv_bytes));
    std::vector<uint8_t> model_data, weights_data;
    std::ifstream model_file("openvino_model_enc.xml",std::ios::in | std::ios::binary), weights_file("openvino_model_enc.bin",std::ios::in | std::ios::binary);
    // Read model files and decrypt them into temporary memory block
    std::cout << "decrypt file" << std::endl;
    decrypt_file(model_file, key_bytes, iv_bytes, model_data); //key & iv is the same
    decrypt_file(weights_file, key_bytes, iv_bytes, weights_data);
    ov::Core core;
    // Load model from temporary memory block
    std::string str_model(model_data.begin(), model_data.end());
    std::unique_ptr<ov::InferRequest> infer_request= std::make_unique<ov::InferRequest>(core.compile_model(str_model,ov::Tensor(ov::element::u8, {weights_data.size()},,"CPU").create_infer_request());
    std::cout << "compile success" << std::endl;
    return 0;

CMakeLists.txt file like below for compiling:

cmake_minimum_required(VERSION 3.5)
set(CMAKE_BUILD_TYPE "Release" CACHE STRING "CMake build type")
add_compile_options(-O3 -march=native -Wall)

find_package(OpenVINO REQUIRED)
find_package(OpenSSL REQUIRED)
find_package(Boost REQUIRED)

add_executable(model_crypto main.cpp)
target_include_directories(model_crypto PRIVATE ${OV_INCLUDE_DIR} )
target_link_libraries(model_crypto PRIVATE openvino::runtime OpenSSL::SSL Boost::headers)

This blog just provide an example of model encryption by OpenSSL. This method can only protect you model in disk, for total memory crypto, you can refer technologies like OpenVINO™ Security Add-on in virtual machine to provide an isolated environment for security sensitive operations, and use Intel® SGX (Software Guard Extensions) which allows developers to split a computer's memory into private, predefined, highly secure areas called enclaves, which better protect sensitive information.

  1. OpenVINO model protection:
  2. OpenVINO™ Security Add-on:
  3. OpenSSL official website:

Intel® DL Streamer Optimize Media-AI pipeline on Intel® Data Center Flex dGPU by Docker

December 14, 2022

Authors Kunda Xu, Wenyi Zou


In this blog is about How to use DL-streamer to build a complete Media-AI pipeline (Including: Video Access, Media Decode, AI Inference, Media Encode and Result Export). And the pipeline will be accelerated by OpenVINO™ and optimize to run on Flex dGPU(Intel® Data Center Flex dGPU)


- DL-streamer
Intel® Deep Learning Streamer (Intel® DL Streamer)Pipeline Framework is an easy way to construct media analytics pipelines using Intel® Distribution of OpenVINO™ Toolkit. It leverages the open source media framework GStreamer to provide optimized media operations and Deep Learning Inference Engine from OpenVINO™ Toolkit to provide optimized inference.

- OpenVINO
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference which can boost deep learning performance in computer vision, automatic speech recognition, natural language processing and other common task.

- Docker (Optional)
Docker is an open-source platform that enables developers to build, deploy, run, update, and manage containers—standardized, executable components that combine application source code with the operating system (OS) libraries and dependencies required to run that code in any environment.

Install DL-Streamer and OpenVINO™ via Docker

Images for Intel® Data Center GPU Flex Series

Images 2023.0.0-ubuntu22-gpu682* are intended for Intel® Data Center GPU Flex Series and include

1.     Intel®DL Streamer 2023.0.0

2.    OpenVINO™ Toolkit 2023.0.0

3.    Drivers for Intel® Data Center GPU Flex Series, drivers version 682.14

Two images are listed below, images -devel additionally contain samples and development files

Runtime image that includes GStreamer* Pipeline Framework elements, elements built with Intel® oneAPI DPC++/C++ Compiler

docker pull intel/dlstreamer:2023.0.0-ubuntu22-gpu682-dpcpp

Developer image that builds on runtime image containing samples, development files and a model downloader, built with Intel® oneAPI DPC++/C++ Compiler

docker pull intel/dlstreamer:2023.0.0-ubuntu22-gpu682-dpcpp-devel

Taking “dlstreamer:2023.0.0-ubuntu22-gpu682-dpcpp” docker images as a sample to show how to pull the docker image from docker hub.

docker pull intel/dlstreamer:2023.0.0-ubuntu22-gpu682-dpcpp
Fig 1. docker pull images from docker hub

DL-Streamer Media-AI pipeline quick start example

Make sure the pre-requirement had already installed, there is a very basic introduction to using object detection models(yolov5) to build a DL-streamer pipeline.

Step 1.Download video and yolov5s model file

Download video

curl -L -o people_walking_sample.mp4

Download yolov5s-416_INT8 model from pipeline-zoo-models

mkdir yolov5s-416_INT8 && cd yolov5s-416_INT8

Step 2.Enter Docker and copy the files into docker container

Create and enter the docker container

docker run -it --device /dev/dri/ --user root --rm intel/dlstreamer:2023.0.0-ubuntu22-gpu682-dpcpp

Open another terminal for file copy into container ,copy video and model into docker container

sudo docker cp yolov5s-416_INT8/ <Docker CONTAINER ID>:/home/dlstreamer
docker cp people_walking_sample.mp4 <Docker CONTAINER ID>:/home/dlstreamer

Step 3. Run an object detection Media-AI pipeline

By the following script, we can run pipeline the Media-AI objection detection on the Flex dGPU in the docker container.

gst-launch-1.0 filesrc location=/path/to/people_walking_sample.mp4 ! decodebin !  capsfilter caps="video/x-raw(memory:VASurface)" ! gvadetect model=/path/to/yolov5s-416_INT8/yolov5s.xml model_proc=/path/to/yolov5s-416_INT8/yolo-v5.json inference-interval=1 device=GPU.0 batch-size=32 pre-process-backend=vaapi-surface-sharing ! queue ! gvatrack tracking-type=short-term-imageless ! gvafpscounter ! fakesink sync=false
Figure 2. DL-streamer run pipeline on the dGPU

If want to encode the detection result and save as video file, can use the follow script

gst-launch-1.0 filesrc location=/path/to/people_walking_sample.mp4 ! decodebin !  capsfilter caps="video/x-raw(memory:VASurface)" ! gvadetect model=/path/to/yolov5s-416_INT8/yolov5s.xml model_proc=/path/to/yolov5s-416_INT8/yolo-v5.json inference-interval=1 device=GPU.0 batch-size=32 pre-process-backend=vaapi-surface-sharing ! queue ! gvatrack tracking-type=short-term-imageless ! meta_overlay device=GPU ! gvafpscounter ! vaapipostproc ! vaapih265enc rate-control=cbr bitrate=6144  ! filesink location=./encoded_video_track.265 sync=false

The encoded video file will save in the container and can be copied out in new terminal.

docker cp <Docker CONTAINER ID>:/home/dlstreamer encoded_video_track.265 .

Figure 3. DL-streamer yolov5s pipeline result

PS. Instruction about DL-streamer CLI parameter

decodebin: Auto-magically constructs a decoding pipeline using available decoders and demuxers via auto-plugging.

vaapipostproc: Consists in various post processing algorithms to be applied to VA surfaces. For e.g. scaling, deinterlacing (bob, motion-adaptive, motion-compensated), noise reduction or sharpening.

gvadetect: Performs object detection on a full-frame or region of interest (ROI) using object detection models such as YOLO v3-v5, MobileNet-SSD, Faster-RCNN etc. Outputs the ROI for detected objects.

gvatrack: Performs object tracking using zero-term, zero-term-imageless, or short-term-imageless tracking algorithms. Zero-term tracking assigns unique object IDs and requires object detection to run on every frame. Short-term tracking allows to track objects between frames, there by reducing the need to run object detection on each frame. Imageless tracking forms object associations based on the movement and shape of objects, and it does not use image data.

gvafpscounter: Measures frames per second across multiple streams in a single process.

Tuning Tips

Users can refer the different platform using case which were supported by OpenVINO™ and the device profiling API to realize performance tuning of your inference program between CPU, iGPU, dGPU. It will also be helpful to developer finding out the place where has the potential space of performance improvement.


Build OpenVINO on Kylin OS Guide

Authors: Tong Qiu, Wenyi Zou

Kylin is an operating system based on Linux, developed by academics at the National University of Defense Technology in the People's Republic of China. For more information about Kylin OS, please visit the Wikipedia page at Kylin. In the following sections, we will provide a step-by-step guide to building and running OpenVINO on the Kylin Operating System.


The version of Kylin OS we are using is “Kylin HostOS V10”, with the specific version being “V10 (Helium)”. You may obtain this information by executing the command:

cat /etc/*release 

We build OpenVINO using GCC 10.3.1, CMake 3.26.0, and Python 3.9.9, which can all be installed by default using command lines. Next, we will demonstrate how to install these necessary dependencies.

Install Build Dependencies

Instead of executing the ./ script referenced in  Build OpenVINO™ Runtime for Linux systems, you can directly install the build dependencies using the following command lines:

yum update && yum install -y \
   file \
   cmake3 \
   ninja-build \
   scons \
   gcc \
   gcc-c++ \
   make \
   git \
   fdupes \
   rpm-build \
   tbb-devel \
   libva-devel \
   snappy-devel \
   python3-pip \

Setup Python Virtual Environment

The next step is to create and activate a Python virtual environment. While this step is optional, we strongly recommend it to ensure better management of your project's dependencies.

python3 -m venv openvino_env
source openvino_env/bin/activate

Following the completion of the steps to build OpenVINO within the Python virtual environment, you can activate OpenVINO alongside the Python virtual environment each time by executing the source command.

Build OpenVINO with CMake 3.26.0 and GCC 10.3.1

Now we've reached the step to build OpenVINO. First, clone the OpenVINO repository and update its submodules.

git clone
cd openvino && git submodule update --init --recursive

Next, install the Python dependencies that are required for building Python wheels.

python3 -m pip install -U pip 
python3 -m pip install -r ./src/bindings/python/wheel/requirements-dev.txt

Then, create the build directory.

mkdir build && cd build

To build OpenVINO with CMake, start by using the command provided below. For enhanced performance, it is recommended to append the -DCMAKE_CXX_FLAGS=-march=native to your command, as this will enable the compiler to optimize the build for your specific hardware by using all supported instruction subsets. Additionally, if you require a Python wheel, include the corresponding build option. Remember to tailor the CMake parameters to fit your particular requirements.

cmake -DCMAKE_BUILD_TYPE=Release .. -DCMAKE_CXX_FLAGS=-march=native -DENABLE_PYTHON=ON -DPython3_EXECUTABLE=/usr/bin/python3 -DENABLE_WHEEL=ON
cmake --build . --parallel

Once the build process is complete, you can install the generated wheel using the pip command.

pip install <openvino_repo>/build/wheel/openvino-*.whl


Quick test for built openvino runtime and openvino-dev tools

You can quickly verify your built and installed OpenVINO setup. Start by creating a model directory and installing the dependencies for the model optimizer.

mkdir ~/ov_models
pip install onnx protobuf openvino-dev[pytorch]

Next, download the resnet50 pytorch model using omz_downloader.

omz_downloader --name resnet-50-pytorch -o ~/ov_models/

Then, convert resnet50 pytorch model to OpenVINO FP32 IR via omz_converter.

omz_converter --name resnet-50-pytorch -o ~/ov_models/ -d ~/ov_models/

Finally, execute benchmark_app with resnet50 FP32 IR model on CPU.

benchmark_app -m ~/ov_models/public/resnet-50-pytorch/FP32/resnet-50-pytorch.xml -d CPU

Additional Details for OpenVINO Setup

If you prefer to build OpenVINO with a different compiler, such as clang, you can modify the CMake configuration step accordingly. To build with the clang compiler, please refer to the website at Clang - Getting Started for instructions on installation and setup. Below is an example of a CMake generation command that specifies clang as the compiler:

cmake -S .. -DCMAKE_C_COMPILER=clang --DCMAKE_CXX_COMPILER=clang++ 



Creating AI Pipeline for Cell Image Analysis: Insights, Challenges, and CHO Use Case (Part 1 of 2, Intel Edge AI in the Realm of Biopharma and Drug Development)

April 23, 2024

In the ever-evolving landscape of biopharmaceutical technology and drug development, a recent effort in the field of Cell Analytics for Monoclonal Antibody Production has shed light on the crucial role of Edge AI Technology in navigating complex challenges of scaling and producing solutions.  

In this 2-part blog series, we will explore the use of Intel Edge AI Technology in biopharma and drug development, addressing challenges and providing insights into the development of AI pipelines for cell segmentation and analysis.

Intel has been involved in this process with a variety of partners. One of Intel’s contributions to the cell image project centers around processing brightfield1 images using an AI pipeline containing multiple deep learning models. The pipeline's purpose is to identify cells and other biological components and provide feedback on dynamic biological characteristics such as cell morphology, viability, and phenotypic changes, among others. Throughout this process, working on cell-AI projects usually brings a unique set of challenges to the forefront.  

First, it is an interdisciplinary field and the knowledge gap between data scientists and biopharma experts requires more back-and-forth clear communications for planning and validity checks. Frequently when attempting to implement AI solutions in the laboratory, data scientists and bench scientists struggle to fully grasp the nature and needs of each other’s role. This lack of mutual understanding can also hinder the usability and scalability of an AI solution needing to be integrated into diverse lab environments.

The second challenge is instrument variability. Different plate reader2 microscopes have different hardware, optics, and apertures which cause their produced images not to be consistent. This adds an extra layer of work to assess and address these inconsistencies along the way (like regular tracked calibration and adjustment). Additionally, equipment vendor-to-vendor differences, culture temperature, medium conditions, and genetic modifications can all affect the variability of data and the inherent transferability of the deep learning pipeline. This would drive the need to monitor the performance of DL models at the edge and cloud ML ops components.    

The third challenge is obtaining peer-review labels because the process is based on supervised Machine Learning and obtaining clean accurate labels is very costly and time-consuming.  

And the last challenge is about the model deployment. In most cases, cloud deployment is not an option due to data size and data privacy. Produced images from plate reader microscopes are huge and transferring data to the cloud and sending the results back would create high latency because a huge amount of data must be streamed (30Gb per hour). And more importantly, laboratories are usually not willing to share the data. Due to these two constraints, cloud deployments are not usually an option, and the pipeline must be deployed at the edge.  

Now, let’s talk about a specific application of this technology: the CHO Cell Segmentation Use Case.

CHO Cell Segmentation Use Case

CHO cells, or Chinese Hamster Ovary cells, are a cornerstone in the production of complex protein molecules such as monoclonal antibodies, fusion proteins, hormones, and coagulation factors. Unlike stem cells or CAR-T cells, where the cells themselves are the therapeutic product, in CHO cells, it is the proteins they produce that are of paramount importance. Monitoring the health, viability, and production capability of these cells is a critical step in commercial protein production.

Traditionally, assessing the condition of CHO cells involves a multi-step process that is not only time-consuming but also requires the use of expensive reagents and chemicals. Depending on the process, the workflow can be something like below.  

  1. Culture cells  
  1. Fix cells – wash in expensive reagents to remove the culture medium.  
  1. Permeabilization – wash in more expensive chemicals to permeabilize the cell membrane (to stain for intercellular proteins).
  1. Blocking – incubate cells in another expensive reagent to prevent binding of no specific antibodies.
  1. Primary Antibody Incubation – antibody specifically to bind to a protein that is being produced.
  1. Washing – removing unbound Primary Antibodies using more expensive chemicals.
  1. Nuclear staining – use nuclear stain like DAPI to visualize cell nuclei then wash with the same chemicals from the washing step
  1. Mounting – get ready to read in the microscope (plate reader1)
  1. Imaging – Stained cells …. count them up and determine the state in the protein production cycle and relative cell health (eventually they peter out and stop producing and the batch needs to be flushed. (Cell count, viability number, etc. are the output not the image)

From culturing to imaging, each step plays a vital role in ensuring the quality of the protein product. However, with the advent of AI and deep learning, there is an opportunity to streamline this workflow significantly. Using an AI pipeline including multiple Deep Learning models and data pre and post-processing, we can go from Step 1 directly to Step 9, removing the majority of the labor and latency in getting actionable results out of a staining workflow and bypassing expensive specialty chemicals requirement. Intel has put together a reference implementation for deploying said pipeline and inferencing of these images on the edge as part of the Cell Image project OpenVINO Toolkit, OpenVINO Model Server, and AI Connect for Scientific Data are used in this design. Let’s briefly talk about each of these wonderful SW packages in part 2 of this article series. Stay tuned!


In conclusion, the integration of Intel Edge AI Technology into the biopharmaceutical sector represents a transformative step towards more efficient and scalable drug development processes. As we have seen in this first installment of our blog series, the deployment of AI pipelines for cell segmentation and analysis in monoclonal antibody production is not without its challenges. These include bridging the interdisciplinary knowledge gap, managing instrument variability, acquiring peer-reviewed labels, and overcoming the hurdles associated with model deployment.

Despite these challenges, the potential benefits of Edge AI in biopharma are substantial. By leveraging Intel's advanced AI technologies, we can significantly reduce the time and cost associated with traditional cell analysis methods, while also enhancing the accuracy and reliability of the results. The use of edge computing addresses the concerns of data size and privacy, allowing for real-time processing and analysis without the need for cloud transfer.

As we move forward in this blog series, we will delve deeper into the specifics of Intel's Edge AI solutions, including the OpenVINO toolkit, OpenVINO Model Server, and AI Connect for Scientific Data. We will explore how these tools are being applied in real-world scenarios to drive innovation and improve outcomes in the realm of biopharma and drug development in the next part of this series.

Reach out to Intel's Health and Life Sciences team at or learn more about what we do at


We'd like to hear from you! Let us know in the comments or discuss – which AI use cases in health and life sciences do you think will have the greatest impact on global health?

If you enjoyed hearing from the Health and Life Sciences team and want to hear more, give this post a like and ensure you subscribe to get the latest updates from the team. 


About the Author

Nooshin Nabizadeh has Ph.D. in Electrical and Computer Engineering from the University of Miami and works at Intel Corporation as AI Solutions Architect. She enjoys photography, writing poetry, reading about psychology and philosophy, and optimizing solutions to run as fast as possible on a given piece of hardware. Connect with her on LinkedIn by mentioning this blog.

  1. Brightfield microscopy is a widely used technique for observing the morphology of cells and tissues.
  1. A plate reader is a laboratory instrument used to obtain images from samples in microtiter plates. The reader shines a specific calibrated frequency of light (UV, visible, fluorescence, etc.) through the samples in the wells of the plate. Plate reader microscopy data sets have inherent variability which drives the requirement of regular tracked calibration and adjustment.


Creating AI Pipeline for Cell Image Analysis: Intel Edge AI SW Solutions (Part 2 of 2, Intel Edge AI in the Realm of Biopharma and Drug Development)

Welcome back to our blog series on"Intel Edge AI in the Realm of Biopharma and Drug Development." Inthe first installment, we discussed the importance of Cell Analytics forAntibody Production in biopharmaceutical technology and drug development. Wehighlighted how AI pipelines are used to process brightfield images of cells,providing insights and addressing challenges in this field. Specifically, weexplored the CHO Cell Segmentation Use Case and noted that Intel has developeda reference implementation for deploying the CHO Cell Segmentation pipelineusing Intel edge AI software solutions.

Now, let's delve deeper into thespecifics of these Edge AI solutions: the OpenVINO toolkit, OpenVINO ModelServer, and AI Connect for Scientific Data. We'll explore how each of thesetools can play a crucial role in advancing biopharma and drug development.



optimizes, tunes, and runscomprehensive deep learning inferencing on general-purpose Intel architecture. Itis an open-source toolkit that accelerates AI inference with lower latency andhigher throughput while maintaining accuracy, reducing model footprint,and optimizing hardware use. It streamlines AI development and integration ofdeep learning in domains like computer vision, large language models (LLM), andgenerative AI.​

At the core of the OpenVINOtoolkit, we have the OpenVINO Runtime that loads and runs the models.The run-time employs plugins that are responsible for efficiently executing low-leveloperations that the deep learning model has on Intel HW. We have differentplug-ins for different HW, like CPU plugins, GPU plugins, and heterogeneousplugins.

The CPU plugin achieves a highperformance of neural networks on the CPU, using the Intel® Math Kernel Libraryfor Deep Neural Networks (Intel® MKL-DNN).

The GPU plugin uses the Intel® ComputeLibrary for Deep Neural Networks (clDNN) to infer deep neural networks on GPUs.

The heterogeneous plugin enablescomputing the inference of one network on several devices. The purposes ofexecuting networks in heterogeneous mode are to:

·        Utilize the power of accelerators to processthe heaviest parts of the network and to execute unsupported layers on fallbackdevices like the CPU.

·        Utilize all available hardware moreefficiently during one inference.


Another part of the OpenVINO toolkitis the model optimizer which optimizes and converts the model frompopular deep learning frameworks like TensorFlow, PyTorch, and ONNX, to OpenVINOintermediate representation format. The models are optimized withtechniques such as quantization, freezing, fusion, and more. Models can bedeployed across a mix of Intel® hardware and environments, on-premise andon-device, in the browser, or the cloud.

Besidesinference, OpenVINO provides the NeuralNetwork Compression Framework (NNCF) tool for implementing compressionalgorithms on models during training.

Figure1: OpenVINO™ overview. For detailed documentation about OpenVINO™ see:


OpenVINO™ Model Server (OVMS)

When it comes to deployment, youcan use OpenVINO Runtime, or you can use OpenVINO Model Server orOVMS for short.

OVMS is a scalable,high-performance tool for serving AI models and pipelines. It centralizes AI modelmanagement, ensuring consistent AI models across numerous devices, clouds,or compute nodes. Simply put, OVMS is a microservice that loads yourmodels, manages them, and exposes their capabilities through a network API,allowing other system components to interact with and utilize these models.OVMS supports two types of APIs—TensorFlow Serving and KServe compatible—whichprovide inference, model status, and model metadata services via gRPC orRESTful API2.

Why choose OVMS over OpenVINORuntime? There are several scenarios where OVMS is the better option. OpenVINOis a C++ project with an official Python binding, but what if your softwarestack is in another language? Implementing your own interface can be challenging.OVMS simplifies this by integrating OpenVINO into your system with pre-existingcapabilities. Additionally, if your system already operates in a microserviceparadigm, OVMS is an obvious choice. You might also prefer not to integrateOpenVINO directly into the business logic of other components or deal with thecomplexities of building the system. Moreover, if some applications run on lesspowerful devices, such as mobile phones, and you want to offload heavyinferencing to more powerful machines, OVMS can handle this by exposing anetwork API. Your components can run on multiple devices, sending data requeststo OVMS and receiving model outputs in response.

OVMS is ideal for scaling yoursolution. For instance, in a multi-node Kubernetes cluster, you can createmultiple replicas and set a load balancer in front of them, achieving highavailability and throughput beyond the capability of a single node. Thisaggregation is easily managed by OVMS.

For security and privacy, OVMSallows you to host your model server on a trusted machine, ensuring that otherapplications accessing it cannot see the model itself—only the exposed interface.

Besides these, for security andprivacy purposes OVMS enables you to host your model server on a machinethat you trust, and all the other applications that access it from inside oroutside cannot see your model. you just expose its interface with otherapplications and they can’t access or see the model.


Figure2. OpenVINO Model Server

Let’s examine the OVMS structure (Figure2). At the top, we have a network interface with gRPC and RESTful endpointssupporting TF Serving API and KServe API for inference and metadata calls.Metadata provides information on expected model inputs and outputs.

At next level, we haveconfiguration monitoring, scheduler, and model management. OVMS can servemultiple models simultaneously, specified in a configuration file, withbuilt-in model management and versioning. The model files don’t need to resideon a local file system; OVMS supports remote storage systems likeGoogle Cloud, AWS S3, and Azure. For learning more about OVMS please visit here.


AIConnect for Scientific Data (AiCSD)

AI Connect for Scientific Data (AiCSD)is an open-source software sample that connects data from scientificinstruments to AI pipelines and runs workloads at the edge.

It also manages pipelines for imageprocessing and automated image comparisons. AiCSD is a containerizedmicroservices-based solution utilizing open-source EdgeX Services and connectedby a secure Redis Message Broker and various communication APIs, whichmakes it adaptable for different use cases and settings. Figure 3 shows theservices created for this reference implementation.

The architectural components of AiCSDinclude:

·        Microservices: Provided by Intel, themicroservices include a user interface and applications for managing files andjobs.

·        EdgeX Application Services: AiCSDuses the APIs from the EdgeX Applications Services to communicate and transferinformation.

·        EdgeX Services:The services include the database, message broker, and security services.

·        Pipeline Execution: AiCSDfurnishes an example pipeline for pipeline management.

·        File System: AiCSD stores and manages inputand output files.

·        Third-party Input Devices: The devicessupply the images that will be processed. Examples include an opticalmicroscope or conveyor belt camera.


The reference architecture lets imagesbe processed using assigned jobs. The job tracks the movement of the file, thestatus, and any results or outputs from the pipeline. To process a job, sometasks help match information about a job to the appropriate pipeline to run.

The process can be elaborated asbelow:

1.       The InputDevice/Imager writes the file to the OEM file system in a directorythat is watched by the File Watcher. When the File Watcher detects thefile, it sends the job (JSON struct of particular fields) to the DataOrganizer via HTTP Request.

2.       The DataOrganizer sends the job to the Job Repository to create a new job inthe Redis Database. The job information is then sent to the TaskLauncher to determine if there is a task that matches the job. If there is,the job proceeds to the File Sender (OEM).

3.       The FileSender (OEM) is responsible for sending both the job and the file to the FileReceiver (Gateway). Once the File Receiver (Gateway) has written thefile to the Gateway file system, the job is then sent on to the TaskLauncher.

4.       The TaskLauncher verifies that there is a matching task for the job before sendingit to the appropriate pipeline using the EdgeX Message Bus (via Redis).The ML pipeline subscribes to the appropriate topic and processes the file inits pipeline. The output file (if there is one) is written to the file systemand the job is sent back to the Task Launcher.

5.       The TaskLauncher then decides if there is an output file or if there are just results.In the case of only results and no output file, the Task Launcher marks the jobas complete. If there is an output file, the Task Launcher sends the jobonward to the File Sender (Gateway).

6.       The FileSender (Gateway) publishes the job information to the EdgeXMessage Bus via Redis for the File Receiver (OEM) to subscribe andpull. The File Receiver (OEM) sends an HTTP request to the File Sender(Gateway) for the output file(s). The file(s) are sent as part of theresponse and the File Receiver (OEM) writes the output file(s) to thefile system.


Figure 3: Architecture andHigh-level Dataflow


AI Pipeline for CHOCell Segmentation Use Case

Let’s explain AI pipeline for CHOcell segmentation at a high level. As plate readers3 generate cellimages in their local file system, these images need to be transferred toanother device for analysis where AI software and hardware resources areavailable. This separation of data and model locations requires a flexible,microservice-based solution. We use the AiCSD microservice infrastructure totransfer the data to the edge compute device. AiCSD leverages EdgeX FoundryMicroservices to facilitate the automatic detection, management, and transferof scientific data. This microservice flexibility is crucial for addressing theheterogeneous system integration and asymmetric data interfacing inherent inthis project.

The AI pipeline on the edgecompute device includes image preprocessing, inference of multiple DeepLearning models optimized by the OpenVINO toolkit, and image postprocessing.Figure 4 shows an example of using Deep Learning models to process cell images,where UNet is used to mask and count MSC nuclei. These processes arecontainerized using BentoML,an open-source tool. Additionally, the OpenVINO Toolkit accelerates DeepLearning model inference, providing lower latency and higher throughput whilemaintaining accuracy and optimizing hardware usage. OVMS handles model managementand version control. Once the AI pipeline processing job is completed on theedge compute device, the final results are transferred back to the local filesystem of the original scientific device using the AiCSD microserviceinfrastructure to complete the task.

Overall, the integration of IntelEdge AI solutions enables the efficient implementation of the AI pipeline forthe CHO cell segmentation use case.


Figure 4. MSC Nuclei countingusing UNet deep learning model.



In the article, we discussed theimplementation of an AI pipeline for cell image analysis, particularly focusingon the application of Intel Edge AI solutions in processing brightfield cellimages. We highlighted Intel's OpenVINO toolkit as a crucial component foroptimizing the inference of existing Deep Learning models within the cell AIpipeline. Additionally, we explained how the OpenVINO Model Server operates asa microservice, enabling other components within a system to interact with andutilize the models effectively. Furthermore, we explored AI Connect forScientific Data (AiCSD) and its role in the efficient implementation of thebrightfield cell image analysis pipeline.

The journey toward fully realizing thecapabilities of AI in biopharma is ongoing, and Intel's contributions arepaving the way for a future where drug development is more agile, precise, andpatient-centric. Stay tuned for further insights as we continue to explore theexciting intersection of Edge AI technology and biopharmaceutical research.

Reach out to Intel's Health and LifeSciences team at or learn more about what we doat

We'd like to hear from you! Let usknow in the comments or discuss – which AI use cases in health and lifesciences do you think will have the greatest impact on global health?

If you enjoyed hearing from the Healthand Life Sciences team and want to hear more, give this post a like andensure you subscribe to get the latest updates from theteam. 


About the Author

Nooshin Nabizadeh has Electrical and Computer Engineering from the University of Miami and worksat Intel Corporation as AI Solutions Architect. She enjoys photography, writingpoetry, reading about psychology and philosophy, and optimizing solutions torun as fast as possible on a given piece of hardware. Connect with her onLinkedIn by mentioning thisblog.



1.      Brightfieldmicroscopy is a widely used technique for observing the morphology of cells andtissues.


3.      A plate reader is a laboratoryinstrument used to obtain images from samples in microtiter plates. The readershines a specific calibrated frequency of light (UV, visible, fluorescence,etc.) through the samples in the wells of the plate. Plate reader microscopydata sets have inherent variability which drives the requirement of regulartracked calibration and adjustment.


No items found.