Optimizing Speech Emotion Recognition: SpeechBrain Meets OpenVINO™

Authors: Pradeep Sakhamoori, Ravi Panchumarthy

Introduction
Want to analyze emotions in speech recordings while keeping your AI application lean and mean? This blog post dives into combining the power of SpeechBrain's pre-trained emotion recognition models with OpenVINO™ for efficient inference. We'll explore how to leverage SpeechBrain's "emotion-recognition-wav2vec2-IEMOCAP" model and optimize it for blazing-fast performance using OpenVINO™.

Getting Started with SpeechBrain
SpeechBrain is a powerful open-source toolkit for developing Conversational AI technologies, including speech recognition, speaker recognition, and emotion recognition. In this blog post, we'll explore using SpeechBrain's pre-trained "emotion-recognition-wav2vec2-IEMOCAP" model to classify emotions in speech recordings and optimize this model for efficient inference using the OpenVINO™ toolkit.

SpeechBrain's Emotion Recognition Model and the IEMOCAP Dataset:

The "emotion-recognition-wav2vec2-IEMOCAP"model is fine-tuned on the IEMOCAP dataset, which contains approximately 12hours of audiovisual data with recordings of dialogues by 10 speakers portraying various emotions, including angry, excited, fear, sad, surprised, frustrated, happy, disappointed, and neutral. 

The model is based on the wav2vec2 architecture, which combines convolutional and residual blocks. Embeddings are extracted using attentive statistical pooling, and the system is trained with Additive Margin Softmax Loss.

Loading Custom Models with SpeechBrain's foreign_class Function:

The "foreign_class" function in SpeechBrain is a utility that allows you to load and use custom PyTorch models within the SpeechBrain ecosystem. It provides a convenient way to integrate external or custom-built models into SpeechBrain's inference pipeline without modifying the core SpeechBrain codebase.

Here's how you can load and use the "emotion-recognition-wav2vec2-IEMOCAP" model with foreign_class:

from speechbrain.inference.interfaces import foreign_class

# Load the pre-trained classifier from the Hugging Face Hub
classifier = foreign_class(
    source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP",
    pymodule_file="custom_interface.py",
    classname="CustomEncoderWav2vec2Classifier"
)

# Grab the underlying wav2vec2 PyTorch model (used later for conversion)
torch_model = classifier.mods["wav2vec2"].model

# Run inference on a sample recording
out_prob, score, index, text_lab = classifier.classify_file("speechbrain/emotion-recognition-wav2vec2-IEMOCAP/anger.wav")

print(f"Emotion Recognition with SpeechBrain PyTorch model: {text_lab}")

  • source: This argument specifies the source or location of the pre-trained model checkpoint. In this case, "speechbrain/emotion-recognition-wav2vec2-IEMOCAP" refers to a pre-trained model checkpoint on the Hugging Face Hub.
  • pymodule_file: This argument is the path to a Python file containing the definition of your custom PyTorch model class. In this example, "custom_interface.py" is the Python file defining the CustomEncoderWav2vec2Classifier class.
  • classname: This argument specifies the name of the custom PyTorch model class defined in the pymodule_file. In this case, "CustomEncoderWav2vec2Classifier" is the name of the class that extends SpeechBrain's Pretrained class and implements the necessary methods for inference.
  • classifier.classify_file: This is the inference function call for emotion classification on an audio file. It returns the class probabilities, the top score, the predicted class index, and a human-readable label, as shown in the snippet below.
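
To make the return values concrete, here is a quick sketch of inspecting them (exact shapes and types follow the model's custom_interface.py):

# Inspect the values returned by classify_file:
#   out_prob - tensor of class probabilities over the emotion labels
#   score    - confidence of the top prediction
#   index    - integer id of the predicted class
#   text_lab - human-readable label(s), e.g. ['ang']
print(out_prob.shape)                        # [batch, num_classes]
print(score.item(), index.item(), text_lab)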


Optimizing with OpenVINO™

To enhance the performance of our emotion recognition model, we leverage the OpenVINO™ toolkit.  OpenVINO™ empowers developers to write code once and deploy it across diverse Intel® hardware and environments. This includes on-premises, on-device, cloud, and browser deployments. You can also configure performance optimization parameters based on the use case, hardware, and target performance (latency/throughput). For more details, refer to OpenVINO runtime optimizations.
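
For example, a performance hint can be passed when compiling a model. Below is a minimal sketch; the model path and device are placeholders:

import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")  # placeholder path to an OpenVINO IR model

# Ask the runtime to optimize for latency; use "THROUGHPUT" for batch workloads
compiled = core.compile_model(model, device_name="CPU", config={"PERFORMANCE_HINT": "LATENCY"})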

Refer to the OpenVINO SpeechBrain notebook for full code implementation.

Following are the key steps to optimize the model using OpenVINO™:

Step 1: Convert the model to OpenVINO format:

Below is a code snippet illustrating the conversion of the SpeechBrain PyTorch model to OpenVINO IR format using the openvino.convert_model Python API. Note that convert_model traces the model, so it needs an example input; here we use a dummy one-second batch of 16 kHz audio.

import torch
import openvino as ov

# Dummy raw-audio batch (1 second at 16 kHz) used to trace the model graph
input_tensor = torch.zeros([1, 16000])
ov_model = ov.convert_model(torch_model, example_input=input_tensor)
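
Optionally, the converted model can be serialized to disk so the conversion does not have to be repeated on every run (the filename here is illustrative):

# Save the converted model as OpenVINO IR (.xml + .bin) for later reuse
ov.save_model(ov_model, "emotion_wav2vec2.xml")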

Step 2: Run Inference with OpenVINO™ Inference Engine:

After converting the model to OpenVINO format, compile the converted model for your target device and run inference. Below is a sample inference code snippet from the OpenVINO SpeechBrain notebook. For details on Inference Devices and Modes, see optimize-inference.

core = ov.Core()

# Compile the converted model for the target device (e.g., "CPU", "GPU", "AUTO")
compiled_model = core.compile_model(ov_model, device_name="CPU")

# Run inference; "wavs" is the batch of raw 16 kHz audio samples loaded earlier
output_tensor = compiled_model(wavs)[0]
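
The compiled wav2vec2 model produces frame-level features rather than final emotion labels. Below is a sketch of the remaining post-processing, reusing the pooling and classification modules of the classifier loaded earlier (module names follow the model's custom_interface.py):

import torch

feats = torch.from_numpy(output_tensor)
wav_lens = torch.ones(feats.shape[0])  # relative lengths: full-length utterances

# Pool the frame-level features, then apply the classification head
pooled = classifier.mods.avg_pool(feats, wav_lens)
pooled = pooled.view(pooled.shape[0], -1)
outputs = classifier.mods.output_mlp(pooled)

# Convert scores to probabilities and decode the predicted label
out_prob = classifier.hparams.softmax(outputs)
score, index = torch.max(out_prob, dim=-1)
text_lab = classifier.hparams.label_encoder.decode_torch(index)
print(f"Emotion Recognition with OpenVINO model: {text_lab}")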

Conclusion:

Integrating SpeechBrain’s pre-trained models with custom interfaces and optimizing them using OpenVINO™ can significantly enhance the efficiency of your AI applications. This approach not only improves model performance but also ensures seamless deployment across different hardware platforms. By following the steps outlined above, you can build a robust SpeechBrain Emotion Recognition Model optimized with OpenVINO™ runtime that is both powerful and efficient.


Call to Action:

Try it yourself: run the OpenVINO SpeechBrain notebook end to end, experiment with your own audio recordings, and explore the device and performance-hint options that best fit your deployment target.

Notices and Disclaimers:

Performance varies by use, configuration, and other factors. Learn more at www.intel.com/PerformanceIndex. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure. Intel technologies may require enabled hardware, software or service activation.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.