moondream2 is a small vision language model designed to run efficiently on edge devices. Despite its small parameter count, it provides high-performance visual processing: it can quickly understand input images and respond to user queries. The model was developed by VikhyatK and is released under the permissive Apache 2.0 license, allowing commercial use.
MeloTTS, released by MyShell.ai, is a high-quality, multilingual Text-to-Speech (TTS) library that supports English, Chinese (with mixed English), and various other languages. The model's strengths lie in its lightweight design, well-suited for applications on AI PC systems, coupled with its impressive performance. In this article, I will guide you through the process of converting the model to be compatible with the OpenVINO toolkit, enabling it to run on various devices such as CPUs, GPUs and NPUs. Additionally, I will provide a concise overview of the model's inference procedure.
Overview of Model Inference Procedure and Pipeline
For each language type, the pipeline requires only two models (two inference procedures). For instance, English language generation necessitates just the 'bert-base-uncased' model and its corresponding MeloTTS-English. Similarly, for Chinese language generation (which includes mixed English), the pipeline needs only the 'bert-base-multilingual-uncased' and MeloTTS-Chinese. This greatly streamlines the pipeline compared to other TTS frameworks, and the compact size of the models makes them suitable for deployment on edge devices.
The text encoder accepts phones, tones, and the hidden states from a BERT model as input. It produces the text encoder's output along with its mean value and log variance. To align the input text with the target speech, the encoder outputs are processed by a stochastic duration predictor, which generates an alignment matrix. This matrix is then used to expand the mean and log variance (assuming a Gaussian distribution) to obtain the frame-level latent variables. Subsequently, the inverse flow transformation is applied to obtain the distribution of the final latent variable z, which represents the spectrogram. In the decoder, the final audio waveform is obtained by upsampling the spectrogram.
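To make this procedure concrete, here is a minimal, runnable toy sketch of the VITS-style inference flow described above. The functions are simplified stand-ins for the real networks, and the names do not match the MeloTTS source code.

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real networks, just to make the data flow concrete.
def text_encoder(phones, tones, bert_hidden, channels=192):
    T = len(phones)
    h = rng.standard_normal((T, channels))              # encoder hidden states
    m_p = rng.standard_normal((T, channels))            # per-phone mean
    logs_p = 0.1 * rng.standard_normal((T, channels))   # per-phone log variance
    return h, m_p, logs_p

def stochastic_duration_predictor(h, noise_scale_w=0.8):
    # Predicted duration (in spectrogram frames) per phone; the injected
    # randomness is what noise_scale_w controls.
    return np.maximum(1, np.round(3 + noise_scale_w * rng.standard_normal(len(h)))).astype(int)

def alignment_matrix(durations):
    # Hard monotonic alignment: repeat each phone for its predicted duration.
    attn = np.zeros((durations.sum(), len(durations)))
    t = 0
    for i, d in enumerate(durations):
        attn[t:t + d, i] = 1.0
        t += d
    return attn

def inverse_flow(z_p):
    return z_p  # identity stand-in for the inverse flow transformation

def decoder(z, upsample=256):
    return np.repeat(z.mean(axis=1), upsample)  # toy upsampling to a waveform

phones, tones = [1, 2, 3, 4], [0, 1, 0, 1]
h, m_p, logs_p = text_encoder(phones, tones, bert_hidden=None)
attn = alignment_matrix(stochastic_duration_predictor(h))
m_p, logs_p = attn @ m_p, attn @ logs_p          # expand stats to frame level
noise_scale = 0.6
z_p = m_p + rng.standard_normal(m_p.shape) * np.exp(logs_p) * noise_scale
audio = decoder(inverse_flow(z_p))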
The inference entry point is the model's infer function. In practical inference, a phone refers to a distinct speech sound, while a tone refers to the vocal pitch contour. For Chinese, a phone corresponds to pinyin, and a tone corresponds to one of the four tones. In English, phones are the consonants and vowels, and tones relate to stress patterns. Here, noise_scale and noise_scale_w do not refer to actual noise: noise_scale_w controls the randomness injected in the stochastic duration predictor, while noise_scale scales the sampling noise of the latent variables; both introduce randomness in order to enhance the expressiveness of the model.
Note that, unlike most other TTS models, MeloTTS does not include a voice cloning component, which makes it more lightweight. If voice cloning is required, please refer to OpenVoice.
Enable Model for OpenVINO
As previously mentioned, the pipeline requires just two models for each language. Taking English as an example, we must first convert both 'bert-base-uncased' and 'MeloTTS-English' into the OpenVINO IR format.
import openvino as ov
from pathlib import Path

# self.model is the loaded PyTorch MeloTTS model; the tensors below
# (x_tst, tones, bert, ...) are pseudo input data captured from one
# inference run and are used for tracing during conversion.
example_input = {
    "x": x_tst,
    "x_lengths": x_tst_lengths,
    "sid": speakers,
    "tone": tones,
    "language": lang_ids,
    "bert": bert,
    "ja_bert": ja_bert,
    "noise_scale": noise_scale,
    "length_scale": length_scale,
    "noise_scale_w": noise_scale_w,
    "sdp_ratio": sdp_ratio,
}
ov_model = ov.convert_model(
    self.model,
    example_input=example_input,
)

# Rename the input and output tensors to ease subsequent development.
get_input_names = lambda: ["phones", "phones_length", "speakers",
                           "tones", "lang_ids", "bert", "ja_bert",
                           "noise_scale", "length_scale", "noise_scale_w", "sdp_ratio"]
for input, input_name in zip(ov_model.inputs, get_input_names()):
    input.get_tensor().set_names({input_name})
outputs_name = ['audio']
for output, output_name in zip(ov_model.outputs, outputs_name):
    output.get_tensor().set_names({output_name})

# Reshape the model: set the batch size of all input tensors to 1.
shapes = {}
for input_layer in ov_model.inputs:
    shapes[input_layer] = input_layer.partial_shape
    shapes[input_layer][0] = 1
ov_model.reshape(shapes)
ov.save_model(ov_model, Path(ov_model_path))
As shown above, we convert the MeloTTS-English model directly from the PyTorch format by utilizing the openvino.convert_model API along with pseudo input data.
Note that the input and output layers are renamed to facilitate subsequent development (this is optional). Furthermore, the batch dimension for all inputs is fixed at 1, as multiple batches are not required here (this is also optional).
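Once saved, the IR can be compiled and run on a chosen device. The following is a minimal sketch of inference, reusing the tensors from example_input above (the IR path is a placeholder); the renamed tensors are what let us address inputs and outputs by name here.

import openvino as ov

core = ov.Core()
# Device can be "CPU", "GPU", etc.; the IR path is a placeholder.
compiled = core.compile_model("melo_tts_en.xml", "CPU")

# Thanks to the renamed tensors, inputs can be passed by name.
result = compiled({
    "phones": x_tst,
    "phones_length": x_tst_lengths,
    "speakers": speakers,
    "tones": tones,
    "lang_ids": lang_ids,
    "bert": bert,
    "ja_bert": ja_bert,
    "noise_scale": noise_scale,
    "length_scale": length_scale,
    "noise_scale_w": noise_scale_w,
    "sdp_ratio": sdp_ratio,
})
audio = result["audio"]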
We further quantized both the BERT and TTS models to int8 using pseudo data. We observed that our method of quantizing the TTS model introduces a slight distortion into the generated audio. To suppress it, we applied DeepFilterNet, which is also very lightweight.
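The quantization code itself is not shown here. As a rough illustration only, and not necessarily the exact method we used, post-training int8 quantization with NNCF typically looks like the following, where calibration_items is a placeholder for a list of example_input-style dicts built from pseudo data:

import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("melo_tts_en.xml")  # placeholder path

# calibration_items: placeholder list of example_input-style dicts
# built from pseudo text data.
calibration_dataset = nncf.Dataset(calibration_items)
quantized_model = nncf.quantize(model, calibration_dataset)
ov.save_model(quantized_model, "melo_tts_en_int8.xml")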
For more about model conversion and int8 quantization, please refer to MeloTTS-OV.
Run BERT part on NPU
To enhance performance and offload work from the CPU, we can shift execution of the BERT model to the NPU on Meteor Lake.
To adapt the model for the NPU, we convert it to accept static-shape inputs and pad each input during inference.
import openvino as ov
from openvino import Core
from pathlib import Path

def reshape_for_npu(model, bert_static_shape=32):
    # Change dynamic shapes to static shapes, as required by the NPU.
    # BERT inputs are [batch, seq_len], so both dimensions are fixed.
    shapes = dict()
    for input_layer in model.inputs:
        shapes[input_layer] = [1, bert_static_shape]
    model.reshape(shapes)
    ov.save_model(model, Path(ov_model_save_path))
    print(f"save static model in {Path(ov_model_save_path)}")

def main():
    # ov_model_path, ov_model_save_path and bert_static_shape are
    # defined elsewhere in the conversion script.
    core = Core()
    model = core.read_model(ov_model_path)
    reshape_for_npu(model, bert_static_shape=bert_static_shape)
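At inference time, each tokenized input must then be padded (or truncated) to bert_static_shape. Below is a minimal sketch, assuming for brevity that the static BERT model exposes a single input_ids input (the IR path and token ids are placeholders):

import numpy as np
import openvino as ov

def pad_to_static(token_ids, bert_static_shape=32, pad_id=0):
    # Pad (or truncate) a token id sequence to the fixed NPU input length.
    padded = np.full((1, bert_static_shape), pad_id, dtype=np.int64)
    n = min(len(token_ids), bert_static_shape)
    padded[0, :n] = token_ids[:n]
    return padded

core = ov.Core()
compiled_bert = core.compile_model("bert_static.xml", "NPU")  # placeholder path
token_ids = [101, 7592, 2088, 102]  # example tokenizer output
hidden = compiled_bert(pad_to_static(token_ids))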
Simple Demo
Here are the audio files generated by the int8 quantized model from OpenVINO.
MiniCPM is an end-side LLM developed by ModelBest Inc. and TsinghuaNLP. MiniCPM-V is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models take images and text as inputs and provide high-quality text outputs. MiniCPM-V 2.0 is an efficient version with promising performance for deployment. The model is built on SigLip-400M and MiniCPM-2.4B, connected by a perceiver resampler. In this blog, we provide the OpenVINO™ optimization for MiniCPM-V 2.0 on Intel® platforms.
python ov_minicpm-v2-test.py -m /path/to/MiniCPM-V-2_ov -pic /path/to/hk_OCR.jpg -p "Describe the content of the image"
Question: Describe the content of the image
Answer:
The image captures the vibrant and bustling atmosphere of a busy city street in Hong Kong. The street is lined with an array of neon signs and billboards, each one advertising a different business or establishment. The signs are in a variety of languages, including English, Chinese, and Japanese, reflecting the multicultural nature of the city.
The street itself is a hive of activity with several buses and a tram making their way through the traffic. The vehicles are in motion, adding a dynamic element to the scene.
The sky above is a beautiful gradient of colors, transitioning from a deep blue at the top to a lighter shade at the bottom. This suggests that the photo was taken during either sunrise or sunset, casting a warm glow over the cityscape.
The image also contains several text elements, including the names of various establishments and the brand names of products. These texts add another layer of information to the scene, providing insights into the nature of the businesses and the products they offer.
Overall, the image provides a vivid snapshot of life in Hong Kong, capturing the city's vibrant energy and the diverse range of businesses and products that make up its bustling streets.
Authors: Tianmeng Chen, Xiake Sun, Fiona Zhao, Su Yang
Introduction
Personalized speech synthesis is the process of using recording devices around you to record voice clips of a particular person, and then letting Text-To-Speech (TTS) technology synthesize that person's voice, manner of speaking, and emotion. SAMBERT-HifiGAN is a complete personalized TTS solution designed by Alibaba DAMO Academy, which consists of the SAMBERT acoustic model as its first part and the HifiGAN vocoder as its second.
In this blog, we will introduce how to utilize the OpenVINO™ Python API to enable the SAMBERT-HifiGAN pipeline. All the project code can be found here.
KAN-TTS by Alibaba provides a tutorial for training SAMBERT-HifiGAN. A PyTorch-based pipeline for personalized speech synthesis is provided on ModelScope; what we will do here is replace its PyTorch-based parts with OpenVINO™. It is worth noting that, due to some of the operators in the model, a few modules cannot be replaced with the OpenVINO™ Python API.
Pre-requisite
Since we need to make changes to the PyTorch-based pipeline, the first thing to do is to download the KAN-TTS source code and successfully run through the pipeline to obtain the inputs and outputs of the models as well as the intermediate layer states. Of course, we also need the OpenVINO™ environment.
Get the KAN-TTS source code and create an anaconda environment.
git clone -b develop https://github.com/alibaba-damo-academy/KAN-TTS.git
cd KAN-TTS
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
conda env create -f environment.yaml
conda activate maas
Then we install OpenVINO™ in the same environment. If you want a specific version of OpenVINO™, you can install it yourself through Install OpenVINO™.
pip install openvino
Follow the official KAN-TTS practice tutorial with the readme in ModelScope.
After you finish running the KAN-TTS pipeline, you will get the res folder and the checkpoint files speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k.
Get the OpenVINO™ backend project source code and copy the res folder into the project folder.
After a few minutes, you will get two converted OpenVINO™ models: sambert_encoder.xml/sambert_encoder.bin and hifigan_t.xml/hifigan_t.bin.
In the code, after we load the model and get the inputs, we add the following code to convert the loaded PyTorch backend model to an OpenVINO™ backend model and save it.
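A minimal sketch of that conversion step, assuming model is the loaded PyTorch module and inputs are the example inputs captured from the pipeline run (the save path mirrors the file names mentioned above):

import openvino as ov
from pathlib import Path

# 'model' and 'inputs' are assumed to come from the surrounding
# KAN-TTS pipeline code (loaded PyTorch module and its example inputs).
ov_model = ov.convert_model(model, example_input=inputs)
ov.save_model(ov_model, Path("sambert_encoder.xml"))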
Run the inference with the OpenVINO™ model
Before running the inference, the res folder should be renamed to allow for comparisons later.
After a few minutes, you will get the wav file in res/test_male_ptts_syn. For example, in test.txt we write a random sentence:
After running the pipeline, we will get a 7-second wav file under the res folder:
In the code, we modified the original PyTorch backend inference code so that the pipeline uses the OpenVINO™ backend for inference.
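As a simplified sketch of that modification (paths, variable names and output indices are illustrative, not the exact project code):

import openvino as ov

core = ov.Core()
# Compile the converted models once, then reuse them inside the pipeline.
sambert = core.compile_model("sambert_encoder.xml", "CPU")
hifigan = core.compile_model("hifigan_t.xml", "CPU")

# Where the PyTorch pipeline called the torch modules, the OpenVINO
# version feeds the same inputs to the compiled models instead:
mel = sambert(inputs)[0]   # acoustic features from the SAMBERT encoder
wav = hifigan(mel)[0]      # final waveform from the HifiGAN vocoder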
Summary
This blog describes how to run the SAMBERT-HifiGAN pipeline using the OpenVINO™ Python API. Please see the source code for more details and modifications.