Enable Personalized Text-to-Speech Pipeline with SAMBERT-HifiGAN via OpenVINO Python API
Authors: Tianmeng Chen, Xiake Sun, Fiona Zhao, Su Yang
Introduction
Personalized speech synthesis uses everyday recording devices to capture short voice clips of a particular person, and then uses Text-To-Speech (TTS) technology to synthesize that person's voice, manner of speaking, and emotion. SAMBERT-HifiGAN is a complete personalized TTS solution designed by Alibaba DAMO Academy. It consists of two parts: the SAMBERT acoustic model and the HifiGAN vocoder.
In this blog, we introduce how to use the OpenVINO™ Python API to enable the SAMBERT-HifiGAN pipeline. All the project code can be found here.
KAN-TTS by Alibaba provides a tutorial for training SAMBERT-HifiGAN, and ModelScope provides a PyTorch-based pipeline for personalized speech synthesis. What we do here is replace the PyTorch-based parts of that pipeline with OpenVINO™. It is worth noting that, because of some of the operators in the model, a few modules cannot be replaced with the OpenVINO™ Python API.
Prerequisites
Since we modify a pipeline that runs on the PyTorch backend, the first step is to download the KAN-TTS source code and run the pipeline successfully end to end. This gives us the inputs and outputs of the model as well as the state of the intermediate layers. We also need an OpenVINO™ environment.
- Get the KAN-TTS source code and create an Anaconda environment.
- Then install OpenVINO in the same environment. If you want a specific version of OpenVINO™, you can install it yourself by following Install OpenVINO™.
- Follow the official KAN-TTS practice tutorial in the ModelScope README.
After you finish running the KAN-TTS pipeline, you will have the res folder and the ckpt files (speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k).
- Get the OpenVINO™ backend project source code and copy the res folder into the project folder.
Convert the PyTorch model to an OpenVINO™ model
Converting a PyTorch model to OpenVINO™ requires example model inputs, so we use test.txt as the input of SAMBERT and the res folder as the input of HifiGAN.
After a few minutes, you will get two converted OpenVINO™ models: sambert_encoder.xml with sambert_encoder.bin, and hifigan_t.xml with hifigan_t.bin.
In the code, after we load the model and get the inputs, we add code that converts the loaded PyTorch backend model to an OpenVINO™ backend model and saves it.
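To illustrate the conversion step, here is a minimal sketch that assumes OpenVINO 2023.1 or newer, where `ov.convert_model` and `ov.save_model` are available. It is not necessarily the code used in the project: the helper names `ir_paths` and `convert_and_save` are hypothetical, and `torch_module` and `example_input` stand for whatever module and inputs the pipeline has already loaded.

```python
from pathlib import Path


def ir_paths(model_name, out_dir="."):
    """Return the (.xml, .bin) file pair that OpenVINO IR uses for one model."""
    base = Path(out_dir) / model_name
    return base.with_suffix(".xml"), base.with_suffix(".bin")


def convert_and_save(torch_module, example_input, model_name, out_dir="."):
    """Trace a loaded torch.nn.Module with an example input and save it as IR.

    Hypothetical helper: the real project may organize this step differently.
    """
    import openvino as ov  # imported lazily; assumes OpenVINO 2023.1 or newer

    torch_module.eval()
    # The example input lets OpenVINO trace the PyTorch graph.
    ov_model = ov.convert_model(torch_module, example_input=example_input)
    xml_path, _ = ir_paths(model_name, out_dir)
    # save_model writes the .bin weights file next to the .xml file.
    ov.save_model(ov_model, str(xml_path))
    return xml_path
```

Under these assumptions, calling `convert_and_save(module, example_input, "sambert_encoder")` would produce sambert_encoder.xml and sambert_encoder.bin. Note that `ov.save_model` compresses weights to FP16 by default; passing `compress_to_fp16=False` keeps FP32 weights if you want a closer comparison with the PyTorch outputs.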
Run the inference with the OpenVINO™ model
Before running the inference, rename the res folder so that the PyTorch results it contains can be compared with the OpenVINO™ results later.
Then run the project's inference command.
After a few minutes, you will get the wav file in res/test_male_ptts_syn. For example, we write a random sentence in test.txt.
After running the pipeline, we get a 7-second wav file under the res folder.
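One way to check the generated audio, and to compare it with the PyTorch result kept in the renamed folder, is sketched below using only the Python standard library. It assumes 16-bit PCM wav files. The function names are hypothetical, and the sketch is not part of the project code.

```python
import array
import sys
import wave


def duration_seconds(path):
    """Return the playing time of a PCM wav file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def read_pcm16(path):
    """Read a 16-bit PCM wav file and return (sample_rate, samples)."""
    with wave.open(str(path), "rb") as wav_file:
        if wav_file.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav_file.getframerate()
        frames = wav_file.readframes(wav_file.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # wav data is little-endian
    return rate, samples


def max_abs_difference(path_a, path_b):
    """Return the largest per-sample difference between two wav files."""
    rate_a, samples_a = read_pcm16(path_a)
    rate_b, samples_b = read_pcm16(path_b)
    if rate_a != rate_b or len(samples_a) != len(samples_b):
        raise ValueError("files differ in sample rate or length")
    return max((abs(a - b) for a, b in zip(samples_a, samples_b)), default=0)
```

Calling `duration_seconds` on the generated file should report roughly 7 seconds for the example above, and `max_abs_difference` on the PyTorch and OpenVINO™ outputs gives a rough idea of how closely the two backends agree. Small differences are expected, since the two backends do not compute bit-identical results.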
In the code, we modified the original PyTorch backend inference code so that the pipeline uses the OpenVINO™ backend for inference.
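As a rough illustration of what such a backend swap can look like, the sketch below compiles a saved IR file and runs one synchronous inference with the OpenVINO™ Python API. It assumes OpenVINO 2023.1 or newer. The function `run_openvino` and its optional `core` parameter are hypothetical and are not taken from the project code.

```python
def run_openvino(xml_path, inputs, device="CPU", core=None):
    """Compile an IR model and run one synchronous inference.

    inputs is a dict mapping input names to NumPy arrays (or a list of
    arrays in input order). Returns the outputs as a list, in model order.
    """
    if core is None:
        import openvino as ov  # imported lazily; assumes OpenVINO 2023.1+

        core = ov.Core()
    compiled = core.compile_model(str(xml_path), device)
    results = compiled(inputs)  # calling a CompiledModel runs inference
    return [results[output] for output in compiled.outputs]
```

In a pipeline like this one, the PyTorch tensors that used to be passed to the module would first be converted to NumPy arrays (for example with `tensor.detach().cpu().numpy()`), and the returned arrays would be wrapped back into tensors wherever the remaining PyTorch modules expect them. In real code the compiled model should be created once and reused, rather than recompiled on every call as in this sketch.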
Summary
This blog describes how to run the SAMBERT-HifiGAN pipeline using the OpenVINO™ Python API. Please see the source code for more details and modifications.