Enable Personalized Text-to-Speech Pipeline with SAMBERT-HifiGAN via OpenVINO Python API

Authors: Tianmeng Chen, Xiake Sun, Fiona Zhao, Su Yang

Introduction

Personalized speech synthesis uses voice clips of a particular person, recorded with ordinary devices, to let Text-To-Speech (TTS) technology synthesize that person's voice, manner of speaking, and emotion. SAMBERT-HifiGAN is a complete personalized TTS solution designed by Alibaba DAMO Academy. It consists of two parts: the SAMBERT acoustic model, followed by the HifiGAN vocoder.

Structure of SAMBERT
Structure of HifiGAN

In this blog, we introduce how to use the OpenVINO™ Python API to enable the SAMBERT-HifiGAN pipeline. All the project code can be found here.

KAN-TTS by Alibaba provides a tutorial for training SAMBERT-HifiGAN, and ModelScope provides a PyTorch-based pipeline for personalized speech synthesis. What we do here is replace the PyTorch-based parts of that pipeline with OpenVINO™. Note that, because of some operators in the model, a few modules cannot be replaced with the OpenVINO™ Python API.

Prerequisites

Since we are modifying a pipeline built on the PyTorch backend, the first step is to download the KAN-TTS source code and run the pipeline end to end, so that we can capture the model inputs and outputs as well as the intermediate-layer states. We also need an OpenVINO™ environment.

  1. Get the KAN-TTS source code and create a conda environment.
git clone -b develop https://github.com/alibaba-damo-academy/KAN-TTS.git
cd KAN-TTS
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
conda env create -f environment.yaml
conda activate maas
  2. Install OpenVINO™ in the same environment. If you need a specific version, you can install it yourself by following Install OpenVINO™.
pip install openvino
  3. Follow the official KAN-TTS practice tutorial in the ModelScope README.

          After you finish running the KAN-TTS pipeline, you will have the res folder and the ckpt files in speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k.

  4. Get the source code of the OpenVINO™ backend project and copy the res folder into the project folder. Here KAN_TTS_PATH and SambertHifigan_OV stand for the paths of the two cloned repositories.
git clone https://github.com/TianmengChen/SambertHifigan_OV.git
cp -r "$KAN_TTS_PATH/res" "$SambertHifigan_OV/"
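
Before moving on, it can help to confirm that the packages both backends depend on are importable in the maas environment. The snippet below is a minimal sketch using only the Python standard library. The helper name report_packages and the package list are illustrative assumptions, not part of KAN-TTS or the OpenVINO™ backend project.

```python
import importlib.util
from importlib import metadata


def report_packages(names):
    """Return {name: version or None}; None means the package is not importable."""
    report = {}
    for name in names:
        if importlib.util.find_spec(name) is None:
            report[name] = None
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under this exact name.
            report[name] = "unknown"
    return report


if __name__ == "__main__":
    # The package list is an assumption; extend it with whatever you rely on.
    for name, version in report_packages(["torch", "openvino"]).items():
        print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

If either package is reported as not installed, revisit steps 1 and 2 before running the conversion.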

Convert the PyTorch model to an OpenVINO™ model

Converting a PyTorch model to OpenVINO™ requires example model inputs, so we use test.txt as the input of SAMBERT and the res folder as the input of HifiGAN.

python kantts/bin/text_to_wav.py --txt test.txt --output_dir res/test_male_ptts_syn --res_zip speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/resource.zip --am_ckpt speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/pretrain_work_dir/tmp_am/ckpt/checkpoint_2400200.pth --voc_ckpt speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/pretrain_work_dir/orig_model/basemodel_16k/hifigan/ckpt/checkpoint_2400000.pth  --se_file speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/pretrain_work_dir/data/se/se.npy --is_ov_convert

After a few minutes, you will get two converted OpenVINO™ models: sambert_encoder.xml with sambert_encoder.bin, and hifigan_t.xml with hifigan_t.bin.

In the code, after loading the model and obtaining the inputs, we add the following code to convert the loaded PyTorch model to an OpenVINO™ model and save it.

Example code of SAMBERT pipeline
Example code of HifiGAN pipeline

Run the inference with the OpenVINO™ model

Before running the inference, the res folder should be renamed to allow for comparisons later.

mv res res_pytorch

Then run the same command as before, this time without the --is_ov_convert flag.

python kantts/bin/text_to_wav.py --txt test.txt --output_dir res/test_male_ptts_syn --res_zip speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/resource.zip --am_ckpt speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/pretrain_work_dir/tmp_am/ckpt/checkpoint_2400200.pth --voc_ckpt speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/pretrain_work_dir/orig_model/basemodel_16k/hifigan/ckpt/checkpoint_2400000.pth  --se_file speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/pretrain_work_dir/data/se/se.npy

After a few minutes, you will get the wav file in res/test_male_ptts_syn. For example, we write a random sentence in test.txt:

After running the pipeline, we get a 7-second wav file under the res folder.
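
Because the PyTorch results were kept in res_pytorch, the two runs can be compared file by file. The sketch below uses only the standard library and assumes 16-bit mono PCM wav files. The helper names are hypothetical, and the demo compares two synthetic sine waves so that it runs on its own. To compare real outputs, point the two paths at the matching files under res_pytorch and res.

```python
import array
import math
import os
import sys
import tempfile
import wave


def read_wav(path):
    """Return (sample_rate, samples) for a 16-bit mono PCM wav file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # wav data is little-endian
    return rate, samples


def compare_wavs(path_a, path_b):
    """Return both durations in seconds and the max abs sample difference."""
    rate_a, a = read_wav(path_a)
    rate_b, b = read_wav(path_b)
    n = min(len(a), len(b))  # compare only the overlapping part
    max_diff = max((abs(x - y) for x, y in zip(a[:n], b[:n])), default=0)
    return len(a) / rate_a, len(b) / rate_b, max_diff


def write_sine(path, amplitude, rate=16000, seconds=1.0):
    """Write a 440 Hz test tone; stands in for a synthesized utterance."""
    samples = array.array(
        "h",
        (int(amplitude * math.sin(2 * math.pi * 440 * i / rate))
         for i in range(int(rate * seconds))),
    )
    if sys.byteorder == "big":
        samples.byteswap()
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(samples.tobytes())


with tempfile.TemporaryDirectory() as tmp:
    ref_path = os.path.join(tmp, "pytorch.wav")
    cand_path = os.path.join(tmp, "openvino.wav")
    write_sine(ref_path, 10000)
    write_sine(cand_path, 10010)  # slightly different amplitude
    dur_ref, dur_cand, max_diff = compare_wavs(ref_path, cand_path)

print(f"durations: {dur_ref:.2f}s vs {dur_cand:.2f}s, max sample diff: {max_diff}")
```

A sample-by-sample comparison is only meaningful when both backends produce deterministic output of the same length. Otherwise, comparing durations and listening to both files is the more practical check.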

In the code, we modified the original PyTorch backend inference code so that the pipeline uses the OpenVINO™ backend for inference.

Example code of SAMBERT pipeline
Example code of HifiGAN pipeline

Summary

This blog describes how to run the SAMBERT-HifiGAN pipeline using the OpenVINO™ Python API. Please see the source code for more details and modifications.