S2CD: Self-heuristic Speaker Content Disentanglement for Any-to-Any Voice Conversion
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
  • Pengfei Wei, ByteDance Ltd.
  • Xiang Yin, ByteDance Ltd.
  • Chunfeng Wang, ByteDance Ltd.
  • Zhonghao Li, ByteDance Ltd.
  • Xinghua Qu, ByteDance Ltd.
  • Zhiqiang Xu, Mohamed Bin Zayed University of Artificial Intelligence
  • Zejun Ma, ByteDance Ltd.
Document Type
Conference Proceeding
Abstract

In this paper, we propose a Self-heuristic Speaker Content Disentanglement (S2CD) model for any-to-any voice conversion that requires no external resources, e.g., speaker labels or vectors, linguistic models, or transcriptions. S2CD is built on the disentangled sequential variational autoencoder (DSVAE) but improves its structure at the model architecture level from three perspectives. Specifically, we develop different structures for the speaker and content encoders based on their underlying static/dynamic properties. We further propose a generative graph, modelled by S2CD, so that S2CD closely mimics the multi-speaker speech generation process. Finally, we propose a self-heuristic way to introduce bias into the prior modelling. Extensive empirical evaluations show the effectiveness of S2CD for any-to-any voice conversion.
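
For readers unfamiliar with the DSVAE backbone, the sketch below illustrates the speaker/content split the abstract refers to: a static (time-pooled) speaker encoder, a dynamic (per-frame) content encoder, and a decoder that recombines the two latents. All module names, layer sizes, and the pooling choice are illustrative assumptions, not the authors' S2CD implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of DSVAE-style speaker/content disentanglement for voice
# conversion. Layer sizes and names are assumptions for illustration only.

class SpeakerEncoder(nn.Module):
    """Static branch: one time-invariant speaker latent per utterance."""
    def __init__(self, n_mels=80, hidden=256, z_spk=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.mu = nn.Linear(2 * hidden, z_spk)
        self.logvar = nn.Linear(2 * hidden, z_spk)

    def forward(self, mel):                   # mel: (B, T, n_mels)
        h, _ = self.rnn(mel)
        h = h.mean(dim=1)                     # pool over time -> static code
        return self.mu(h), self.logvar(h)

class ContentEncoder(nn.Module):
    """Dynamic branch: one latent per frame, intended to carry content."""
    def __init__(self, n_mels=80, hidden=256, z_cnt=32):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, z_cnt)
        self.logvar = nn.Linear(hidden, z_cnt)

    def forward(self, mel):
        h, _ = self.rnn(mel)                  # keep the time axis -> dynamic codes
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """Recombine the static speaker code with the per-frame content codes."""
    def __init__(self, n_mels=80, hidden=256, z_spk=64, z_cnt=32):
        super().__init__()
        self.rnn = nn.GRU(z_spk + z_cnt, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, z_spk, z_cnt):          # z_spk: (B, D_s), z_cnt: (B, T, D_c)
        z_spk = z_spk.unsqueeze(1).expand(-1, z_cnt.size(1), -1)
        h, _ = self.rnn(torch.cat([z_spk, z_cnt], dim=-1))
        return self.out(h)

def reparameterize(mu, logvar):
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

# Conversion at inference time: content from the source utterance, speaker
# identity from the target utterance (mel_src, mel_tgt: (B, T, 80)).
#   z_spk = reparameterize(*speaker_enc(mel_tgt))
#   z_cnt = reparameterize(*content_enc(mel_src))
#   mel_converted = decoder(z_spk, z_cnt)
```

The key design point mirrored here is that the speaker branch is forced to be time-invariant (pooled) while the content branch stays time-varying, which is what allows the two factors to be swapped independently at conversion time.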

DOI
10.21437/Interspeech.2023-215
Publication Date
8-1-2023
Keywords
  • any-to-any
  • disentanglement
  • voice conversion
Comments

Paper available in INTERSPEECH

Citation Information
P. Wei, X. Yin, C. Wang, Z. Li, X. Qu, Z. Xu, and Z. Ma, "S2CD: Self-heuristic Speaker Content Disentanglement for Any-to-Any Voice Conversion", in Proceedings of the Annual Conf. of the Intl. Speech Communication Assoc. (INTERSPEECH 2023), vol. 2023-August, pp. 2288-2292, Aug. 2023. doi:10.21437/Interspeech.2023-215