Registration for ICASSP is free of charge, but it is required to view the videos. If you have not yet registered, please visit: https://cmsworkshops.com/ICASSP2020/Registration.asp. Access the full virtual conference by visiting: https://2020.ieeeicassp-virtual.org/attendee/login. Your username is your email address and your password is your confirmation number/registration ID.

Speech Processing
SPE-L5.3
Lecture
Speech Synthesis and Voice Conversion I

MELLOTRON: MULTISPEAKER EXPRESSIVE VOICE SYNTHESIS BY CONDITIONING ON RHYTHM, PITCH AND GLOBAL STYLE TOKENS

Jason Li

Date & Time

Wed, May 6, 2020

10:00 am – 12:00 pm

Location

On-Demand

Abstract

Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data. By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles ranging from read speech to expressive speech, from slow drawls to rap and from monotonous voice to singing voice. Unlike other methods, we train Mellotron using only read speech data without alignments between text and audio. We evaluate our models using the LJSpeech and LibriTTS datasets. We provide F0 Frame Errors and synthesized samples that include style transfer from other speakers, singers and styles not seen during training, procedural manipulation of rhythm and pitch and choir synthesis.


Session Chair

Junichi Yamagishi

National Institute of Informatics