Registration for ICASSP is free of charge, but registration is required to view the videos. If you have not yet registered, please access the full virtual conference by visiting: Your username is your email address and your password is your confirmation number/registration ID.


Speech Processing
Speech Recognition: Acoustic Modelling II


Titouan Parcollet

Date & Time

Fri, May 8, 2020

12:45 pm – 2:45 pm




Modern end-to-end (E2E) Automatic Speech Recognition (ASR) systems rely on Deep Neural Networks (DNN) that are mostly trained on handcrafted, pre-computed acoustic features such as Mel filter banks or Mel-frequency cepstral coefficients. Nonetheless, and despite worse performance, E2E ASR models processing raw waveforms remain an active research field due to the lossless nature of the input signal. In this paper, we propose E2E-SincNet, a novel fully E2E ASR model that goes from the raw waveform to the text transcript by merging two recent and powerful paradigms: SincNet and the joint CTC-attention training scheme. Experiments conducted on two different speech recognition tasks show that our approach outperforms previously investigated E2E systems relying either on the raw waveform or on pre-computed acoustic features, with a reported top-of-the-line Word Error Rate (WER) of 4.7% on the Wall Street Journal (WSJ) dataset.
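The SincNet front end mentioned in the abstract learns, per filter, only two cutoff frequencies; the convolution kernel itself is a windowed difference of two ideal low-pass sinc filters. A minimal NumPy sketch of how one such band-pass kernel is constructed (the function name, kernel size, and example cutoffs are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def sinc_bandpass_kernel(f1_hz, f2_hz, kernel_size=251, sample_rate=16000):
    """Build one SincNet-style band-pass filter kernel.

    In SincNet only the two cutoff frequencies (f1, f2) are learned;
    the kernel is the difference of two ideal low-pass sinc filters,
    smoothed with a Hamming window. All values here are illustrative.
    """
    # Normalise cutoffs to cycles per sample (must stay in [0, 0.5]).
    f1 = f1_hz / sample_rate
    f2 = f2_hz / sample_rate
    # Symmetric time axis centred on zero (kernel_size is odd).
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # np.sinc(x) = sin(pi x) / (pi x), so 2f * sinc(2 f n) is an
    # ideal low-pass filter with cutoff f; subtracting two of them
    # yields a band-pass response between f1 and f2.
    band_pass = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # Hamming window reduces ripple from truncating the ideal filter.
    band_pass *= np.hamming(kernel_size)
    # Peak-normalise so every filter has comparable amplitude.
    return band_pass / np.max(np.abs(band_pass))

# Example: a telephone-band filter (300 Hz to 3.4 kHz at 16 kHz).
kernel = sinc_bandpass_kernel(f1_hz=300.0, f2_hz=3400.0)
```

In the full model, a bank of such kernels is applied directly to the raw waveform as the first convolutional layer, and its output feeds the encoder trained with the joint CTC-attention objective.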


Titouan Parcollet

University of Oxford

Session Chairs

Dorothea Kolossa

Ruhr-Universität Bochum

Arun Narayanan

Google, Inc.