Registration for ICASSP is free of charge, but registration is required to view the videos. If you have not yet registered, please visit the full virtual conference site. Your username is your email address and your password is your confirmation number/registration ID.


Machine Learning for Signal Processing
Applications in Speech and Audio


Marco Tagliasacchi

Date & Time

Tue, May 5, 2020

12:30 pm – 2:30 pm




We present a method to estimate the fundamental frequency in monophonic audio, a task often referred to as pitch estimation. In contrast to existing methods, our neural network can be trained using only unlabeled data, via self-supervision. A tiny amount of labeled data is needed solely to map the network outputs to absolute pitch values. The key observation is that if one creates two examples from a single audio clip by pitch-shifting each of them, the difference between the correct outputs is known, even without knowing the actual pitch of the original clip. Somewhat surprisingly, this idea, combined with an auxiliary reconstruction loss, suffices to train a pitch estimation model. Our results show that our method achieves accuracy comparable to fully supervised models on monophonic audio, without the need for large labeled datasets. In addition, we are able to train a voicing detection output in the same model, again without using any labels.
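The self-supervision idea in the abstract can be sketched in a few lines: pitch-shift one clip by two random amounts, so the difference between the two correct pitch outputs is known even though the absolute pitch is not, then penalize the network for deviating from that known difference. The sketch below is illustrative only (the function names, the naive resampling-based shift, and the Huber-style loss are assumptions for this example, not the talk's exact implementation):

```python
import numpy as np

def pitch_shift_pair(clip, max_semitones=12, rng=np.random):
    """Create two pitch-shifted copies of a clip via naive resampling.

    The ground-truth *difference* in pitch between the two copies
    (k1 - k2 semitones) is known by construction, even though the
    absolute pitch of the original clip is not. A real pipeline would
    use a proper DSP library or shift frames of a log-frequency
    transform instead of raw-waveform interpolation.
    """
    k1, k2 = rng.uniform(-max_semitones, max_semitones, size=2)

    def shift(x, k):
        factor = 2.0 ** (k / 12.0)  # resampling ratio for a k-semitone shift
        idx = np.arange(0.0, len(x), factor)
        return np.interp(idx, np.arange(len(x)), x)

    return shift(clip, k1), shift(clip, k2), k1 - k2

def relative_pitch_loss(y1, y2, delta_semitones, sigma=1.0 / 12.0):
    """Huber-style loss on the predicted vs. known pitch difference.

    y1, y2 are the network's (relative) pitch outputs for the two
    shifted copies; their difference should equal sigma * delta.
    """
    e = (y1 - y2) - sigma * delta_semitones
    return 0.5 * e ** 2 if abs(e) <= 1.0 else abs(e) - 0.5
```

For example, a pair shifted 3 semitones apart with `sigma = 1/12` should have outputs differing by 0.25; if the network predicts exactly that, the loss is zero. Training on many such pairs teaches a consistent relative pitch scale, and the small labeled set mentioned in the abstract is needed only to anchor that scale to absolute frequencies.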



Session Chair

Ritwik Giri

Amazon Web Services