We present a method to estimate the fundamental frequency in monophonic audio, often referred to as pitch estimation. In contrast to existing methods, our neural network can be trained entirely on unlabeled data, using self-supervision. A tiny amount of labeled data is needed solely for mapping the network outputs to absolute pitch values. The key observation is that if one creates two examples from a single audio clip by pitch-shifting both, the difference between the correct outputs is known, even without knowing the actual pitch of the original clip. Somewhat surprisingly, this idea combined with an auxiliary reconstruction loss allows training a pitch estimation model. Our results show that our pitch estimation method obtains accuracy comparable to fully supervised models on monophonic audio, without the need for large labeled datasets. In addition, we are able to train a voicing detection output in the same model, again without using any labels.
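The core self-supervision idea can be sketched in a few lines: crop two windows from one log-frequency (e.g. CQT) frame at random bin offsets, so the offset difference is a known relative pitch shift, then penalize the model when the difference of its two pitch predictions disagrees with that shift. This is a minimal illustration, not the paper's implementation; the function names, window size, scale `sigma`, and Huber threshold `delta` are all assumptions chosen for clarity.

```python
import numpy as np

def pitch_shift_pair(cqt_frame, max_shift=8):
    """Crop two windows from one log-frequency frame at random bin
    offsets k1, k2. A shift of k bins along the log-frequency axis
    transposes pitch by a known interval, so (k1 - k2) serves as a
    ground-truth *relative* pitch offset -- no pitch labels needed."""
    k1, k2 = np.random.randint(0, max_shift + 1, size=2)
    win = len(cqt_frame) - max_shift
    return cqt_frame[k1:k1 + win], cqt_frame[k2:k2 + win], k1 - k2

def relative_pitch_loss(p1, p2, shift_bins, sigma=0.1, delta=0.25):
    """Huber-style penalty on the gap between the predicted pitch
    difference (p1 - p2) and the known shift, scaled by sigma.
    sigma and delta are illustrative constants, not the paper's values."""
    e = np.abs((p1 - p2) - sigma * shift_bins)
    return np.where(e <= delta, 0.5 * e ** 2, delta * (e - 0.5 * delta))
```

For example, if the model predicts `p1 = 0.5` and `p2 = 0.3` for windows offset by 2 bins with `sigma = 0.1`, the predicted difference matches the known shift exactly and the loss is zero; any disagreement is penalized. Mapping these relative outputs to absolute pitch values is where the small labeled set comes in.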