Hierarchical disentangled representation learning for singing voice conversionView Publication
Conventional singing voice conversion (SVC) methods often suffer from operating in high-resolution audio owing to a high dimensionality of data. In this paper, we propose a hierarchical representation learning that enables the learning of disentangled representations with multiple resolutions independently. With the learned disentangled representations, the proposed method progressively performs SVC from low to high resolutions. Experimental results show that the proposed method outperforms baselines that operate with a single resolution in terms of mean opinion score (MOS), similarity score, and pitch accuracy.
Related PublicationsView All
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
Kazuki Shimada, Archontis Politis*, Parthasaarathy Sudarsanam*, Daniel Krause*, Kengo Uchida, Sharath Adavanne*, Aapo Hakala*, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen*, Yuki MitsufujiWhile direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded […]
Automatic Piano Transcription with Hierarchical Frequency-Time Transformer
Keisuke Toyama, Taketo Akama, Yukara Ikemiya, Yuhta Takida, Wei-Hsiang Liao, Yuki MitsufujiTaking long-term spectral and temporal dependencies into account is essential for automatic piano transcriptio […]