Automatic Piano Transcription with Hierarchical Frequency-Time Transformer
Taking long-term spectral and temporal dependencies into account is essential for automatic piano transcription. This is especially helpful when determining the precise onset and offset of each note in polyphonic piano content. We rely on the capability of the self-attention mechanism in Transformers to capture these long-term dependencies along the frequency and time axes. In this work, we propose hFT-Transformer, an automatic music transcription method that uses a two-level hierarchical frequency-time Transformer architecture. The first hierarchy consists of a convolutional block along the time axis, a Transformer encoder along the frequency axis, and a Transformer decoder that converts the dimension along the frequency axis. Its output is then fed into the second hierarchy, which consists of another Transformer encoder along the time axis. We evaluated our method on the widely used MAPS and MAESTRO v3.0.0 datasets, and it achieved state-of-the-art F1 scores on all metrics: Frame, Note, Note with Offset, and Note with Offset and Velocity estimation.
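The two-level architecture described in the abstract can be illustrated with a minimal PyTorch sketch. All module sizes, tensor shapes, and names (e.g. HierarchicalFTTransformerSketch, pitch_queries) are illustrative assumptions rather than the authors' implementation; the sketch only mirrors the stated ordering: a time-axis convolutional block, a frequency-axis encoder, a decoder that converts the frequency dimension into per-pitch tokens, and a second-hierarchy time-axis encoder.

```python
import torch
import torch.nn as nn

class HierarchicalFTTransformerSketch(nn.Module):
    """Illustrative sketch of a two-level frequency-time Transformer (shapes assumed)."""

    def __init__(self, n_bins=256, d_model=128, n_heads=4, n_pitches=88):
        super().__init__()
        # First hierarchy: depthwise convolution along the time axis, per frequency bin.
        self.time_conv = nn.Conv1d(n_bins, n_bins, kernel_size=3, padding=1, groups=n_bins)
        self.in_proj = nn.Linear(1, d_model)
        # Transformer encoder attending along the frequency axis (one sequence per frame).
        self.freq_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        # Transformer decoder converting the frequency dimension (n_bins -> n_pitches)
        # via learned queries, one per output pitch (an assumed mechanism for this sketch).
        self.freq_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), num_layers=1)
        self.pitch_queries = nn.Parameter(torch.randn(n_pitches, d_model))
        # Second hierarchy: Transformer encoder attending along the time axis.
        self.time_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, spec):
        # spec: (batch, n_bins, n_frames), e.g. a log-mel or CQT spectrogram.
        b, f, t = spec.shape
        x = self.time_conv(spec)                        # conv along time: (b, f, t)
        x = x.permute(0, 2, 1).reshape(b * t, f, 1)     # one frequency sequence per frame
        x = self.freq_encoder(self.in_proj(x))          # attention over frequency bins
        q = self.pitch_queries.unsqueeze(0).expand(b * t, -1, -1)
        x = self.freq_decoder(q, x)                     # (b*t, n_pitches, d_model)
        n_p = x.shape[1]
        x = x.reshape(b, t, n_p, -1).permute(0, 2, 1, 3).reshape(b * n_p, t, -1)
        x = self.time_encoder(x)                        # attention over frames
        return self.head(x).reshape(b, n_p, t)          # per-pitch, per-frame activations

if __name__ == "__main__":
    model = HierarchicalFTTransformerSketch()
    spec = torch.randn(2, 256, 128)      # (batch, frequency bins, frames)
    print(model(spec).shape)             # torch.Size([2, 88, 128])
```

In the actual method, separate outputs for frame, onset, offset, and velocity would be predicted; the single activation head above stands in for those targets only to keep the sketch compact.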