

The task of real-time alignment between a music performance and the corresponding score (sheet music), also known as score following, poses a challenging multi-modal machine learning problem. Training a system that can solve this task robustly with live audio and real sheet music (i.e., scans or score images) requires precise ground truth alignments between audio and note-coordinate positions in the score sheet images. However, these kinds of annotations are difficult and costly to obtain, which is why research in this area mainly utilizes synthetic audio and sheet images to train and evaluate score following systems. In this work, we propose a method that does not solely rely on note alignments but is additionally capable of leveraging data with annotations of lower granularity, such as bar or score system alignments. This allows us to use a large collection of real-world piano performance recordings coarsely aligned to scanned score sheet images and, as a consequence, to improve over current state-of-the-art approaches.
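
To make the use of lower-granularity annotations concrete, below is a minimal sketch of a possible mixed-granularity training loss; the function name, box encoding, and hinge penalty are illustrative assumptions, not the loss actually used in this work. When a note-level target box is available it is regressed directly; for bar- or system-level targets, the prediction is only penalized if its center falls outside the annotated region.

```python
# Hypothetical mixed-granularity loss (a sketch, not the authors' method).
# Boxes are (x_center, y_center, width, height), normalized to [0, 1].
import torch
import torch.nn.functional as F

def multi_granularity_loss(pred, target, granularity):
    # pred, target: (N, 4) predicted and annotated boxes.
    # granularity:  (N,) int tensor, 0 = note-level, 1 = bar/system-level.
    fine = granularity == 0
    loss = torch.zeros((), device=pred.device)
    if fine.any():
        # Precise targets: regress the note-level box directly.
        loss = loss + F.l1_loss(pred[fine], target[fine])
    coarse = ~fine
    if coarse.any():
        # Coarse targets: only require the predicted box center to lie
        # inside the annotated bar/system box (hinge on the overshoot).
        c, t = pred[coarse], target[coarse]
        lo = t[:, :2] - t[:, 2:] / 2   # lower corner of the coarse box
        hi = t[:, :2] + t[:, 2:] / 2   # upper corner of the coarse box
        outside = F.relu(lo - c[:, :2]) + F.relu(c[:, :2] - hi)
        loss = loss + outside.mean()
    return loss
```

The coarse branch deliberately under-constrains the prediction: a bar or system box says roughly where the matching notes are, but not their exact extent.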

Score following, or real-time audio-to-score alignment, aims at synchronizing musical performances (audio) to the corresponding scores (the printed sheet music from which the musicians are presumably playing) in an on-line fashion. In other words, the task is for a machine to listen to a musical recording or performance and be able to follow along in the sheet music of the respective piece, with a certain robustness to peculiarities of the particular live performance, such as unpredictable tempo and tempo changes, mistakes by the performing musicians, etc. Score following systems can be used for a variety of applications, including automatic page turning for musicians (Arzt et al., 2008), displaying synchronized information in concert halls (Arzt et al., 2015), and automatic accompaniment for solo musicians (Cont, 2010; Raphael, 2010; Cancino-Chacón et al., 2017a). Existing approaches usually rely on symbolic, computer-readable score representations such as MIDI or MusicXML (Orio et al., 2003; Dixon, 2005; Cont, 2006; Nakamura et al., 2015; Arzt, 2016). However, these kinds of representations are often not readily available and have to be either created by hand or automatically extracted from the printed scores using optical music recognition (OMR) (Calvo-Zaragoza et al., 2019). While the former is time-consuming and tedious (think of typesetting an entire Beethoven sonata or Mahler symphony), automatic extraction via OMR may require substantial manual corrections as well, depending on the complexity and quality of the score.

Recent advances in deep learning promise to overcome this problem by permitting us to perform score following directly on score sheet images (printouts, scans), which does not require any pre-processing or manually created score representations. More specifically, in previous work we have shown how neural networks can be trained to simultaneously listen to an incoming musical performance (audio) and read along in the score (image) (Henkel et al., 2019; Henkel et al., 2020; Henkel and Widmer, 2021), thus opening up a challenging multi-modal machine learning problem. In this article we build upon, and extend, our current state-of-the-art approach that frames sheet-image-based score following as a multi-modal conditional bounding box regression task (Henkel and Widmer, 2021). The task here is for a neural network to predict, at any time, the most likely position in the sheet image, in the form of a bounding box around the notes that match the incoming audio signal.
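
As an illustration of this formulation, the following is a minimal PyTorch sketch of a conditional bounding-box regressor; the layer sizes, input shapes, and names are assumptions made for the sake of a runnable example and do not reproduce the architecture of Henkel and Widmer (2021). The essential structure is a two-stream network: one encoder for the incoming audio excerpt and one for the score page, whose features are fused so that the regressed box is conditioned on what is currently being heard.

```python
# Sketch of multi-modal conditional bounding-box regression for score
# following (illustrative shapes and sizes, not the published architecture).
import torch
import torch.nn as nn

class ScoreFollower(nn.Module):
    def __init__(self):
        super().__init__()
        # Encodes a short spectrogram excerpt of the live audio.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),            # -> (N, 32)
        )
        # Encodes the (downscaled) score sheet image.
        self.image_enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ELU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),            # -> (N, 512)
        )
        # Regresses (x_center, y_center, width, height) in [0, 1],
        # conditioned on the audio via simple feature concatenation.
        self.head = nn.Sequential(
            nn.Linear(32 + 512, 128), nn.ELU(),
            nn.Linear(128, 4), nn.Sigmoid(),
        )

    def forward(self, audio, image):
        z = torch.cat([self.audio_enc(audio), self.image_enc(image)], dim=-1)
        return self.head(z)

model = ScoreFollower()
audio = torch.randn(1, 1, 78, 40)    # e.g., 78 frequency bins x 40 frames
image = torch.randn(1, 1, 416, 416)  # one grayscale score page
print(model(audio, image).shape)     # torch.Size([1, 4])
```

In a real-time setting, the audio excerpt would be updated frame by frame and the network queried repeatedly, yielding a box that tracks the performance across the page.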

In addition to the intrinsic difficulty of this task, with the many different ways in which the same musical passage can be typeset and played, we also face a severe data problem: training such a network requires large amounts of fine-grained annotations between note positions on the sheet image and in the audio. Obtaining such information at this level of precision via manual annotation is practically impossible, at least in the acoustic domain.
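
To give a sense of the granularity involved, a note-level alignment links every played note both to an onset time in the recording and to pixel coordinates on the scanned page; a hypothetical annotation record (field names are assumptions) could look as follows.

```python
# Illustrative shape of a note-level audio-to-sheet alignment annotation
# (hypothetical field names; actual datasets may structure this differently).
from dataclasses import dataclass

@dataclass
class NoteAlignment:
    onset_seconds: float  # onset of the note in the performance recording
    page: int             # index of the score page containing the note
    x: float              # horizontal pixel coordinate of the notehead
    y: float              # vertical pixel coordinate of the notehead

# A single piece can easily contain thousands of such records, each of
# which would have to be placed by hand when annotating real recordings.
example = NoteAlignment(onset_seconds=12.37, page=0, x=512.0, y=233.5)
```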

