A Sequence-to-Sequence Framework Based on Transformer With Masked Language Model for Optical Music Recognition