Text this: Self-supervised learning of structural representations of visual objects