Visual object segmentation based on temporal and linguistic cues