User-Guided Clustering for Video Segmentation on Coarse-Grained Feature Extraction