Unified image and video saliency modeling