Describe: Object level grouping for video shots