Non-Local Spatial and Temporal Attention Network for Video-Based Person Re-Identification