Text this: Attention‐based spatial–temporal hierarchical ConvLSTM network for action recognition in videos