Video Understanding 1 COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning Jul 4, 2021