Transformer 5
- COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- DETR: End-to-End Object Detection with Transformers
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)
- Attention Is All You Need