Dynamic Head: Unifying Object Detection Heads with Attentions
Summary
Proposes a detection head that takes FPN features as input and uses attention to make the head scale-, spatial-, and task-aware.
Method
Scale-aware Attention $ \pi_L $
- avg pool. -> 1x1 conv layer & relu -> hard-sigmoid layer
- Pyramid features can have a semantic gap across levels; this module bridges that gap and adaptively reweights the features per level.
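The pipeline above (average pool, 1x1 conv, ReLU, hard sigmoid) can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: `W` and `b` are hypothetical stand-ins for the weights of the 1x1 conv over the level dimension, and the feature tensor follows the paper's (level, space, channel) layout.

```python
import numpy as np

def hard_sigmoid(x):
    # hard sigmoid: max(0, min(1, (x + 1) / 2))
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def scale_aware_attention(F, W, b):
    """F: (L, S, C) feature tensor (level, space, channel).
    W: (L, L), b: (L,) -- hypothetical 1x1-conv weights over levels."""
    pooled = F.mean(axis=(1, 2))                  # global average pool -> (L,)
    logits = W @ pooled + b                       # 1x1 conv over the level dim
    gate = hard_sigmoid(np.maximum(logits, 0.0))  # relu -> hard sigmoid
    return gate[:, None, None] * F                # per-level reweighting

# toy usage
rng = np.random.default_rng(0)
L, S, C = 3, 8, 4
F = rng.standard_normal((L, S, C))
out = scale_aware_attention(F, rng.standard_normal((L, L)) * 0.1, np.zeros(L))
```

Each level of `F` is scaled by a single gate in [0, 1], which is what lets the head emphasize or suppress whole pyramid levels.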
Spatial-aware Attention $ \pi_S $
- deformable convolution 활용
\(\pi_S(F) \cdot F = \frac{1}{L} \sum^L_{l=1} \sum^K_{k=1} w_{l,k} \cdot F(l; p_k + \Delta p_k; c) \cdot \Delta m_k\) , where $K$ is the number of sparse sampling locations, $p_k + \Delta p_k$ is the location shifted by the self-learned spatial offset $\Delta p_k$, and $\Delta m_k$ is a self-learned importance scalar at location $p_k$
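A simplified NumPy sketch of the formula above. Assumptions to keep it short: offsets are integer and sampling is nearest-neighbour with border clamping, whereas the real module is a modulated deformable convolution with bilinear interpolation and learned `offsets`/`masks`/`weights` (all three are hypothetical inputs here).

```python
import numpy as np

def spatial_aware_attention(F, offsets, masks, weights):
    """F:       (L, H, W, C) pyramid features
    offsets: (K, 2) integer offsets Delta p_k (learned in the real model)
    masks:   (K,)   importance scalars Delta m_k
    weights: (L, K) kernel weights w_{l,k}
    Returns (H, W, C): 1/L * sum_l sum_k w_{l,k} * F(l; p_k + dp_k; c) * dm_k."""
    Lvl, H, Wd, C = F.shape
    out = np.zeros((H, Wd, C))
    for l in range(Lvl):
        for k in range(offsets.shape[0]):
            dy, dx = offsets[k]
            # shift the feature map by (dy, dx), clamping at the border
            ys = np.clip(np.arange(H) + dy, 0, H - 1)
            xs = np.clip(np.arange(Wd) + dx, 0, Wd - 1)
            sampled = F[l][ys][:, xs]           # F(l; p_k + Delta p_k; c)
            out += weights[l, k] * sampled * masks[k]
    return out / Lvl                            # 1/L average over levels
```

With a single level, a zero offset, and unit mask/weight, the function reduces to the identity, which is a quick sanity check on the aggregation.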
Task-aware Attention $\pi_C$
- avg pool. -> fc & norm. -> shifted sigmoid layer
- switches ON and OFF channels of features
\(\pi_C(F) \cdot F = \max\left(\alpha^1(F)\cdot F_c + \beta^1(F),\ \alpha^2(F)\cdot F_c + \beta^2(F)\right)\) , where $[\alpha^1, \alpha^2, \beta^1, \beta^2]^T = \theta(\cdot)$ is a hyper function that learns to control the activation thresholds.
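The max-of-two-affine-maps form above (a DyReLU-style dynamic activation) can be sketched as follows. For brevity the hyper function $\theta$ (global pool followed by two fc layers in the paper) is collapsed into a single hypothetical linear map `theta_w`, `theta_b`, and the shifted sigmoid squashes its output into $[-1, 1]$.

```python
import numpy as np

def task_aware_attention(F, theta_w, theta_b):
    """F: (L, S, C). theta_w: (4*C, C), theta_b: (4*C,) -- hypothetical
    parameters of the hyper function theta, reduced to one linear map.
    Returns max(a1*F + b1, a2*F + b2) with per-channel (a1, a2, b1, b2)."""
    C = F.shape[-1]
    pooled = F.mean(axis=(0, 1))              # global average pool -> (C,)
    params = theta_w @ pooled + theta_b       # -> (4*C,)
    # shifted sigmoid: squash to [-1, 1]
    params = 2.0 / (1.0 + np.exp(-params)) - 1.0
    a1, a2, b1, b2 = params.reshape(4, C)
    return np.maximum(a1 * F + b1, a2 * F + b2)
```

When a channel's slopes and intercepts are driven to zero the channel is switched off entirely, which is the "ON/OFF" behaviour described above.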
Combining Them: Dynamic Head
\(W(F) = \pi_C(\pi_S(\pi_L(F)\cdot F)\cdot F)\cdot F\)
- The output is a feature that can be used for object detection, i.e. classification and center/box regression.
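The nesting order of the equation can be illustrated with toy stand-ins: each $\pi$ below is a plain sigmoid gate over one axis of the (L, S, C) tensor and already returns the gated feature, so composing them matches $\pi_C(\pi_S(\pi_L(F)\cdot F)\cdot F)\cdot F$. These gates are hypothetical placeholders for the learned modules, and in the paper the whole block is stacked several times.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pi_L(F):  # scale-aware stand-in: one gate per level
    return sigmoid(F.mean(axis=(1, 2)))[:, None, None] * F

def pi_S(F):  # spatial-aware stand-in: one gate per location
    return sigmoid(F.mean(axis=2))[:, :, None] * F

def pi_C(F):  # task-aware stand-in: one gate per channel
    return sigmoid(F.mean(axis=(0, 1)))[None, None, :] * F

def dynamic_head_block(F):
    # W(F) = pi_C(pi_S(pi_L(F) . F) . F) . F, applied sequentially
    return pi_C(pi_S(pi_L(F)))
```

Because every gate lies in (0, 1), each stage can only attenuate the feature, so the block preserves the tensor shape while reweighting it along all three axes.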
Experiments
- Among the various experiments, the model achieved SOTA on the detection task when using a (deformable variant of a) ResNeXt-101-64x4d backbone.
This post is licensed under CC BY 4.0 by the author.