Post

Dynamic Head: Unifying Object Detection Heads with Attentions

Summary

FPN 피쳐를 input으로 attention 활용 scale, spatial, task를 이해하는 head 제안함.


Method

Scale-aware Attention $ \pi_L $

  • avg pool. -> 1x1 conv layer & relu -> hard-sigmoid layer
  • pyramid feature는 level에 따른 semantic gap이 존재할 수 있음. 이 gap을 해소하고 피쳐 레벨별 adaptive하게 만들어줄 수 있는 모듈임.
\[\pi_L(F) \cdot F = \sigma (f(\frac 1 {SC} \sum_{S,C} F)) \cdot F\]

Spatial-aware Attention $ \pi_S $

  • deformable convolution 활용

\(\pi_S(F) \cdot F = \frac 1 L \sum^L_{l=1} \sum^K_{k=1} w_{l,k} \cdot F(l;p_k + \Delta p_k; c) \cdot \Delta m_k\) , where $K$ is the number of sparse sampling locations, $ p_k+\Delta p_k $ is a shifted location by the self-learned spatial offset and $\Delta m_k$ is a self-learned importance scalar at location $p_k$

Task-aware Attention $\pi_C$

  • avg pool. -> fc & norm. -> shifted sigmoid layer
  • switches ON and OFF channels of features

\(\pi_C(F) \cdot F = max (\alpha^1(F)\cdot F_c + \beta^1(F), \alpha^2(F)\cdot F_c + \beta^2(F))\) , where $[\alpha^1, \alpha^2, \beta^1, \beta^2]^T = \theta(\cdot)$ is a hyper function that learns to control the activateion thresholds.

합체: Dynamic Head

\(W(F) = \pi_C(\pi_S(\pi_L(F)\cdot F)\cdot F)\cdot F\)

  • 이 아웃풋이 object detection이나 center/box regression에 활용가능한 피쳐

Experiments

  • 여러 실험들이 있지만, backbone feature를 (deformable variant of) ResNext-101-64x4d 로 썼을 때 detection 태스크에서 SOTA 달성 했음.
This post is licensed under CC BY 4.0 by the author.