Dynamic Head: Unifying Object Detection Heads with Attentions

Posted Nov 11, 2024


Summary

Proposes a detection head that takes FPN features as input and uses attention to make the head scale-aware, spatial-aware, and task-aware.

Method

Scale-aware Attention $ \pi_L $

avg pool. -> 1x1 conv layer & ReLU -> hard-sigmoid layer
Pyramid features can have a semantic gap between levels. This module closes that gap by adaptively reweighting the features of each level.

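\[\pi_L(F) \cdot F = \sigma \left(f\left(\frac{1}{SC} \sum_{S,C} F\right)\right) \cdot F\]

where $f(\cdot)$ is a linear function approximated by a 1x1 convolution and $\sigma(x) = \max(0, \min(1, \frac{x+1}{2}))$ is the hard-sigmoid function.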
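A minimal NumPy sketch of the scale-aware gate, assuming the feature tensor is laid out as `(L, S, C)` (levels, flattened spatial positions, channels). The per-level parameters `w` and `b` are hypothetical stand-ins for the 1x1 convolution, not the authors' implementation.

```python
import numpy as np


def hard_sigmoid(x):
    """Hard sigmoid: sigma(x) = max(0, min(1, (x + 1) / 2))."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)


def scale_aware_attention(F, w, b):
    """Reweight each pyramid level of F by a single gate.

    F: array of shape (L, S, C).
    w, b: arrays of shape (L,), hypothetical parameters standing in for
          the 1x1 conv that approximates the linear function f.
    """
    pooled = F.mean(axis=(1, 2))           # average over S and C -> (L,)
    pre = np.maximum(w * pooled + b, 0.0)  # linear f, then ReLU
    gate = hard_sigmoid(pre)               # one scalar per level
    return gate[:, None, None] * F         # broadcast over S and C


rng = np.random.default_rng(0)
F = rng.normal(size=(3, 16, 8))            # 3 levels, 4x4 positions, 8 channels
out = scale_aware_attention(F, w=np.ones(3), b=np.zeros(3))
print(out.shape)  # (3, 16, 8)
```

Because the ReLU output is non-negative, the gate in this sketch always lies in [0.5, 1], so a level is attenuated but never fully switched off.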

Spatial-aware Attention $ \pi_S $

Uses deformable convolution.

$\pi_S(F) \cdot F = \frac{1}{L} \sum^L_{l=1} \sum^K_{k=1} w_{l,k} \cdot F(l;p_k + \Delta p_k; c) \cdot \Delta m_k$, where $K$ is the number of sparse sampling locations, $p_k+\Delta p_k$ is a location shifted by the self-learned spatial offset $\Delta p_k$, and $\Delta m_k$ is a self-learned importance scalar at location $p_k$.
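The formula can be sketched in NumPy as a modulated deformable 3x3 sampling followed by an average over levels. This is an illustration under several assumptions: all levels are already resized to one `(H, W)` grid, the offsets and masks are passed in as arguments rather than predicted from the features, and `w_{l,k}` is treated as a channel-mixing matrix. The helper names are hypothetical.

```python
import numpy as np


def bilinear_sample(feat, ys, xs):
    """Sample feat (H, W, C) at fractional coords ys, xs (H, W); zero outside."""
    H, W, C = feat.shape
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    out = np.zeros(ys.shape + (C,))
    for dy in (0, 1):
        for dx in (0, 1):
            yi, xi = y0 + dy, x0 + dx
            # Bilinear weight of this corner of the enclosing cell.
            wgt = (1.0 - np.abs(ys - yi)) * (1.0 - np.abs(xs - xi))
            valid = (yi >= 0) & (yi < H) & (xi >= 0) & (xi < W)
            yc, xc = np.clip(yi, 0, H - 1), np.clip(xi, 0, W - 1)
            out += (wgt * valid)[..., None] * feat[yc, xc]
    return out


def spatial_aware_attention(F, weights, offsets, masks):
    """F: (L, H, W, C); weights: (L, K, C, C_out);
    offsets: (H, W, K, 2) as (dy, dx); masks: (H, W, K). K = 9 here."""
    L, H, W, _ = F.shape
    base = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # p_k
    gy, gx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    out = np.zeros((H, W, weights.shape[-1]))
    for l in range(L):
        for k, (dy, dx) in enumerate(base):
            ys = gy + dy + offsets[..., k, 0]          # p_k + delta p_k (rows)
            xs = gx + dx + offsets[..., k, 1]          # p_k + delta p_k (cols)
            sampled = bilinear_sample(F[l], ys, xs)    # (H, W, C)
            out += (sampled @ weights[l, k]) * masks[..., k, None]
    return out / L                                      # average over levels


rng = np.random.default_rng(0)
F = rng.normal(size=(3, 4, 5, 2))
weights = np.zeros((3, 9, 2, 2))
weights[:, 4] = np.eye(2)                 # only the center tap is active
out = spatial_aware_attention(F, weights, np.zeros((4, 5, 9, 2)), np.ones((4, 5, 9)))
print(np.allclose(out, F.mean(axis=0)))   # True
```

With zero offsets, unit masks, and an identity center tap, the module reduces to a plain average over levels, which is a handy sanity check before plugging in learned offsets.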

Task-aware Attention $\pi_C$

avg pool. -> fc & norm. -> shifted sigmoid layer
Dynamically switches feature channels ON and OFF to favor different tasks.

$\pi_C(F) \cdot F = \max (\alpha^1(F)\cdot F_c + \beta^1(F), \alpha^2(F)\cdot F_c + \beta^2(F))$, where $[\alpha^1, \alpha^2, \beta^1, \beta^2]^T = \theta(\cdot)$ is a hyper function that learns to control the activation thresholds.
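A rough NumPy sketch of the task-aware gate, assuming an `(L, S, C)` feature layout. The hyper function here is a simplified stand-in with a single fully connected layer (`W`, `b` are hypothetical parameters) and no normalization layer, so it only illustrates the shape of the computation.

```python
import numpy as np


def hyper_function(F, W, b):
    """Simplified theta(.): pooled feature -> (alpha1, alpha2, beta1, beta2).

    F: (L, S, C). W: (C, 4*C) and b: (4*C,) are illustrative fc parameters.
    The normalization layer is omitted for brevity.
    """
    C = F.shape[-1]
    pooled = F.mean(axis=(0, 1))                 # avg pool over L and S -> (C,)
    z = pooled @ W + b                           # fully connected layer
    coeffs = 2.0 / (1.0 + np.exp(-z)) - 1.0      # shifted sigmoid, range (-1, 1)
    return coeffs.reshape(4, C)                  # rows: alpha1, alpha2, beta1, beta2


def task_aware_attention(F, coeffs):
    """Per-channel max of two learned affine functions of F."""
    a1, a2, b1, b2 = coeffs                      # each of shape (C,)
    return np.maximum(a1 * F + b1, a2 * F + b2)  # broadcasts over L and S


rng = np.random.default_rng(0)
F = rng.normal(size=(3, 16, 8))
coeffs = hyper_function(F, W=rng.normal(size=(8, 32)), b=np.zeros(32))
out = task_aware_attention(F, coeffs)
print(out.shape)  # (3, 16, 8)
```

Two special cases show the ON/OFF behavior: `alpha1 = 1` with the other coefficients at 0 gives a plain ReLU on that channel, while all four coefficients at 0 zeroes the channel out.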

Putting it together: Dynamic Head

$W(F) = \pi_C(\pi_S(\pi_L(F)\cdot F)\cdot F)\cdot F$

The resulting output is a feature that can be used for object detection, for example by the center/box regression branches.
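The nesting in $W(F)$ amounts to applying the three attentions one after another. A small sketch of that ordering, where each argument is any callable that already returns $\pi(F) \cdot F$; the function name and the toy stand-ins below are hypothetical.

```python
def dynamic_head_block(F, scale_attn, spatial_attn, task_attn):
    """Apply pi_L, then pi_S, then pi_C; each callable returns pi(F) * F."""
    F = scale_attn(F)      # innermost term: pi_L(F) * F
    F = spatial_attn(F)    # pi_S(...) * F
    F = task_attn(F)       # outermost term: pi_C(...) * F
    return F


# Toy stand-ins on a flat list of activations, only to show the call order.
halve = lambda xs: [0.5 * x for x in xs]        # plays the role of a scale gate
identity = lambda xs: list(xs)                  # plays the role of spatial attention
relu = lambda xs: [max(x, 0.0) for x in xs]     # plays the role of the task gate

print(dynamic_head_block([-2.0, 4.0], halve, identity, relu))  # [0.0, 2.0]
```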

Experiments

There are many experiments, but the headline result is that with a (deformable variant of) ResNeXt-101-64x4d backbone, the method achieved SOTA on the detection task.

Paper Review, Vision

Detection Attention

This post is licensed under CC BY 4.0 by the author.