You Only Look Once (YOLO): Unified, Real-Time Object Detection

These are my notes from reading You Only Look Once: Unified, Real-Time Object Detection.



Overview


Previous object detection = repurposing classifiers to perform detection

  • DPM: sliding window
  • R-CNN: region proposals
  • A classifier assigns classes -> post-processing refines the boxes -> the boxes are rescored and merged

YOLO: a single network, optimized end-to-end directly on detection performance

  • Unified model (feat. GoogLeNet, NIN)
  • Fast enough to run in real time.
  • Frames detection as a single regression problem
    • Reasons globally about the image when making predictions
    • Predicts bbox coordinates and class probabilities at the same time.
    • CNN -> non-max suppression -> done!
  • Learns generalizable representations of objects.
  • Works on a variety of datasets.
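
The last step of the pipeline above, non-max suppression, can be sketched in plain Python. This is a minimal greedy version, assuming boxes are given as (x1, y1, x2, y2) corners with one score each. The function names, the sample boxes, and the 0.5 threshold are illustrative choices, not values taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Only boxes that overlap the kept box by at most the threshold survive.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Made-up detections: the first two boxes cover nearly the same region.
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

The second box overlaps the first almost completely, so only the higher-scoring one of the pair and the distant third box are kept.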

GoogLeNet architecture

NIN(Network in Network)

  • Embedding an additional micro network inside the network improves performance.

Methods


Unified Detection

  • All bboxes and classes are considered at once (i.e., the network looks at the whole image in a single pass)

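1. Divide the image into an S x S grid
2. Each cell predicts B bboxes, each with a confidence score (how likely the box is to contain an object, and how well it fits)
    - confidence = Pr(Object) * IOU(pred, truth)
    - no object in the cell -> confidence should be 0 (zero)
    - each bbox contains: x, y, w, h, confidence
    - x, y: center of the bbox relative to the bounds of its grid cell (norm. 0~1)
    - w, h: bbox width and height relative to the whole image (norm. 0~1)
    - c: bbox confidence
3. Class probability
  • Each cell predicts C conditional class probabilities \[Pr(Class_i|Object)\]. At test time they are multiplied by the box confidence to give class-specific scores: \[Pr(Class_i|Object) * Pr(Object) * IOU = Pr(Class_i) * IOU\]
  • Predictions are encoded as an S x S x (B*5 + C) tensor
  • On Pascal VOC, S = 7, B = 2, C (classes) = 20, which gives a 7 x 7 x 30 tensor.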
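
The tensor-size arithmetic and the test-time class scores can be checked with a few lines of Python. This is a small sketch using the Pascal VOC settings above; the per-cell predictions (the confidence value and the class indices 11 and 14) are made up for illustration.

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes (Pascal VOC)

values_per_cell = B * 5 + C            # 5 numbers per box plus C class probabilities
output_shape = (S, S, values_per_cell)
total_outputs = S * S * values_per_cell


def class_scores(box_confidence, class_probs):
    """Class-specific confidence: Pr(Class_i | Object) * (Pr(Object) * IOU)."""
    return [p * box_confidence for p in class_probs]


# Hypothetical predictions for a single cell and a single box.
box_confidence = 0.8
class_probs = [0.0] * C
class_probs[11] = 0.9   # made-up class index
class_probs[14] = 0.1   # made-up class index

scores = class_scores(box_confidence, class_probs)
best = max(range(C), key=lambda i: scores[i])
print(output_shape, total_outputs, best, round(scores[best], 2))
```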

Architecture


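  • Based on GoogLeNet
  • 24 conv layers + 2 FC layers
    • Fast YOLO (tiny YOLO): uses 9 conv layers.
  • Instead of the Inception modules used in GoogLeNet, uses 1x1 reduction layers followed by 3x3 conv layers, which reduces the number of parameters.
  • The first 20 conv layers, which act as the feature extractor, are pretrained on ImageNet classification.
    • The pretrained layers are then fine-tuned on VOC data
  • For better detection, the input resolution is doubled from 224 x 224 to 448 x 448.
  • NIN (Network in Network): the 4 conv layers + 2 FC layers added on top act as the classifier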
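
To see how a 448 x 448 input ends up as a 7 x 7 grid, the spatial size can be traced through the layers that change it. This is a rough sketch that assumes the kernel, stride, and padding values used in the PyTorch code in the Code section of these notes. The 1x1 convs and the 3x3 convs with padding 1 keep the size unchanged, so they are left out of the list.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a conv or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


# (description, kernel, stride, padding) for every size-changing layer, in order.
size_changing_layers = [
    ("conv 7x7, stride 2", 7, 2, 3),
    ("max pool 2x2", 2, 2, 0),
    ("max pool 2x2", 2, 2, 0),
    ("max pool 2x2", 2, 2, 0),
    ("max pool 2x2", 2, 2, 0),
    ("conv 3x3, stride 2 (in the 4-conv head)", 3, 2, 1),
]

size = 448
trace = [size]
for _name, kernel, stride, padding in size_changing_layers:
    size = conv_out(size, kernel, stride, padding)
    trace.append(size)

print(trace)
```

The final side length matches S = 7, which is why the first FC layer takes 7 * 7 * 1024 inputs.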

Loss function


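  • Sum-squared error (SSE) is chosen over MSE because it is simple and easy to optimize
  • Most grid cells contain no object, so their confidence terms would otherwise dominate the gradient. The loss therefore weights bbox coordinate errors more heavily than confidence errors for boxes without objects.
    • lambda(coord) = 5 and lambda(noobj) = 0.5
  • The same absolute error matters more for a small bbox than for a large one, so the loss uses the square roots of w and h: errors on small bboxes are penalized more than before, and errors on large bboxes slightly less.
  • 1(obj,i): 1 if an object appears in cell i
  • 1(obj,ij): 1 if the j-th bbox predictor in cell i is "responsible" for that prediction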
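
The weighting and the square-root trick can be sketched for a single box predictor. This is a simplified illustration, not the full loss: the class-probability term and the sums over cells and boxes are omitted, and the function names and sample numbers are hypothetical.

```python
import math

LAMBDA_COORD = 5.0   # weight on box coordinate errors
LAMBDA_NOOBJ = 0.5   # weight on confidence errors for predictors without an object


def responsible_box_loss(pred, target):
    """SSE terms for the predictor that is responsible for an object.

    pred and target are (x, y, w, h, confidence) tuples with values in [0, 1].
    """
    px, py, pw, ph, pc = pred
    tx, ty, tw, th, tc = target
    center = (px - tx) ** 2 + (py - ty) ** 2
    size = (math.sqrt(pw) - math.sqrt(tw)) ** 2 + (math.sqrt(ph) - math.sqrt(th)) ** 2
    confidence = (pc - tc) ** 2
    return LAMBDA_COORD * (center + size) + confidence


def no_object_loss(pred_confidence):
    """Confidence term for a predictor with no object (target confidence is 0)."""
    return LAMBDA_NOOBJ * pred_confidence ** 2


def size_error(true_value, pred_value):
    """Squared error between square roots, as used for w and h."""
    return (math.sqrt(pred_value) - math.sqrt(true_value)) ** 2


# The same absolute width error (0.05) on a small box and on a large box.
small = size_error(0.10, 0.15)
large = size_error(0.80, 0.85)
print(small > large)
```

With the square root, the 0.05 error costs several times more on the small box than on the large one, which is the behavior the notes describe.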

Limitations

  • Bbox predictions depend on the grid size.
    • Performs poorly when the training images do not reflect the object sizes that commonly appear in the domain
  • Performs poorly on objects smaller than a grid cell
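
One way to see the small-object limitation: each cell predicts only B boxes and a single set of class probabilities, so objects whose centers land in the same cell compete for it. The sketch below maps normalized centers to grid cells and reports cells holding more than one object. The object names and coordinates are made up, and `responsible_cell` is a hypothetical helper.

```python
from collections import defaultdict

S = 7  # grid size used on Pascal VOC


def responsible_cell(cx, cy, grid=S):
    """Map a normalized object center (cx, cy in [0, 1]) to its (row, col) cell."""
    col = min(int(cx * grid), grid - 1)
    row = min(int(cy * grid), grid - 1)
    return row, col


# Hypothetical object centers: two small objects close together, one far away.
centers = {
    "bird_a": (0.50, 0.50),
    "bird_b": (0.55, 0.52),
    "car": (0.10, 0.80),
}

cells = defaultdict(list)
for name, (cx, cy) in centers.items():
    cells[responsible_cell(cx, cy)].append(name)

# Cells that more than one object center falls into.
crowded = {cell: names for cell, names in cells.items() if len(names) > 1}
print(crowded)
```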

Experiment results


  • Makes fewer background (bg) errors.

  • Combining YOLO with Fast R-CNN makes it slower, because each model is run separately


  • Also works well on new datasets.

Code

  • PyTorch

  • network implementation

    import torch.nn as nn

    # Feature extractor: the 20 conv layers that are pretrained
    feature_extract_net = nn.Sequential(
        nn.Conv2d(3, 64, 7, stride=2, padding=3),
        nn.LeakyReLU(0.1, inplace=True),
        nn.MaxPool2d(2),

        nn.Conv2d(64, 192, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.MaxPool2d(2),

        nn.Conv2d(192, 128, 1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(128, 256, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(256, 256, 1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(256, 512, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.MaxPool2d(2),

        nn.Conv2d(512, 256, 1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(256, 512, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(512, 256, 1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(256, 512, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(512, 256, 1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(256, 512, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(512, 256, 1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(256, 512, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(512, 512, 1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(512, 1024, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.MaxPool2d(2),

        nn.Conv2d(1024, 512, 1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(512, 1024, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(1024, 512, 1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(512, 1024, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
    )
    
    # Classifier: 4 conv layers + 2 FC layers
    S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (Pascal VOC)

    conv = nn.Sequential(
        # 4 conv layers
        nn.Conv2d(1024, 1024, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(1024, 1024, 3, stride=2, padding=1),  # 14 x 14 -> 7 x 7
        nn.LeakyReLU(0.1, inplace=True),

        nn.Conv2d(1024, 1024, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(1024, 1024, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Flatten(),  # (N, 1024, 7, 7) -> (N, 7 * 7 * 1024)

        # 2 FC layers
        nn.Linear(7 * 7 * 1024, 4096),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Dropout(0.5, inplace=False),
        nn.Linear(4096, S * S * (5 * B + C)),
        # Squashes outputs to [0, 1]; the paper uses a linear activation here.
        nn.Sigmoid(),
    )
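
The last FC layer emits one flat vector of S * S * (5 * B + C) values per image, which still has to be read back as a grid. Below is a minimal pure-Python sketch of that step. It assumes a row-major cell order and a per-cell layout of B boxes followed by C class probabilities. That layout is an assumption made for illustration, and `split_cell` is a hypothetical helper, not part of the code above.

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (Pascal VOC)


def split_cell(flat_output, row, col):
    """Return (boxes, class_probs) for one grid cell of a flat S*S*(5B+C) vector."""
    per_cell = 5 * B + C
    start = (row * S + col) * per_cell      # assumed row-major cell order
    cell = flat_output[start:start + per_cell]
    # Assumed layout: B groups of (x, y, w, h, confidence), then C class probabilities.
    boxes = [tuple(cell[b * 5:(b + 1) * 5]) for b in range(B)]
    class_probs = cell[B * 5:]
    return boxes, class_probs


# Stand-in for the network output of a single image.
flat_output = [0.0] * (S * S * (5 * B + C))
boxes, class_probs = split_cell(flat_output, row=3, col=3)
print(len(flat_output), len(boxes), len(boxes[0]), len(class_probs))
```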
    

Additional studies


NIN(network in network)

References


https://www.youtube.com/watch?v=eTDcoeqj1_w

https://arclab.tistory.com/162

https://arclab.tistory.com/165