You Only Look Once (YOLO): Unified, Real-Time Object Detection
This post is a summary of the paper You Only Look Once: Unified, Real-Time Object Detection.
Overview
Prior object detection systems repurpose classifiers to perform detection:
- DPM: sliding window
- R-CNN: region proposals
- A classifier scores each candidate region, then post-processing refines the bounding boxes, eliminates duplicate detections, and rescores them.
YOLO: a single network, trained end-to-end directly on detection performance
- Unified model (feat. GoogLeNet, NIN)
- Fast enough to run in real time.
- Frames detection as a single regression problem.
- Reasons globally about the image when making predictions.
- Predicts bounding-box coordinates and class probabilities simultaneously.
- Pipeline: CNN -> non-max suppression -> done (a minimal NMS sketch follows this list).
- Learns generalizable representations of objects.
- Works well across a variety of datasets.
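The paper applies standard non-max suppression on top of the raw network predictions; since that step isn't shown in this post, here is a minimal greedy IoU-based NMS sketch (the function name, the (x1, y1, x2, y2) box format, and the threshold are my own choices, not from the paper):

```python
import torch

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-max suppression. boxes: (N, 4) as (x1, y1, x2, y2)."""
    order = scores.argsort(descending=True)  # highest-scoring boxes first
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        rest = order[1:]
        # Intersection of the top box with every remaining box.
        x1 = torch.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # drop boxes that overlap too much
    return keep
```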
Related work (Basic concepts)
GoogLeNet architecture
NIN (Network in Network)
- Embedding another micro network (a small MLP) inside each convolutional layer improves performance.
Methods
Unified Detection
- Considers all bounding boxes and classes at once (i.e., looks at the entire image in a single pass).
1. Divide the input image into an S x S grid.
2. Each grid cell predicts B bounding boxes plus a confidence score for each box (how confident the model is that the box contains an object, and how accurate the box is).
- confidence = Pr(Object) * IOU(pred, truth)
- If no object is in the cell, the confidence should be zero.
- Each bbox prediction consists of: x, y, w, h, confidence
- x, y: center of the box relative to the bounds of the grid cell (normalized to 0-1)
- w, h: width and height of the box relative to the whole image (normalized to 0-1)
- confidence: as defined above
3. Each grid cell also predicts C conditional class probabilities. At test time these are multiplied by the per-box confidences to get class-specific scores:
- \[Pr(Class_i|Object) * Pr(Object) * IOU = Pr(Class_i) * IOU\]
- Predictions are encoded as an S x S x (B*5 + C) tensor.
- On PASCAL VOC, S = 7, B = 2, and C = 20 classes, giving a 7 x 7 x 30 tensor.
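A minimal sketch of how this encoding can be decoded at test time (the tensor layout follows the B*5 + C split above; variable names and the dummy input are my own):

```python
import torch

S, B, C = 7, 2, 20
pred = torch.rand(S, S, B * 5 + C)  # dummy network output for one image

# Per cell: B boxes of (x, y, w, h, confidence), then C class probabilities.
boxes = pred[..., : B * 5].reshape(S, S, B, 5)
box_conf = boxes[..., 4]            # Pr(Object) * IOU, shape (S, S, B)
class_prob = pred[..., B * 5 :]     # Pr(Class_i | Object), shape (S, S, C)

# Class-specific score for each box: Pr(Class_i) * IOU
class_scores = box_conf.unsqueeze(-1) * class_prob.unsqueeze(2)  # (S, S, B, C)
```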
Architecture
- Adapted from GoogLeNet.
- 24 convolutional layers + 2 fully connected layers.
- Fast YOLO (tiny YOLO) uses only 9 convolutional layers.
- Instead of GoogLeNet's Inception modules, 1x1 reduction layers (followed by 3x3 convolutions) keep the parameter count down.
- The first 20 convolutional layers (the feature extractor) are pretrained on ImageNet classification.
- Starting from the pretrained layers, the network is fine-tuned on VOC data.
- For finer-grained detection, the input resolution is doubled from 224x224 to 448x448.
- Following the NIN (Network in Network) idea, 4 convolutional layers and 2 FC layers are appended on top as the detection head (see the PyTorch implementation in the Code section below).
Loss function
- Sum-squared error (SSE) is chosen because it is simple and easy to optimize.
- Most grid cells contain no object, so the confidence loss from empty cells would overwhelm the gradient from cells that do contain objects; the bbox coordinate loss is therefore weighted up and the no-object confidence loss weighted down.
- lambda_coord = 5 and lambda_noobj = 0.5
- For box sizes, the same absolute error matters more in a small box than in a large one, so the loss uses the square roots of width and height: errors on small boxes are amplified, and errors on large boxes are dampened.
- 1(obj,i): indicator that an object appears in cell i
- 1(obj,ij): indicator that the j-th bbox predictor in cell i is "responsible" for that prediction (it has the highest IOU with the ground truth); the full loss is written out below.
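For reference, the full multi-part loss from the paper, using the indicator notation above:

$$
\begin{aligned}
\lambda_{\text{coord}} &\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
+\ \lambda_{\text{coord}} &\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \\
+\ &\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i-\hat{C}_i)^2
\ +\ \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i-\hat{C}_i)^2 \\
+\ &\sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c\,\in\,\text{classes}} (p_i(c)-\hat{p}_i(c))^2
\end{aligned}
$$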
Limitations
- Bounding-box predictions are constrained by the grid: each cell predicts only B boxes and a single class.
- Performs poorly when images contain object sizes or configurations not common in the training domain.
- Performs poorly on objects smaller than a grid cell (e.g., small objects appearing in groups).
Experiment results
- Makes fewer background mistakes (false positives on background) than Fast R-CNN.
- Combining with Fast R-CNN improves accuracy, but the combined system loses YOLO's speed advantage.
- Also works well on new datasets (generalizes to other domains).
Code
PyTorch
Network implementation
```python
import torch.nn as nn

# Feature extractor part: the first 20 conv layers (pretrained for classification)
feature_extract_net = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2, padding=3),   # 448 -> 224
    nn.LeakyReLU(0.1, inplace=True),
    nn.MaxPool2d(2),                            # 224 -> 112
    nn.Conv2d(64, 192, 3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.MaxPool2d(2),                            # 112 -> 56
    nn.Conv2d(192, 128, 1),                     # 1x1 reduction layer
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(128, 256, 3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(256, 256, 1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(256, 512, 3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.MaxPool2d(2),                            # 56 -> 28
    # four repeated (1x1 reduction, 3x3 conv) pairs
    nn.Conv2d(512, 256, 1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(256, 512, 3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(512, 256, 1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(256, 512, 3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(512, 256, 1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(256, 512, 3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(512, 256, 1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(256, 512, 3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(512, 512, 1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(512, 1024, 3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.MaxPool2d(2),                            # 28 -> 14
    nn.Conv2d(1024, 512, 1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(512, 1024, 3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(1024, 512, 1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(512, 1024, 3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
)
```
```python
# Classifier part: detection head appended on top of the feature extractor
S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

conv = nn.Sequential(
    # 4 conv layers
    nn.Conv2d(1024, 1024, 3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(1024, 1024, 3, stride=2, padding=1),  # 14 -> 7
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(1024, 1024, 3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(1024, 1024, 3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Flatten(),
    # 2 FC layers
    nn.Linear(7 * 7 * 1024, 4096),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Dropout(0.5, inplace=False),
    nn.Linear(4096, S * S * (5 * B + C)),  # 7 * 7 * 30 = 1470 outputs
    nn.Sigmoid(),
)
```
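As a quick sanity check of the output shape (chaining the two blocks into one model is my addition, not from the original post):

```python
import torch

model = nn.Sequential(feature_extract_net, conv)
x = torch.randn(1, 3, 448, 448)           # one dummy 448x448 RGB image
out = model(x).view(-1, S, S, 5 * B + C)  # reshape the 1470 outputs per image
print(out.shape)                          # torch.Size([1, 7, 7, 30])
```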
Additional studies
NIN (Network in Network)
References
https://www.youtube.com/watch?v=eTDcoeqj1_w