Detection¶
MXNet¶
A visualization of inference throughput vs. validation mAP of COCO pre-trained models is illustrated in the first graph.
We also provide a detailed interactive analysis of all 80 object categories.
The following tables list pre-trained models for object detection and their performance in more detail.
Hint
Model attributes are coded in their names. For instance, ssd_300_vgg16_atrous_voc consists of four parts:

- ssd indicates the algorithm is "Single Shot Multibox Object Detection" 1.
- 300 is the training image size, which means training images are resized to 300x300 and all anchor boxes are designed to match this shape. This may not apply to some models.
- vgg16_atrous is the type of base feature extractor network.
- voc is the training dataset. You can choose voc or coco, etc.

In addition:

- (320x320) indicates that the model was evaluated at resolution 320x320. Unless otherwise specified, all detection models in GluonCV can take various input shapes for prediction. Some models are trained with various input data shapes, e.g., Faster-RCNN and YOLO models.
- ssd_300_vgg16_atrous_voc_int8 is a quantized model calibrated on the Pascal VOC dataset for ssd_300_vgg16_atrous_voc.
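The naming scheme above can be unpacked programmatically. The following is a hypothetical helper for illustration only (it is not part of the GluonCV API, and it does not cover every model family):

```python
def parse_model_name(name):
    """Split a model-zoo name into the parts described above:
    algorithm, training image size, backbone, dataset, and an
    optional int8 quantization suffix."""
    parts = name.split("_")
    quantized = parts[-1] == "int8"
    if quantized:
        parts = parts[:-1]
    algorithm = parts[0]
    # The second token is the training image size only when it is numeric
    # (e.g. ssd_300_...); YOLO names like yolo3_darknet53_voc omit it.
    size = int(parts[1]) if parts[1].isdigit() else None
    backbone = "_".join(parts[2:-1]) if size is not None else "_".join(parts[1:-1])
    dataset = parts[-1]
    return {"algorithm": algorithm, "size": size, "backbone": backbone,
            "dataset": dataset, "int8": quantized}

parse_model_name("ssd_300_vgg16_atrous_voc_int8")
parse_model_name("yolo3_darknet53_voc")
```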
Hint
The training commands work with the following scripts:
- For SSD 1 networks: Download train_ssd.py
- For Faster-RCNN 2 networks: Download train_faster_rcnn.py
- For YOLO v3 3 networks: Download train_yolo3.py
Pascal VOC¶
Hint
For the Pascal VOC dataset, the training image set is the union of 2007trainval and 2012trainval, and the validation image set is 2007test.
The VOC metric, mean Average Precision (mAP) across all classes with IoU threshold 0.5, is reported.
Quantized SSD models are evaluated with nms_thresh=0.45, nms_topk=200.
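The Intersection-over-Union criterion underlying the VOC metric can be sketched in a few lines. Boxes here use the (xmin, ymin, xmax, ymax) format, which is an assumption made for illustration:

```python
def iou(a, b):
    """Intersection-over-Union of two boxes in (xmin, ymin, xmax, ymax) format."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # width of the overlap
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # height of the overlap
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Under the VOC metric, a detection counts as correct when its IoU with a
# ground-truth box of the same class is at least 0.5.
iou((0, 0, 10, 10), (5, 0, 15, 10))  # two half-overlapping boxes
```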
SSD¶
Check out the SSD demo tutorial here: 01. Predict with pre-trained SSD models
| Model | mAP | Training Command | Training log |
|---|---|---|---|
| ssd_300_vgg16_atrous_voc 1 | 77.6 | | |
| ssd_300_vgg16_atrous_voc_int8* 1 | 77.46 | | |
| ssd_512_vgg16_atrous_voc 1 | 79.2 | | |
| ssd_512_vgg16_atrous_voc_int8* 1 | 78.39 | | |
| ssd_512_resnet50_v1_voc 1 | 80.1 | | |
| ssd_512_resnet50_v1_voc_int8* 1 | 80.16 | | |
| ssd_512_mobilenet1.0_voc 1 | 75.4 | | |
| ssd_512_mobilenet1.0_voc_int8* 1 | 75.04 | | |
Faster-RCNN¶
Faster-RCNN models on the VOC dataset are evaluated at their native resolutions, with the shorter side >= 600 and the longer side <= 1000, without changing aspect ratios.
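This resize rule amounts to choosing a single scale factor per image; a minimal sketch (function name and rounding behavior are assumptions for illustration):

```python
def rcnn_resize(w, h, short=600, max_size=1000):
    """Scale an image so its shorter side becomes `short`, unless that would
    push the longer side past `max_size`, in which case cap the longer side
    instead. Aspect ratio is preserved either way."""
    scale = short / min(w, h)
    if max(w, h) * scale > max_size:
        scale = max_size / max(w, h)
    return round(w * scale), round(h * scale)

rcnn_resize(800, 600)   # shorter side reaches 600, longer side fits
rcnn_resize(1600, 600)  # longer side would exceed 1000, so it is capped
```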
Check out the Faster-RCNN demo tutorial here: 02. Predict with pre-trained Faster RCNN models
| Model | mAP | Training Command | Training log |
|---|---|---|---|
| faster_rcnn_resnet50_v1b_voc 2 | 78.3 | | |
YOLO-v3¶
YOLO-v3 models can be evaluated and used for prediction at different resolutions. Different mAPs are reported for the various evaluation resolutions; the models themselves are identical.
Check out the YOLO demo tutorial here: 03. Predict with pre-trained YOLO models
| Model | mAP | Training Command | Training log |
|---|---|---|---|
| yolo3_darknet53_voc 3 (320x320) | 79.3 | | |
| yolo3_darknet53_voc 3 (416x416) | 81.5 | | |
| yolo3_mobilenet1.0_voc 3 (320x320) | 73.9 | | |
| yolo3_mobilenet1.0_voc 3 (416x416) | 75.8 | | |
CenterNet¶
CenterNet models are evaluated at 512x512 resolution. mAPs with flipped inference(F) are also reported, however, the models are identical. Checkout CenterNet demo tutorial here: 11. Predict with pre-trained CenterNet models
Note that dcnv2
indicate that models include Modulated Deformable Convolution (DCNv2) layers, you may need to upgrade MXNet in order to use them.
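Flipped inference commonly means averaging the network's output for the original image with the output for the horizontally mirrored image, flipped back into the original frame. A minimal numpy sketch under that assumption (the (num_classes, H, W) heatmap layout is also an assumption):

```python
import numpy as np

def flip_average(heatmap_orig, heatmap_flipped):
    """Average the heatmap of the original image with the heatmap produced
    from the horizontally mirrored image. The mirrored heatmap is flipped
    back along the width axis (the last axis) before averaging."""
    return 0.5 * (heatmap_orig + heatmap_flipped[:, :, ::-1])

# A peak at column 0 of the original and a peak at the last column of the
# mirrored output refer to the same image location, so they reinforce.
a = np.zeros((1, 1, 3)); a[0, 0, 0] = 1.0
b = np.zeros((1, 1, 3)); b[0, 0, 2] = 1.0
out = flip_average(a, b)
```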
| Model | mAP (Orig/F) | Training Command | Training log |
|---|---|---|---|
| center_net_resnet18_v1b_voc 6 | 66.8/69.5 | | |
| center_net_resnet18_v1b_dcnv2_voc 6 | 71.2/74.7 | | |
| center_net_resnet50_v1b_voc 6 | 71.8/76.1 | | |
| center_net_resnet50_v1b_dcnv2_voc 6 | 75.6/78.7 | | |
| center_net_resnet101_v1b_voc 6 | 75.5/78.2 | | |
| center_net_resnet101_v1b_dcnv2_voc 6 | 76.7/79.2 | | |
MS COCO¶
Hint
For the COCO dataset, the training image set is train2017 and the validation image set is val2017.
The COCO metrics, Average Precision (AP) with IoU thresholds 0.5:0.95 (averaged over 10 values, AP 0.5:0.95), 0.5 (AP 0.5), and 0.75 (AP 0.75), are reported together in the format (AP 0.5:0.95)/(AP 0.5)/(AP 0.75).
For the object detection task, only box-overlap-based AP is evaluated and reported.
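The headline COCO number, AP 0.5:0.95, is simply the mean of AP at ten IoU thresholds. A sketch with made-up per-threshold values (the `example` APs are fabricated purely to show the aggregation):

```python
# IoU thresholds used by the COCO metric: 0.50, 0.55, ..., 0.95 (10 values).
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def coco_ap(ap_at):
    """AP 0.5:0.95: the mean of per-threshold APs.
    `ap_at` maps an IoU threshold to the AP measured at that threshold."""
    return sum(ap_at[t] for t in thresholds) / len(thresholds)

# Illustrative, made-up per-threshold APs: AP drops as the IoU threshold
# tightens, which is the typical shape for real detectors.
example = {t: max(0.0, 0.6 - (t - 0.5)) for t in thresholds}
coco_ap(example)
```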
SSD¶
Check out the SSD demo tutorial here: 01. Predict with pre-trained SSD models
| Model | Box AP | Training Command | Training Log |
|---|---|---|---|
| ssd_300_vgg16_atrous_coco 1 | 25.1/42.9/25.8 | | |
| ssd_512_vgg16_atrous_coco 1 | 28.9/47.9/30.6 | | |
| ssd_300_resnet34_v1b_coco 1 | 25.1/41.7/26.2 | | |
| ssd_512_resnet50_v1_coco 1 | 30.6/50.0/32.2 | | |
| ssd_512_mobilenet1.0_coco 1 | 21.7/39.2/21.3 | | |
Faster-RCNN¶
Faster-RCNN models on the COCO dataset are evaluated at their native resolutions, with the shorter side >= 800 and the longer side <= 1333, without changing aspect ratios.
Check out the Faster-RCNN demo tutorial here: 02. Predict with pre-trained Faster RCNN models
| Model | Box AP | Training Command | Training Log |
|---|---|---|---|
| faster_rcnn_resnet50_v1b_coco 2 | 37.0/57.8/39.6 | | |
| faster_rcnn_resnet101_v1d_coco 2 | 40.1/60.9/43.3 | | |
| faster_rcnn_fpn_resnet50_v1b_coco 4 | 38.4/60.2/41.6 | | |
| faster_rcnn_fpn_resnet101_v1d_coco 4 | 40.8/62.4/44.7 | | |
| faster_rcnn_fpn_bn_resnet50_v1b_coco 5 | 39.3/61.3/42.9 | | |
| faster_rcnn_fpn_syncbn_resnest50_coco 7 | 42.7/64.1/46.4 | | |
| faster_rcnn_fpn_syncbn_resnest101_coco 7 | 44.9/66.4/48.9 | | |
| faster_rcnn_fpn_syncbn_resnest269_coco 7 | 46.5/67.5/50.7 | | |
YOLO-v3¶
YOLO-v3 models can be evaluated and used for prediction at different resolutions. Different mAPs are reported for the various evaluation resolutions; the models themselves are identical.
Check out the YOLO demo tutorial here: 03. Predict with pre-trained YOLO models
| Model | Box AP | Training Command | Training Log |
|---|---|---|---|
| yolo3_darknet53_coco 3 (320x320) | 33.6/54.1/35.8 | | |
| yolo3_darknet53_coco 3 (416x416) | 36.0/57.2/38.7 | | |
| yolo3_darknet53_coco 3 (608x608) | 37.0/58.2/40.1 | | |
| yolo3_mobilenet1.0_coco 3 (320x320) | 26.7/46.1/27.5 | | |
| yolo3_mobilenet1.0_coco 3 (416x416) | 28.6/48.9/29.9 | | |
| yolo3_mobilenet1.0_coco 3 (608x608) | 28.0/49.8/27.8 | | |
CenterNet¶
CenterNet models are evaluated at 512x512 resolution. mAPs with flipped inference (F) are also reported; the models themselves are identical. Check out the CenterNet demo tutorial here: 11. Predict with pre-trained CenterNet models.
Note that dcnv2 indicates the model includes Modulated Deformable Convolution (DCNv2) layers; you may need to upgrade MXNet to use them.
| Model | mAP (Orig/F) | Training Command | Training log |
|---|---|---|---|
| center_net_resnet18_v1b_coco 6 | 26.6/28.1 | | |
| center_net_resnet18_v1b_dcnv2_coco 6 | 28.9/30.3 | | |
| center_net_resnet50_v1b_coco 6 | 32.1/33.4 | | |
| center_net_resnet50_v1b_dcnv2_coco 6 | 34.0/35.3 | | |
| center_net_resnet101_v1b_coco 6 | 34.5/35.8 | | |
| center_net_resnet101_v1b_dcnv2_coco 6 | 35.8/37.1 | | |
PyTorch¶
Models implemented in PyTorch will be added later. Please check out our MXNet implementation instead.
Reference¶
- 1: Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. "SSD: Single Shot MultiBox Detector." ECCV 2016.
- 2: Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." NIPS 2015, pp. 91-99.
- 3: Joseph Redmon and Ali Farhadi. "YOLOv3: An Incremental Improvement." arXiv preprint arXiv:1804.02767 (2018).
- 4: Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. "Feature Pyramid Networks for Object Detection." CVPR 2017.
- 5: Kaiming He, Ross Girshick, and Piotr Dollár. "Rethinking ImageNet Pre-training." arXiv preprint arXiv:1811.08883 (2018).
- 6: Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. "Objects as Points." arXiv preprint arXiv:1904.07850 (2019).
- 7: Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alex Smola. "ResNeSt: Split-Attention Networks." arXiv preprint (2020).