Detection


MXNet

Inference throughput vs. validation mAP of COCO pre-trained models is illustrated in the graph below.

[Figure: inference throughput vs. validation mAP of COCO pre-trained detection models]

We also provide a detailed interactive analysis of all 80 object categories.


The following tables list pre-trained models for object detection and their performances with more details.

Hint

Model attributes are coded in their names. For instance, ssd_300_vgg16_atrous_voc consists of four parts:

  • ssd indicates the algorithm is “Single Shot Multibox Object Detection” [1].

  • 300 is the training image size, which means training images are resized to 300x300 and all anchor boxes are designed to match this shape. This may not apply to some models.

  • vgg16_atrous is the type of base feature extractor network.

  • voc is the training dataset. You can choose voc or coco, etc.

  • (320x320) indicates that the model was evaluated at 320x320 resolution. Unless otherwise specified, all detection models in GluonCV can take various input shapes for prediction. Some models are trained with various input data shapes, e.g., Faster-RCNN and YOLO models.

  • ssd_300_vgg16_atrous_voc_int8 is a quantized model of ssd_300_vgg16_atrous_voc, calibrated on the Pascal VOC dataset.
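In practice, these name strings are passed directly to the model zoo. A minimal sketch, assuming GluonCV and MXNet are installed:

    from gluoncv import model_zoo

    # Load pre-trained weights by the model name described above.
    net = model_zoo.get_model('ssd_300_vgg16_atrous_voc', pretrained=True)
    print(net.classes)  # the 20 Pascal VOC class names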

Hint

The training commands in the tables below work with the corresponding training scripts in the GluonCV repository.

Pascal VOC

Hint

For the Pascal VOC dataset, the training image set is the union of 2007trainval and 2012trainval, and the validation image set is 2007test.

The VOC metric, mean Average Precision (mAP) across all classes with an IoU threshold of 0.5, is reported.

Quantized SSD models (marked with *) are evaluated with nms_thresh=0.45 and nms_topk=200.
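A minimal sketch showing how these NMS settings can be applied to a GluonCV detector via set_nms (shown here on the fp32 model; the int8 variants are calibrated separately):

    from gluoncv import model_zoo

    # Quantized SSD results above are obtained with these NMS settings;
    # set_nms applies them to a detection network before evaluation.
    net = model_zoo.get_model('ssd_300_vgg16_atrous_voc', pretrained=True)
    net.set_nms(nms_thresh=0.45, nms_topk=200)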

SSD

Check out the SSD demo tutorial here: 01. Predict with pre-trained SSD models
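The tutorial walks through the full workflow; below is a condensed sketch (street.jpg is a hypothetical local image):

    from gluoncv import model_zoo, data, utils
    from matplotlib import pyplot as plt

    # Load a pre-trained SSD and preprocess one test image.
    net = model_zoo.get_model('ssd_512_resnet50_v1_voc', pretrained=True)
    x, img = data.transforms.presets.ssd.load_test('street.jpg', short=512)

    # The forward pass returns class IDs, confidence scores and corner boxes.
    class_ids, scores, bboxes = net(x)
    utils.viz.plot_bbox(img, bboxes[0], scores[0], class_ids[0], class_names=net.classes)
    plt.show()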

Model                               mAP     Training Command   Training log
----------------------------------  ------  ----------------   ------------
ssd_300_vgg16_atrous_voc [1]        77.6    shell script       log
ssd_300_vgg16_atrous_voc_int8* [1]  77.46   -                  -
ssd_512_vgg16_atrous_voc [1]        79.2    shell script       log
ssd_512_vgg16_atrous_voc_int8* [1]  78.39   -                  -
ssd_512_resnet50_v1_voc [1]         80.1    shell script       log
ssd_512_resnet50_v1_voc_int8* [1]   80.16   -                  -
ssd_512_mobilenet1.0_voc [1]        75.4    shell script       log
ssd_512_mobilenet1.0_voc_int8* [1]  75.04   -                  -

Faster-RCNN

Faster-RCNN models on the VOC dataset are evaluated at native resolutions, with the shorter side resized to at least 600 pixels and the longer side capped at 1000 pixels, without changing the aspect ratio.

Check out the Faster-RCNN demo tutorial here: 02. Predict with pre-trained Faster RCNN models
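A condensed sketch of the same workflow (street.jpg is again a hypothetical local image); the rcnn preset resizes the shorter side to 600 pixels and caps the longer side at 1000, matching the evaluation protocol above:

    from gluoncv import model_zoo, data, utils
    from matplotlib import pyplot as plt

    net = model_zoo.get_model('faster_rcnn_resnet50_v1b_voc', pretrained=True)
    # Shorter side -> 600 px, longer side capped at 1000 px, aspect ratio kept.
    x, img = data.transforms.presets.rcnn.load_test('street.jpg', short=600, max_size=1000)

    class_ids, scores, bboxes = net(x)
    utils.viz.plot_bbox(img, bboxes[0], scores[0], class_ids[0], class_names=net.classes)
    plt.show()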

Model                             mAP    Training Command   Training log
--------------------------------  -----  ----------------   ------------
faster_rcnn_resnet50_v1b_voc [2]  78.3   shell script       log

YOLO-v3

YOLO-v3 models can be evaluated and used for prediction at different resolutions. Different mAPs are reported for the various evaluation resolutions; the underlying weights, however, are identical.

Check out the YOLO demo tutorial here: 03. Predict with pre-trained YOLO models
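Since the weights are shared across resolutions, one network can be run at several input sizes by changing only the preprocessing. A minimal sketch (street.jpg is a hypothetical local image):

    from gluoncv import model_zoo, data

    net = model_zoo.get_model('yolo3_darknet53_voc', pretrained=True)

    # The same weights are evaluated at different resolutions; only the
    # preprocessing changes (shorter side resized to 320 or 416 here).
    for short in (320, 416):
        x, img = data.transforms.presets.yolo.load_test('street.jpg', short=short)
        class_ids, scores, bboxes = net(x)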

Model                                 mAP    Training Command   Training log
------------------------------------  -----  ----------------   ------------
yolo3_darknet53_voc [3] (320x320)     79.3   shell script       log
yolo3_darknet53_voc [3] (416x416)     81.5   shell script       log
yolo3_mobilenet1.0_voc [3] (320x320)  73.9   shell script       log
yolo3_mobilenet1.0_voc [3] (416x416)  75.8   shell script       log

CenterNet

CenterNet models are evaluated at 512x512 resolution. mAPs with flipped inference (F) are also reported; the underlying weights, however, are identical. Check out the CenterNet demo tutorial here: 11. Predict with pre-trained CenterNet models

Note that dcnv2 indicates that the model includes Modulated Deformable Convolution (DCNv2) layers; you may need to upgrade MXNet to use them.
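A minimal usage sketch (street.jpg is a hypothetical local image; the short argument is assumed to follow the center_net preset defaults for the 512x512 evaluation setting):

    from gluoncv import model_zoo, data, utils
    from matplotlib import pyplot as plt

    # dcnv2 variants need an MXNet build with Modulated Deformable Convolution;
    # the plain resnet18 variant below works with stock MXNet.
    net = model_zoo.get_model('center_net_resnet18_v1b_voc', pretrained=True)
    x, img = data.transforms.presets.center_net.load_test('street.jpg', short=512)

    class_ids, scores, bboxes = net(x)
    utils.viz.plot_bbox(img, bboxes[0], scores[0], class_ids[0], class_names=net.classes)
    plt.show()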

Model                                   mAP (Orig/F)   Training Command   Training log
--------------------------------------  -------------  ----------------   ------------
center_net_resnet18_v1b_voc [6]         66.8/69.5      shell script       log
center_net_resnet18_v1b_dcnv2_voc [6]   71.2/74.7      shell script       log
center_net_resnet50_v1b_voc [6]         71.8/76.1      shell script       log
center_net_resnet50_v1b_dcnv2_voc [6]   75.6/78.7      shell script       log
center_net_resnet101_v1b_voc [6]        75.5/78.2      shell script       log
center_net_resnet101_v1b_dcnv2_voc [6]  76.7/79.2      shell script       log

MS COCO

Hint

For the COCO dataset, the training image set is train2017 and the validation image set is val2017.

The COCO metrics, Average Precision (AP) with IoU thresholds 0.5:0.95 (averaged over 10 thresholds, AP 0.5:0.95), 0.5 (AP 0.5) and 0.75 (AP 0.75), are reported together in the format (AP 0.5:0.95)/(AP 0.5)/(AP 0.75).

For the object detection task, only box-overlap-based AP is evaluated and reported.
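As a small illustration of how the first number is formed, the snippet below averages per-threshold AP values over the ten IoU thresholds (ap_at is a hypothetical per-threshold AP lookup, not a GluonCV function):

    # AP 0.5:0.95 averages AP over ten IoU thresholds: 0.50, 0.55, ..., 0.95.
    ious = [0.50 + 0.05 * i for i in range(10)]
    ap_coco = sum(ap_at(iou) for iou in ious) / len(ious)  # ap_at is hypothetical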

SSD

Check out the SSD demo tutorial here: 01. Predict with pre-trained SSD models

Model                          Box AP           Training Command   Training Log
-----------------------------  ---------------  ----------------   ------------
ssd_300_vgg16_atrous_coco [1]  25.1/42.9/25.8   shell script       log
ssd_512_vgg16_atrous_coco [1]  28.9/47.9/30.6   shell script       log
ssd_300_resnet34_v1b_coco [1]  25.1/41.7/26.2   shell script       log
ssd_512_resnet50_v1_coco [1]   30.6/50.0/32.2   shell script       log
ssd_512_mobilenet1.0_coco [1]  21.7/39.2/21.3   shell script       log

Faster-RCNN

Faster-RCNN models on the COCO dataset are evaluated at native resolutions, with the shorter side resized to at least 800 pixels and the longer side capped at 1333 pixels, without changing the aspect ratio.

Check out the Faster-RCNN demo tutorial here: 02. Predict with pre-trained Faster RCNN models

Model                                       Box AP           Training Command   Training Log
------------------------------------------  ---------------  ----------------   ------------
faster_rcnn_resnet50_v1b_coco [2]           37.0/57.8/39.6   shell script       log
faster_rcnn_resnet101_v1d_coco [2]          40.1/60.9/43.3   shell script       log
faster_rcnn_fpn_resnet50_v1b_coco [4]       38.4/60.2/41.6   shell script       log
faster_rcnn_fpn_resnet101_v1d_coco [4]      40.8/62.4/44.7   shell script       log
faster_rcnn_fpn_bn_resnet50_v1b_coco [5]    39.3/61.3/42.9   shell script       log
faster_rcnn_fpn_syncbn_resnest50_coco [7]   42.7/64.1/46.4   shell script       log
faster_rcnn_fpn_syncbn_resnest101_coco [7]  44.9/66.4/48.9   shell script       log
faster_rcnn_fpn_syncbn_resnest269_coco [7]  46.5/67.5/50.7   shell script       log

YOLO-v3

YOLO-v3 models can be evaluated and used for prediction at different resolutions. Different mAPs are reported for the various evaluation resolutions; the underlying weights are identical.

Check out the YOLO demo tutorial here: 03. Predict with pre-trained YOLO models

Model                                  Box AP           Training Command   Training Log
-------------------------------------  ---------------  ----------------   ------------
yolo3_darknet53_coco [3] (320x320)     33.6/54.1/35.8   shell script       log
yolo3_darknet53_coco [3] (416x416)     36.0/57.2/38.7   shell script       log
yolo3_darknet53_coco [3] (608x608)     37.0/58.2/40.1   shell script       log
yolo3_mobilenet1.0_coco [3] (320x320)  26.7/46.1/27.5   shell script       log
yolo3_mobilenet1.0_coco [3] (416x416)  28.6/48.9/29.9   shell script       log
yolo3_mobilenet1.0_coco [3] (608x608)  28.0/49.8/27.8   shell script       log

CenterNet

CenterNet models are evaluated at 512x512 resolution. mAPs with flipped inference (F) are also reported; the underlying weights, however, are identical. Check out the CenterNet demo tutorial here: 11. Predict with pre-trained CenterNet models.

Note that dcnv2 indicates that the model includes Modulated Deformable Convolution (DCNv2) layers; you may need to upgrade MXNet to use them.

Model                                    mAP (Orig/F)   Training Command   Training log
---------------------------------------  -------------  ----------------   ------------
center_net_resnet18_v1b_coco [6]         26.6/28.1      shell script       log
center_net_resnet18_v1b_dcnv2_coco [6]   28.9/30.3      shell script       log
center_net_resnet50_v1b_coco [6]         32.1/33.4      shell script       log
center_net_resnet50_v1b_dcnv2_coco [6]   34.0/35.3      shell script       log
center_net_resnet101_v1b_coco [6]        34.5/35.8      shell script       log
center_net_resnet101_v1b_dcnv2_coco [6]  35.8/37.1      shell script       log

PyTorch

Models implemented in PyTorch will be added later. Please check out our MXNet implementation instead.

References

[1] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. “SSD: Single Shot MultiBox Detector.” ECCV 2016.

[2] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” NIPS 2015.

[3] Joseph Redmon and Ali Farhadi. “YOLOv3: An Incremental Improvement.” arXiv preprint arXiv:1804.02767 (2018).

[4] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. “Feature Pyramid Networks for Object Detection.” CVPR 2017.

[5] Kaiming He, Ross Girshick, and Piotr Dollár. “Rethinking ImageNet Pre-training.” arXiv preprint arXiv:1811.08883 (2018).

[6] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. “Objects as Points.” arXiv preprint arXiv:1904.07850 (2019).

[7] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander Smola. “ResNeSt: Split-Attention Networks.” arXiv preprint arXiv:2004.08955 (2020).