{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "05. Deep dive into SSD training: 3 tips to boost performance\n===============================================================\n\nIn the previous tutorial sphx_glr_build_examples_detection_train_ssd_voc.py,\nwe briefly went through the basic APIs that help build the training pipeline of SSD.\n\nIn this article, we will dive deep into the details and introduce tricks that are\nimportant for reproducing state-of-the-art performance.\nThese are the hidden pitfalls that are usually missing in papers and tech reports.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loss normalization: use batch-wise norm instead of sample-wise norm\n-------------------------------------------------------------------\nThe training objective given in the paper is a weighted sum of the localization\nloss (loc) and the confidence loss (conf).\n\n\\begin{align}L(x, c, l, g) = \\frac{1}{N} (L_{conf}(x, c) + \\alpha L_{loc}(x, l, g))\\end{align}\n\nBut the question is, what is the proper way to calculate N? 
Should we sum up\nN across the entire batch, or use per-sample N instead?\n\nTo illustrate this, let us generate some dummy data:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import mxnet as mx\nx = mx.random.uniform(shape=(2, 3, 300, 300)) # use batch-size 2\n# suppose image 1 has a single object\nid1 = mx.nd.array([1]) # one class id; any valid foreground id works for this demo\nbbox1 = mx.nd.array([[10, 20, 80, 90]]) # xmin, ymin, xmax, ymax\n# suppose image 2 has 4 objects\nid2 = mx.nd.array([1, 3, 5, 7])\nbbox2 = mx.nd.array([[10, 10, 30, 30], [40, 40, 60, 60], [50, 50, 90, 90], [100, 110, 120, 140]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, combine them into a batch by padding with -1 as the sentinel value:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gt_ids = mx.nd.ones(shape=(2, 4)) * -1\ngt_ids[0, :1] = id1\ngt_ids[1, :4] = id2\nprint('class_ids:', gt_ids)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gt_boxes = mx.nd.ones(shape=(2, 4, 4)) * -1\ngt_boxes[0, :1, :] = bbox1\ngt_boxes[1, :, :] = bbox2\nprint('bounding boxes:', gt_boxes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use a vgg16 atrous 300x300 SSD model in this example. 
For demo purposes, we\ndon't use any pretrained weights here.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from gluoncv import model_zoo\nnet = model_zoo.get_model('ssd_300_vgg16_atrous_voc', pretrained_base=False, pretrained=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some preparation before training:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from mxnet import gluon\nnet.initialize()\nconf_loss = gluon.loss.SoftmaxCrossEntropyLoss()\nloc_loss = gluon.loss.HuberLoss()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Simulate the training steps by manually computing the losses.\nIn practice you can use gluoncv.loss.SSDMultiBoxLoss, which serves the same purpose.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from mxnet import autograd\nfrom gluoncv.model_zoo.ssd.target import SSDTargetGenerator\ntarget_generator = SSDTargetGenerator()\nwith autograd.record():\n # 1. forward pass\n cls_preds, box_preds, anchors = net(x)\n # 2. 
generate training targets\n cls_targets, box_targets, box_masks = target_generator(\n anchors, cls_preds, gt_boxes, gt_ids)\n num_positive = (cls_targets > 0).sum().asscalar()\n cls_mask = (cls_targets >= 0).expand_dims(axis=-1) # negative targets should be ignored in loss\n # 3. losses, here we have two options: batch-wise or sample-wise norm\n # 3.1 batch-wise normalization: divide loss by the total number of positive targets in the batch\n batch_conf_loss = conf_loss(cls_preds, cls_targets, cls_mask) / num_positive\n batch_loc_loss = loc_loss(box_preds, box_targets, box_masks) / num_positive\n # 3.2 sample-wise normalization: divide by the number of positive targets in each sample (image)\n sample_num_positive = (cls_targets > 0).sum(axis=0, exclude=True)\n sample_conf_loss = conf_loss(cls_preds, cls_targets, cls_mask) / sample_num_positive\n sample_loc_loss = loc_loss(box_preds, box_targets, box_masks) / sample_num_positive\n # Since conf_loss and loc_loss return the mean loss of each sample, we\n # rescale them back to the summed loss per image:\n # number of elements divided by the batch size (shape[0]).\n rescale_conf = cls_preds.size / cls_preds.shape[0]\n rescale_loc = box_preds.size / box_preds.shape[0]\n # then call backward and step to update the weights, etc.\n # L = conf_loss + loc_loss * alpha\n # L.backward()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The norms are different, but the sample-wise norms sum up to the\nbatch-wise norm:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('batch-wise num_positive:', num_positive)\nprint('sample-wise num_positive:', sample_num_positive)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

#### Note

The per-image num_positive is no longer 1 and 4 because multiple anchor\n boxes can be matched to a single object.

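As a rough illustration of why one object can yield several positive targets, the sketch below matches a few hand-picked anchors against bbox1 by intersection over union (IoU). The anchor coordinates are made up for this example, and the 0.5 IoU threshold is an assumption taken from the usual SSD matching rule; this is not the actual SSDTargetGenerator logic.

```python
def iou(a, b):
    # a, b: boxes given as (xmin, ymin, xmax, ymax)
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

gt_box = (10.0, 20.0, 80.0, 90.0)  # same coordinates as bbox1
# hypothetical anchors, chosen only for illustration
anchors = [
    (10.0, 20.0, 80.0, 90.0),      # identical to the object
    (15.0, 25.0, 85.0, 95.0),      # slightly shifted
    (0.0, 10.0, 70.0, 80.0),       # shifted the other way
    (200.0, 200.0, 250.0, 250.0),  # far away, no overlap
]
iou_thresh = 0.5  # assumed matching threshold
ious = [iou(gt_box, a) for a in anchors]
num_matched = sum(1 for v in ious if v > iou_thresh)
print('IoU per anchor:', ious)
print('anchors matched to the single object:', num_matched)
```

Three of the four anchors clear the threshold here, so a single ground-truth box already contributes three positive targets.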
\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare the losses:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('batch-wise norm conf loss:', batch_conf_loss * rescale_conf)\nprint('sample-wise norm conf loss:', sample_conf_loss * rescale_conf)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('batch-wise norm loc loss:', batch_loc_loss * rescale_loc)\nprint('sample-wise norm loc loss:', sample_loc_loss * rescale_loc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which one is better?\nAt first glance, it is hard to say which one is theoretically better.\nBatch-wise norm ensures the loss is well normalized by global statistics,\nwhile sample-wise norm ensures gradients won't explode in extreme cases where\nthere are hundreds of objects in a single image.\nWith batch-wise norm, such an image would cause the other samples in the same\nbatch to be suppressed by its unusually large norm.\n\nIn our experiments, batch-wise norm is always better on Pascal VOC dataset,\ncontributing 1~2% mAP gain. 
However, you should definitely try both of them\nwhen you use a new dataset or a new model.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initializer matters: don't stick to one single initializer\n--------------------------------------------------------\nWhile SSD networks are based on pre-trained feature extractors (called the base_network),\nwe also append uninitialized convolutional layers to the base_network\nto extend the cascade of feature maps.\n\nThere are also convolutional\npredictors appended to each output feature map, serving as class predictors and bounding\nbox offset predictors.\n\nThese added layers must be initialized before training.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from gluoncv import model_zoo\nimport mxnet as mx\n# don't load pretrained weights for this demo\nnet = model_zoo.get_model('ssd_300_vgg16_atrous_voc', pretrained=False, pretrained_base=False)\n# random init\nnet.initialize()\n# gluon only infers parameter shapes once real input data is passed through\nnet(mx.nd.zeros(shape=(1, 3, 300, 300)))\n# now we have the real shape of each parameter\npredictors = [(k, v) for k, v in net.collect_params().items() if 'predictor' in k]\n# take the first predictor parameter as an example\nname, pred = predictors[0]\nprint(name, pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can initialize it with different initializers, such as Normal or Xavier.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pred.initialize(mx.init.Uniform(), force_reinit=True)\nprint('param shape:', pred.data().shape, 'peek first 20 elem:', pred.data().reshape((-1))[:20])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Simply switching from Uniform to Xavier can produce ~1% mAP gain.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ 
"pred.initialize(mx.init.Xavier(rnd_type='gaussian', magnitude=2, factor_type='out'), force_reinit=True)\nprint('param shape:', pred.data().shape, 'peek first 20 elem:', pred.data().reshape((-1))[:20])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interpreting confidence scores: process each class separately\n-----------------------------------------------------------\nIf we revisit the per-class confidence predictions, their shape is (B, A, N+1),\nwhere B is the batch size, A is the number of anchor boxes, and\nN is the number of foreground classes.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('class prediction shape:', cls_preds.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two ways we can handle the prediction:\n\n1. take argmax of the prediction along the class axis. This way, only\nthe most probable class is considered.\n\n2. process N foreground classes separately. 
This way, the second most\nprobable class, for example, still has a\nchance of surviving as the final prediction.\n\nConsider this example:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cls_pred = mx.nd.array([-1, -2, 3, 4, 6.5, 6.4])\ncls_prob = mx.nd.softmax(cls_pred, axis=-1)\nfor k, v in zip(['bg', 'apple', 'orange', 'person', 'dog', 'cat'], cls_prob.asnumpy().tolist()):\n print(k, v)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The probabilities of dog and cat are so close that if we use method 1,\nwe are quite likely to lose the bet when cat is the correct decision.\n\nIt turns out that by switching from method 1 to method 2, we gain 0.5~0.8 mAP in evaluation.\n\nOne obvious drawback of method 2 is that it is significantly slower than method 1.\nFor N classes, method 2 has O(N) complexity while method 1 is always O(1).\nThis may or may not be a problem depending on the use case, so feel free to switch between them as needed.\n\n.. hint::\n Check out :py:meth:gluoncv.nn.coder.MultiClassDecoder and\n :py:meth:gluoncv.nn.coder.MultiPerClassDecoder for implementations of method 1 and 2, respectively.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 0 }