{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Dive deep into Training a Simple Pose Model on COCO Keypoints\n\nIn this tutorial, we show you how to train a pose estimation model [1]_ on the COCO dataset.\n\nFirst let's import some necessary modules.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from __future__ import division\n\nimport time, logging, os, math\n\nimport numpy as np\nimport mxnet as mx\nfrom mxnet import gluon, nd\nfrom mxnet import autograd as ag\nfrom mxnet.gluon import nn\nfrom mxnet.gluon.data.vision import transforms\n\nfrom gluoncv.data import mscoco\nfrom gluoncv.model_zoo import get_model\nfrom gluoncv.utils import makedirs, LRScheduler\nfrom gluoncv.data.transforms.presets.simple_pose import SimplePoseDefaultTrainTransform\nfrom gluoncv.utils.metrics import HeatmapAccuracy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the data\n\nWe can load COCO Keypoints dataset with their official API\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "train_dataset = mscoco.keypoints.COCOKeyPoints('~/.mxnet/datasets/coco',\n splits=('person_keypoints_train2017'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset object enables us to retrieve images containing a person,\nthe person's keypoints, and meta-information.\n\nFollowing the original paper, we resize the input to be ``(256, 192)``.\nFor augmentation, we randomly scale, rotate or flip the input.\nFinally we normalize it with the standard ImageNet statistics.\n\nThe COCO keypoints dataset contains 17 keypoints for a person.\nEach keypoint is annotated with three numbers ``(x, y, v)``, where ``x`` and ``y``\nmark the coordinates, and ``v`` indicates if the keypoint is visible.\n\nFor each keypoint, we generate a gaussian kernel centered at the ``(x, y)`` coordinate, and use\nit as the training label. This means the model predicts a gaussian distribution on a feature map.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "transform_train = SimplePoseDefaultTrainTransform(num_joints=train_dataset.num_joints,\n joint_pairs=train_dataset.joint_pairs,\n image_size=(256, 192), heatmap_size=(64, 48),\n scale_factor=0.30, rotation_factor=40, random_flip=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can define our data loader with the dataset and transformation. We will iterate\nover ``train_data`` in our training loop.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "batch_size = 32\ntrain_data = gluon.data.DataLoader(\n train_dataset.transform(transform_train),\n batch_size=batch_size, shuffle=True, last_batch='discard', num_workers=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deconvolution Layer\n\nA deconvolution layer enlarges the feature map size of the input,\nso that it can be seen as a layer upsamling the input feature map.\n\n\n\nIn the above image, the blue map is the input feature map, and the cyan map is the output.\n\nIn a ``ResNet`` model, the last feature map shrinks its height and width to be only 1/32 of the input. It may\nbe too small for a heatmap prediction. However if followed by several deconvolution layers, the feature map\ncan have a larger size thus easier to make the prediction.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Definition\n\nA Simple Pose model consists of a main body of a resnet, and several deconvolution layers.\nIts final layer is a convolution layer predicting one heatmap for each keypoint.\n\nLet's take a look at the smallest one from the GluonCV Model Zoo, using ``ResNet18`` as its base model.\n\nWe load the pre-trained parameters for the ``ResNet18`` layers,\nand initialize the deconvolution layer and the final convolution layer.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "context = mx.gpu(0)\nnet = get_model('simple_pose_resnet18_v1b', num_joints=17, pretrained_base=True,\n ctx=context, pretrained_ctx=context)\nnet.deconv_layers.initialize(ctx=context)\nnet.final_layer.initialize(ctx=context)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can take a look at the summary of the model\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "x = mx.nd.ones((1, 3, 256, 192), ctx=context)\nnet.summary(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
The Batch Normalization implementation from cuDNN has a negative impact on the model training,\n as reported in these issues [2]_, [3]_ .\n\n Since similar behavior is observed, we implement a ``BatchNormCudnnOff`` layer as a temporary solution.\n This layer doesn't call the Batch Normalization layer from cuDNN, thus gives better results.