{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 03. Monodepth2 training on KITTI dataset\n\nThis is a tutorial of training MonoDepth2 on the KITTI dataset using Gluon CV toolkit.\nThe readers should have basic knowledge of deep learning and should be familiar with Gluon API.\nNew users may first go through `A 60-minute Gluon Crash Course `_.\nYou can `Start Training Now`_ or `Dive into Deep`_.\n\n## Start Training Now\n\n.. hint::\n\n Feel free to skip the tutorial because the training script is self-complete and ready to launch.\n\n :download:`Download Full Python Script: train.py<../../../scripts/depth/train.py>`\n\n :download:`Download Full Python Script: trainer.py<../../../scripts/depth/trainer.py>`\n\n mono+stereo mode training command::\n\n python train.py --model_zoo monodepth2_resnet18_kitti_mono_stereo_640x192 --model_zoo_pose monodepth2_resnet18_posenet_kitti_mono_stereo_640x192 --pretrained_base --frame_ids 0 -1 1 --use_stereo --log_dir ./tmp/mono_stereo/ --png --gpu 0 --batch_size 8\n\n mono mode training command::\n\n python train.py --model_zoo monodepth2_resnet18_kitti_mono_640x192 --model_zoo_pose monodepth2_resnet18_posenet_kitti_mono_640x192 --pretrained_base --log_dir ./tmp/mono/ --png --gpu 0 --batch_size 12\n\n stereo mode training command::\n\n python train.py --model_zoo monodepth2_resnet18_kitti_stereo_640x192 --pretrained_base --split eigen_full --frame_ids 0 --use_stereo --log_dir ./tmp/stereo/ --png --gpu 0 --batch_size 12\n\n For more training command options, please run ``python train.py -h``\n Please checkout the `model_zoo <../../model_zoo/depth.html>`_ for training commands of reproducing the pretrained model.\n\n## Dive into Deep\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\nimport mxnet as mx\nfrom mxnet import gluon, autograd\nimport gluoncv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Digging into Self-Supervised Monocular Depth Prediction\n\n\n\n(figure credit to `Godard et al. `_ )\n\nSelf-Supervised Monocular Depth Estimation (Monodepth2) [Godard19]_ builds a\nsimple depth model and train it with a self-supervised manner by exploiting the\nspatial geometry constrain. The key idea of Monodepth2 is that it builds a novel\nreprojection loss, include (1) a minimum reprojection loss, designed to robustly\nhandle occlusions, (2) a full-resolution multi-scale sampling method that reduces\nvisual artifacts, and (3) an auto-masking loss to ignore training pixels that violate\ncamera motion assumptions.\n\n\n"
]
},
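{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the per-pixel minimum reprojection loss and the auto-masking idea concrete, here is a minimal NumPy sketch (illustrative only; the actual training loss is implemented in ``compute_losses`` of :download:`trainer.py<../../../scripts/depth/trainer.py>` and shown later in this tutorial)::\n\n    import numpy as np\n\n    def min_reprojection_loss(reproj_losses, identity_losses):\n        # reproj_losses:   per-pixel photometric errors of the warped frames, shape (num_frames, H, W)\n        # identity_losses: errors of the unwarped frames, same shape\n        combined = np.concatenate([identity_losses, reproj_losses], axis=0)\n        # per-pixel minimum: each pixel is supervised by its best-matching frame,\n        # which makes the loss robust to occlusions\n        to_optimise = combined.min(axis=0)\n        # auto-mask: pixels where an unwarped (identity) error wins are treated as\n        # static pixels violating the camera motion assumption and are ignored\n        idxs = combined.argmin(axis=0)\n        automask = idxs >= identity_losses.shape[0]\n        return to_optimise, automask\n\n"
]
},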
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Monodepth2 Model\n\nA simple U-Net architecture is used in Monodepth2, which combines multiple scale\nfeatures with different receptive field sizes. It pools the feature maps into different sizes\nand then concatenating together after upsampling. There are two decoders for depth estimation and\ncamera pose estimation.\n\nThe encoder module is a ResNet, it accepts single RGB images as input for the depth model.\nFor the pose model, The pose encoder is modified to accept a pair of frames, or six channels, as input.\nTherefore, the pose encoder has convolutional weights in the first layer of shape 6\u00d764\u00d73\u00d73,\ninstead of the ResNet default of 3\u00d764\u00d73\u00d73. When using pre-trained weights for the pose encoder,\nthe first pre-trained filter tensor is duplicated along the channel dimension to make a filter of\nshape 6 \u00d7 64 \u00d7 3 \u00d7 3. All weights in this new expanded filter are divided by 2 to make the output of the convolution\nin the same numerical range as the original, one-image ResNet.\n\nThe encoder is defined as::\n\n class ResnetEncoder(nn.HybridBlock):\n def __init__(self, backbone, pretrained, num_input_images=1,\n root=os.path.join(os.path.expanduser('~'), '.mxnet/models'),\n ctx=cpu(), **kwargs):\n super(ResnetEncoder, self).__init__()\n\n self.num_ch_enc = np.array([64, 64, 128, 256, 512])\n\n resnets = {'resnet18': resnet18_v1b,\n 'resnet34': resnet34_v1b,\n 'resnet50': resnet50_v1s,\n 'resnet101': resnet101_v1s,\n 'resnet152': resnet152_v1s}\n\n num_layers = {'resnet18': 18,\n 'resnet34': 34,\n 'resnet50': 50,\n 'resnet101': 101,\n 'resnet152': 152}\n\n if backbone not in resnets:\n raise ValueError(\"{} is not a valid resnet\".format(backbone))\n\n if num_input_images > 1:\n self.encoder = resnets[backbone](pretrained=False, ctx=ctx, **kwargs)\n if pretrained:\n filename = os.path.join(\n root, 'resnet%d_v%db_multiple_inputs.params' % (num_layers[backbone], 1))\n if not os.path.isfile(filename):\n from ..model_store import get_model_file\n loaded = mx.nd.load(get_model_file('resnet%d_v%db' % (num_layers[backbone], 1),\n tag=pretrained, root=root))\n loaded['conv1.weight'] = mx.nd.concat(\n *([loaded['conv1.weight']] * num_input_images), dim=1) / num_input_images\n mx.nd.save(filename, loaded)\n self.encoder.load_parameters(filename, ctx=ctx)\n from ...data import ImageNet1kAttr\n attrib = ImageNet1kAttr()\n self.encoder.synset = attrib.synset\n self.encoder.classes = attrib.classes\n self.encoder.classes_long = attrib.classes_long\n else:\n self.encoder = resnets[backbone](pretrained=pretrained, ctx=ctx, **kwargs)\n\n if backbone not in ('resnet18', 'resnet34'):\n self.num_ch_enc[1:] *= 4\n\n def hybrid_forward(self, F, input_image):\n self.features = []\n x = (input_image - 0.45) / 0.225\n x = self.encoder.conv1(x)\n x = self.encoder.bn1(x)\n self.features.append(self.encoder.relu(x))\n self.features.append(self.encoder.layer1(self.encoder.maxpool(self.features[-1])))\n self.features.append(self.encoder.layer2(self.features[-1]))\n self.features.append(self.encoder.layer3(self.features[-1]))\n self.features.append(self.encoder.layer4(self.features[-1]))\n\n return self.features\n\n\nThe Decoder module is a fully convolutional network with skip architecture, it exploits the feature maps\nin a different scale and concatenating together after upsampling. 
A sigmoid activation at the last layer.\nIt bound the output to [0, 1], which means that the depth decoder outputs a normalized disparity map.\n\nIt is defined as::\n\n class DepthDecoder(nn.HybridBlock):\n def __init__(self, num_ch_enc, scales=range(4), num_output_channels=1,\n use_skips=True):\n super(DepthDecoder, self).__init__()\n\n self.num_output_channels = num_output_channels\n self.use_skips = use_skips\n self.upsample_mode = 'nearest'\n self.scales = scales\n\n self.num_ch_enc = num_ch_enc\n self.num_ch_dec = np.array([16, 32, 64, 128, 256])\n\n # decoder\n with self.name_scope():\n self.convs = OrderedDict()\n for i in range(4, -1, -1):\n # upconv_0\n num_ch_in = self.num_ch_enc[-1] if i == 4 else self.num_ch_dec[i + 1]\n num_ch_out = self.num_ch_dec[i]\n self.convs[(\"upconv\", i, 0)] = ConvBlock(num_ch_in, num_ch_out)\n\n # upconv_1\n num_ch_in = self.num_ch_dec[i]\n if self.use_skips and i > 0:\n num_ch_in += self.num_ch_enc[i - 1]\n num_ch_out = self.num_ch_dec[i]\n self.convs[(\"upconv\", i, 1)] = ConvBlock(num_ch_in, num_ch_out)\n\n for s in self.scales:\n self.convs[(\"dispconv\", s)] = Conv3x3(\n self.num_ch_dec[s], self.num_output_channels)\n\n # register blocks\n for k in self.convs:\n self.register_child(self.convs[k])\n self.decoder = nn.HybridSequential()\n self.decoder.add(*list(self.convs.values()))\n\n self.sigmoid = nn.Activation('sigmoid')\n\n def hybrid_forward(self, F, input_features):\n self.outputs = []\n\n # decoder\n x = input_features[-1]\n for i in range(4, -1, -1):\n x = self.convs[(\"upconv\", i, 0)](x)\n x = [F.UpSampling(x, scale=2, sample_type='nearest')]\n if self.use_skips and i > 0:\n x += [input_features[i - 1]]\n x = F.concat(*x, dim=1)\n x = self.convs[(\"upconv\", i, 1)](x)\n if i in self.scales:\n self.outputs.append(self.sigmoid(self.convs[(\"dispconv\", i)](x)))\n\n return self.outputs\n\nThe PoseNet Decoder module is a fully convolutional network and it predicts the rotation\nusing an axis-angle representation and scale the rotation and translation outputs by 0.01.\n\nIt is defined as::\n\n class PoseDecoder(nn.HybridBlock):\n def __init__(self, num_ch_enc, num_input_features, num_frames_to_predict_for=2, stride=1):\n super(PoseDecoder, self).__init__()\n\n self.num_ch_enc = num_ch_enc\n self.num_input_features = num_input_features\n\n if num_frames_to_predict_for is None:\n num_frames_to_predict_for = num_input_features - 1\n self.num_frames_to_predict_for = num_frames_to_predict_for\n\n self.convs = OrderedDict()\n self.convs[(\"squeeze\")] = nn.Conv2D(\n in_channels=self.num_ch_enc[-1], channels=256, kernel_size=1)\n self.convs[(\"pose\", 0)] = nn.Conv2D(\n in_channels=num_input_features * 256, channels=256,\n kernel_size=3, strides=stride, padding=1)\n self.convs[(\"pose\", 1)] = nn.Conv2D(\n in_channels=256, channels=256, kernel_size=3, strides=stride, padding=1)\n self.convs[(\"pose\", 2)] = nn.Conv2D(\n in_channels=256, channels=6 * num_frames_to_predict_for, kernel_size=1)\n\n # register blocks\n for k in self.convs:\n self.register_child(self.convs[k])\n self.net = nn.HybridSequential()\n self.net.add(*list(self.convs.values()))\n\n def hybrid_forward(self, F, input_features):\n last_features = [f[-1] for f in input_features]\n\n cat_features = [F.relu(self.convs[\"squeeze\"](f)) for f in last_features]\n cat_features = F.concat(*cat_features, dim=1)\n\n out = cat_features\n for i in range(3):\n out = self.convs[(\"pose\", i)](out)\n if i != 2:\n out = F.relu(out)\n\n out = out.mean(3).mean(2)\n\n out = 0.01 * out.reshape(-1, 
self.num_frames_to_predict_for, 1, 6)\n\n axisangle = out[..., :3]\n translation = out[..., 3:]\n\n return axisangle, translation\n\nMonodepth model is provided in :class:`gluoncv.model_zoo.MonoDepth2` and PoseNet is provide\nin :class:`gluoncv.model_zoo.MonoDepth2PoseNet`. To get Monodepth2 model using ResNet18 base network:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model = gluoncv.model_zoo.get_monodepth2(backbone='resnet18')\nprint(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get PoseNet using ResNet18 base network:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"posenet = gluoncv.model_zoo.get_monodepth2posenet(backbone='resnet18')\nprint(posenet)"
]
},
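{
"cell_type": "markdown",
"metadata": {},
"source": [
"The depth decoder shown earlier uses two small helper blocks, ``ConvBlock`` and ``Conv3x3``, which are not reproduced in this tutorial. As a rough sketch (following the original Monodepth2 design of a reflection-padded 3\u00d73 convolution followed by an ELU; the actual GluonCV implementation may differ in details), they could look like::\n\n    # assumes: from mxnet.gluon import nn\n    class Conv3x3(nn.HybridBlock):\n        \"\"\"Reflection-pad the input and apply a 3x3 convolution.\"\"\"\n        def __init__(self, in_channels, out_channels):\n            super(Conv3x3, self).__init__()\n            self.pad = nn.ReflectionPad2D(1)\n            self.conv = nn.Conv2D(in_channels=in_channels, channels=out_channels, kernel_size=3)\n\n        def hybrid_forward(self, F, x):\n            return self.conv(self.pad(x))\n\n    class ConvBlock(nn.HybridBlock):\n        \"\"\"Conv3x3 followed by an ELU non-linearity, used in every decoder stage.\"\"\"\n        def __init__(self, in_channels, out_channels):\n            super(ConvBlock, self).__init__()\n            self.conv = Conv3x3(in_channels, out_channels)\n            self.nonlin = nn.ELU()\n\n        def hybrid_forward(self, F, x):\n            return self.nonlin(self.conv(x))\n\n"
]
},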
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dataset and Data Augmentation\n\n- Prepare KITTI RAW Dataset:\n\n Here we give an example of training monodepth2 on the KITTI RAW dataset [Godard19]_. First,\n we need to prepare the dataset. The official implementation of monodepth2 does not use all\n the data of the KITTI RAW dataset, here we use the same dataset and split method as [Godard19]_.\n You need download the split zip file, and extract it to ``$(HOME)/.mxnet/datasets/kitti/``.\n\n\n Follow the command to get the dataset::\n\n cd ~\n mkdir -p .mxnet/datasets/kitti\n cd .mxnet/datasets/kitti\n wget https://github.com/KuangHaofei/GluonCV_Test/raw/master/monodepthv2/tutorials/splits.zip\n unzip splits.zip\n wget -i splits/kitti_archives_to_download.txt -P kitti_data/\n cd kitti_data\n unzip \"*.zip\"\n\n .. hint::\n\n You need 175GB, free disk space to download and extract this dataset. SSD harddrives are recommended\n for faster speed. The time it takes to prepare the dataset depends on your Internet connection and\n disk speed. For example, it takes around 2 hours on an AWS EC2 instance with EBS.\n\nWe provide self-supervised depth estimation datasets in :class:`gluoncv.data`.\n\nFor example, we can easily get the KITTI RAW Stereo dataset::\n\n import os\n from gluoncv.data.kitti import readlines, dict_batchify_fn\n\n train_filenames = os.path.join(\n os.path.expanduser(\"~\"), '.mxnet/datasets/kitti/splits/eigen_full/train_files.txt')\n train_filenames = readlines(train_filenames)\n train_dataset = gluoncv.data.KITTIRAWDataset(\n filenames=train_filenames, height=192, width=640,\n frame_idxs=[0, -1, 1, \"s\"], num_scales=4, is_train=True, img_ext='.png')\n print('Training images:', len(train_dataset))\n # set batch_size = 12 for toy example\n batch_size = 12\n train_loader = gluon.data.DataLoader(\n train_dataset, batch_size=batch_size, shuffle=True, batchify_fn=dict_batchify_fn,\n num_workers=12, pin_memory=True, last_batch='discard')\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, the ``frame_idxs`` argument is used to decide the input frame. It is a list and the first element\nmust be 0 means source frame. Other elements mean target frames. Numerical values represent relative frame id in\nimage sequences. \"s\" means another side of the source image upon stereo pairs.\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Data Augmentation\n\n We follow the standard data augmentation routine to transform the input image.\n Here, we just use RandomFlip with 50% probability for input images.\n\nRandom pick one example for visualization::\n\n import random\n from datetime import datetime\n random.seed(datetime.now())\n idx = random.randint(0, len(train_dataset))\n\n data = train_dataset[idx]\n input_img = data[(\"color\", 0, 0)]\n input_stereo_img = data[(\"color\", 's', 0)]\n input_gt = data['depth_gt']\n\n input_img = np.transpose((input_img.asnumpy() * 255).astype(np.uint8), (1, 2, 0))\n input_stereo_img = np.transpose((input_stereo_img.asnumpy() * 255).astype(np.uint8), (1, 2, 0))\n input_gt = np.transpose((input_gt.asnumpy()).astype(np.uint8), (1, 2, 0))\n\n from PIL import Image\n input_img = Image.fromarray(input_img)\n input_stereo_img = Image.fromarray(input_stereo_img)\n input_gt = Image.fromarray(input_gt[:, :, 0])\n\n input_img.save(\"input_img.png\")\n input_stereo_img.save(\"input_stereo_img.png\")\n input_gt.save(\"input_gt.png\")\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the stereo image pairs and ground truth of the left image::\n\n from matplotlib import pyplot as plt\n\n input_img = Image.open('input_img.png').convert('RGB')\n input_stereo_img = Image.open('input_stereo_img.png').convert('RGB')\n input_gt = Image.open('input_gt.png')\n\n fig = plt.figure()\n # subplot 1 for left image\n plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=0.75)\n fig.add_subplot(3, 1, 1)\n plt.title(\"left image\")\n plt.imshow(input_img)\n # subplot 2 for right images\n fig.add_subplot(3, 1, 2)\n plt.title(\"right image\")\n plt.imshow(input_stereo_img)\n # subplot 3 for the ground truth\n fig.add_subplot(3, 1, 3)\n plt.title(\"ground truth of left input (the reprojection of LiDAR data)\")\n plt.imshow(input_gt)\n # display\n plt.show()\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Dataloader will provide a dictionary which includes raw images, augmented images, camera intrinsics,\ncamera extrinsic (stereo), and ground truth depth maps (for validation).\n\n"
]
},
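{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a minimal sketch of how one could peek at a single batch, assuming the ``train_loader`` built above (the exact set of keys depends on ``frame_idxs`` and ``num_scales``)::\n\n    for inputs in train_loader:\n        # each batch is a dict; keys are tuples such as (\"color\", frame_id, scale),\n        # plus camera intrinsics and, for validation, the ground-truth depth\n        for key, value in inputs.items():\n            print(key, value.shape)\n        break\n\n"
]
},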
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training Details\n- Predict Camera Pose:\n\n When training network with mono or mono+stereo mode, we have to get the predicted camera pose through PoseNet.\n\nThe prediction of loss is defined as\n(Please check out the full :download:`trainer.py<../../../scripts/depth/trainer.py>` for complete implementation.)::\n\n def predict_poses(self, inputs):\n outputs = {}\n\n pose_feats = {f_i: inputs[\"color_aug\", f_i, 0] for f_i in self.opt.frame_ids}\n\n for f_i in self.opt.frame_ids[1:]:\n if f_i != \"s\":\n # To maintain ordering we always pass frames in temporal order\n if f_i < 0:\n pose_inputs = [pose_feats[f_i], pose_feats[0]]\n else:\n pose_inputs = [pose_feats[0], pose_feats[f_i]]\n\n axisangle, translation = self.posenet(mx.nd.concat(*pose_inputs, dim=1))\n outputs[(\"axisangle\", 0, f_i)] = axisangle\n outputs[(\"translation\", 0, f_i)] = translation\n\n # Invert the matrix if the frame id is negative\n outputs[(\"cam_T_cam\", 0, f_i)] = transformation_from_parameters(\n axisangle[:, 0], translation[:, 0], invert=(f_i < 0))\n\n return outputs\n\n- Image Reconstruction:\n\n For training the network via self-supervised manner, we have to reconstruct a source image from target image\n according to predicted depth and pose (or use camera extrinsic of stereo pairs). Then, calculating reprojection\n photometric loss between the reconstructed source image with the real source image.\n\n\nThe whole process is divided into three steps,\n\n1. To back project each point of the target image to 3D space according to depth and camera intrinsic;\n\n2. To project 3D points to image plane according to camera extrinsic (pose) and intrinsic;\n\n3. Sampling pixels from the source image to reconstruct a new image according to the projected points (exploit Spatial Transformer Networks (STN) to ensure that the sampling is differentiable).\n\n\nBack projection (2D to 3D) is defined as::\n\n class BackprojectDepth(nn.HybridBlock):\n \"\"\"Layer to transform a depth image into a point cloud\n \"\"\"\n\n def __init__(self, batch_size, height, width, ctx=mx.cpu()):\n super(BackprojectDepth, self).__init__()\n\n self.batch_size = batch_size\n self.height = height\n self.width = width\n\n self.ctx = ctx\n\n meshgrid = np.meshgrid(range(self.width), range(self.height), indexing='xy')\n id_coords = np.stack(meshgrid, axis=0).astype(np.float32)\n id_coords = mx.nd.array(id_coords).as_in_context(self.ctx)\n\n pix_coords = mx.nd.expand_dims(mx.nd.stack(*[id_coords[0].reshape(-1),\n id_coords[1].reshape(-1)], axis=0),\n axis=0)\n pix_coords = pix_coords.repeat(repeats=batch_size, axis=0)\n pix_coords = pix_coords.as_in_context(self.ctx)\n\n with self.name_scope():\n self.id_coords = self.params.get('id_coords', shape=id_coords.shape,\n init=mx.init.Zero(), grad_req='null')\n self.id_coords.initialize(ctx=self.ctx)\n self.id_coords.set_data(mx.nd.array(id_coords))\n\n self.ones = self.params.get('ones',\n shape=(self.batch_size, 1, self.height * self.width),\n init=mx.init.One(), grad_req='null')\n self.ones.initialize(ctx=self.ctx)\n\n self.pix_coords = self.params.get('pix_coords',\n shape=(self.batch_size, 3, self.height * self.width),\n init=mx.init.Zero(), grad_req='null')\n self.pix_coords.initialize(ctx=self.ctx)\n self.pix_coords.set_data(mx.nd.concat(pix_coords, self.ones.data(), dim=1))\n\n def hybrid_forward(self, F, depth, inv_K, **kwargs):\n cam_points = F.batch_dot(inv_K[:, :3, :3], self.pix_coords.data())\n cam_points = depth.reshape(self.batch_size, 1, -1) * cam_points\n cam_points = 
F.concat(cam_points, self.ones.data(), dim=1)\n\n return cam_points\n\n\nProjection (3D to 2D) is defined as::\n\n class Project3D(nn.HybridBlock):\n \"\"\"Layer which projects 3D points into a camera with intrinsics K and at position T\n \"\"\"\n\n def __init__(self, batch_size, height, width, eps=1e-7):\n super(Project3D, self).__init__()\n\n self.batch_size = batch_size\n self.height = height\n self.width = width\n self.eps = eps\n\n def hybrid_forward(self, F, points, K, T):\n P = F.batch_dot(K, T)[:, :3, :]\n\n cam_points = F.batch_dot(P, points)\n\n cam_pix = cam_points[:, :2, :] / (cam_points[:, 2, :].expand_dims(1) + self.eps)\n cam_pix = cam_pix.reshape(self.batch_size, 2, self.height, self.width)\n\n x_src = cam_pix[:, 0, :, :] / (self.width - 1)\n y_src = cam_pix[:, 1, :, :] / (self.height - 1)\n pix_coords = F.concat(x_src.expand_dims(1), y_src.expand_dims(1), dim=1)\n pix_coords = (pix_coords - 0.5) * 2\n\n return pix_coords\n\n\nThe image reconstruction function is defined as\n(Please check out the full :download:`trainer.py<../../../scripts/depth/trainer.py>` for complete implementation.)::\n\n def generate_images_pred(self, inputs, outputs):\n for scale in self.opt.scales:\n disp = outputs[(\"disp\", scale)]\n if self.opt.v1_multiscale:\n source_scale = scale\n else:\n disp = mx.nd.contrib.BilinearResize2D(disp,\n height=self.opt.height,\n width=self.opt.width)\n source_scale = 0\n\n _, depth = disp_to_depth(disp, self.opt.min_depth, self.opt.max_depth)\n outputs[(\"depth\", 0, scale)] = depth\n\n for i, frame_id in enumerate(self.opt.frame_ids[1:]):\n\n if frame_id == \"s\":\n T = inputs[\"stereo_T\"]\n else:\n T = outputs[(\"cam_T_cam\", 0, frame_id)]\n\n cam_points = self.backproject_depth[source_scale](depth,\n inputs[(\"inv_K\", source_scale)])\n pix_coords = self.project_3d[source_scale](cam_points,\n inputs[(\"K\", source_scale)],\n T)\n\n outputs[(\"sample\", frame_id, scale)] = pix_coords\n\n outputs[(\"color\", frame_id, scale)] = mx.nd.BilinearSampler(\n data=inputs[(\"color\", frame_id, source_scale)],\n grid=outputs[(\"sample\", frame_id, scale)],\n name='sampler')\n\n if not self.opt.disable_automasking:\n outputs[(\"color_identity\", frame_id, scale)] = \\\n inputs[(\"color\", frame_id, source_scale)]\n\n\n- Training Losses:\n\n We apply a standard reprojection loss to train Monodepth2.\n As describes in Monodepth2 [Godard19]_ , the reprojection loss includes three parts:\n a multi-scale reprojection photometric loss (combined L1 loss and SSIM loss), an auto-masking loss and\n an edge-aware smoothness loss as in Monodepth [Godard17]_ .\n\nThe computation of loss is defined as\n(Please checkout the full :download:`trainer.py<../../../scripts/depth/trainer.py>` for complete implementation.)::\n\n def compute_losses(self, inputs, outputs):\n \"\"\"Compute the reprojection and smoothness losses for a minibatch\n \"\"\"\n losses = {}\n total_loss = 0\n\n for scale in self.opt.scales:\n loss = 0\n reprojection_losses = []\n\n if self.opt.v1_multiscale:\n source_scale = scale\n else:\n source_scale = 0\n\n disp = outputs[(\"disp\", scale)]\n color = inputs[(\"color\", 0, scale)]\n target = inputs[(\"color\", 0, source_scale)]\n\n for frame_id in self.opt.frame_ids[1:]:\n pred = outputs[(\"color\", frame_id, scale)]\n reprojection_losses.append(self.compute_reprojection_loss(pred, target))\n\n reprojection_losses = mx.nd.concat(*reprojection_losses, dim=1)\n\n if not self.opt.disable_automasking:\n identity_reprojection_losses = []\n for frame_id in 
self.opt.frame_ids[1:]:\n pred = inputs[(\"color\", frame_id, source_scale)]\n identity_reprojection_losses.append(\n self.compute_reprojection_loss(pred, target))\n\n identity_reprojection_losses = mx.nd.concat(*identity_reprojection_losses, dim=1)\n\n if self.opt.avg_reprojection:\n identity_reprojection_loss = \\\n identity_reprojection_losses.mean(axis=1, keepdims=True)\n else:\n # save both images, and do min all at once below\n identity_reprojection_loss = identity_reprojection_losses\n\n if self.opt.avg_reprojection:\n reprojection_loss = reprojection_losses.mean(axis=1, keepdims=True)\n else:\n reprojection_loss = reprojection_losses\n\n if not self.opt.disable_automasking:\n # add random numbers to break ties\n identity_reprojection_loss = \\\n identity_reprojection_loss + \\\n mx.nd.random.randn(*identity_reprojection_loss.shape).as_in_context(\n identity_reprojection_loss.context) * 0.00001\n\n combined = mx.nd.concat(identity_reprojection_loss, reprojection_loss, dim=1)\n else:\n combined = reprojection_loss\n\n if combined.shape[1] == 1:\n to_optimise = combined\n else:\n to_optimise = mx.nd.min(data=combined, axis=1)\n idxs = mx.nd.argmin(data=combined, axis=1)\n\n if not self.opt.disable_automasking:\n outputs[\"identity_selection/{}\".format(scale)] = (\n idxs > identity_reprojection_loss.shape[1] - 1).astype('float')\n\n loss += to_optimise.mean()\n\n mean_disp = disp.mean(axis=2, keepdims=True).mean(axis=3, keepdims=True)\n norm_disp = disp / (mean_disp + 1e-7)\n\n smooth_loss = get_smooth_loss(norm_disp, color)\n\n loss = loss + self.opt.disparity_smoothness * smooth_loss / (2 ** scale)\n total_loss = total_loss + loss\n losses[\"loss/{}\".format(scale)] = loss\n\n total_loss = total_loss / self.num_scales\n losses[\"loss\"] = total_loss\n return losses\n\n"
]
},
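{
"cell_type": "markdown",
"metadata": {},
"source": [
"The helper ``compute_reprojection_loss`` used above combines an SSIM term and an L1 term. Below is a minimal sketch of what it could look like, following the 0.85/0.15 weighting described in [Godard19]_ (the exact implementation lives in :download:`trainer.py<../../../scripts/depth/trainer.py>` and may differ)::\n\n    def ssim(x, y):\n        # simplified SSIM with 3x3 average pooling over reflection-padded inputs\n        C1, C2 = 0.01 ** 2, 0.03 ** 2\n\n        def pool(t):\n            t = mx.nd.pad(t, mode='reflect', pad_width=(0, 0, 0, 0, 1, 1, 1, 1))\n            return mx.nd.Pooling(t, kernel=(3, 3), pool_type='avg', stride=(1, 1))\n\n        mu_x, mu_y = pool(x), pool(y)\n        sigma_x = pool(x * x) - mu_x ** 2\n        sigma_y = pool(y * y) - mu_y ** 2\n        sigma_xy = pool(x * y) - mu_x * mu_y\n        ssim_n = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)\n        ssim_d = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)\n        return mx.nd.clip((1 - ssim_n / ssim_d) / 2, 0, 1)\n\n    def compute_reprojection_loss(pred, target):\n        # photometric error: weighted sum of SSIM and L1, averaged over channels\n        l1_loss = mx.nd.abs(target - pred).mean(axis=1, keepdims=True)\n        ssim_loss = ssim(pred, target).mean(axis=1, keepdims=True)\n        return 0.85 * ssim_loss + 0.15 * l1_loss\n\n"
]
},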
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Learning Rate and Scheduling:\n\n Here, we follow the standard strategy of monodepth2. The network is trained for 20 epochs using Adam.\n We use a 'step' learning rate scheduler for Monodepth2 training, provided in :class:`gluoncv.utils.LRScheduler`.\n We use a learning rate of 10\u22124 for the first 15 epochs which is then dropped to 10\u22125 for the remainder.\n\nThe example of optimization is defined as::\n\n lr_scheduler = gluoncv.utils.LRSequential([\n gluoncv.utils.LRScheduler(\n 'step', base_lr=1e-4, nepochs=20, iters_per_epoch=len(train_dataset), step_epoch=[15])\n ])\n optimizer_params = {'lr_scheduler': lr_scheduler,\n 'learning_rate': 1e-4}\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Create Adam solver\n\nThe example for depth & pose optimizer are defined as::\n\n depth_optimizer = gluon.Trainer(model.collect_params(), 'adam', optimizer_params)\n pose_optimizer = gluon.Trainer(posenet.collect_params(), 'adam', optimizer_params)\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The training loop\n\nPlease checkout the full :download:`trainer.py<../../../scripts/depth/trainer.py>` for complete implementation.\nThis is an example of training loop::\n\n def train(self):\n \"\"\"Run the entire training pipeline\n \"\"\"\n self.logger.info('Starting Epoch: %d' % self.opt.start_epoch)\n self.logger.info('Total Epochs: %d' % self.opt.num_epochs)\n\n self.epoch = 0\n for self.epoch in range(self.opt.start_epoch, self.opt.num_epochs):\n self.run_epoch()\n self.val()\n\n # save final model\n self.save_model(\"final\")\n self.save_model(\"best\")\n\n\n def run_epoch(self):\n \"\"\"Run a single epoch of training and validation\n \"\"\"\n print(\"Training\")\n tbar = tqdm(self.train_loader)\n train_loss = 0.0\n for batch_idx, inputs in enumerate(tbar):\n with autograd.record(True):\n outputs, losses = self.process_batch(inputs)\n mx.nd.waitall()\n\n autograd.backward(losses['loss'])\n self.depth_optimizer.step(self.opt.batch_size, ignore_stale_grad=True)\n\n if self.use_pose_net:\n self.pose_optimizer.step(self.opt.batch_size, ignore_stale_grad=True)\n\n train_loss += losses['loss'].asscalar()\n tbar.set_description('Epoch %d, training loss %.3f' % \\\n (self.epoch, train_loss / (batch_idx + 1)))\n\n if batch_idx % self.opt.log_frequency == 0:\n self.logger.info('Epoch %d iteration %04d/%04d: training loss %.3f' %\n (self.epoch, batch_idx, len(self.train_loader),\n train_loss / (batch_idx + 1)))\n mx.nd.waitall()\n\n\n\n"
]
},
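{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ``process_batch`` call above ties the pieces together: it runs the depth and pose networks, reconstructs the images, and computes the losses. Below is a rough sketch only; the output format of the depth network and attribute names such as ``self.device`` and ``self.model`` are assumptions here, see :download:`trainer.py<../../../scripts/depth/trainer.py>` for the real implementation::\n\n    def process_batch(self, inputs):\n        # move the batch to the training context (attribute name assumed)\n        for key, ipt in inputs.items():\n            inputs[key] = ipt.as_in_context(self.device)\n\n        # 1. depth network: multi-scale disparities for frame 0\n        #    (assuming the model returns one disparity map per scale, coarsest first,\n        #     as in the DepthDecoder shown earlier)\n        disps = self.model(inputs[(\"color_aug\", 0, 0)])\n        outputs = {(\"disp\", s): d for s, d in zip(self.opt.scales, reversed(disps))}\n\n        # 2. pose network: relative camera motion for every target frame\n        if self.use_pose_net:\n            outputs.update(self.predict_poses(inputs))\n\n        # 3. warp the other frames into frame 0 and compute the losses\n        self.generate_images_pred(inputs, outputs)\n        losses = self.compute_losses(inputs, outputs)\n\n        return outputs, losses\n\n"
]
},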
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can `Start Training Now`_.\n\n### References\n.. [Godard17] Clement Godard, Oisin Mac Aodha and Gabriel J. Brostow \\\n \"Unsupervised Monocular Depth Estimation with Left-Right Consistency.\" \\\n Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 2017.\n\n.. [Godard19] Clement Godard, Oisin Mac Aodha, Michael Firman and Gabriel Brostow. \\\n \"Digging Into Self-Supervised Monocular Depth Estimation.\" \\\n Proceedings of the IEEE conference on computer vision (ICCV). 2019.\n\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}