3. Getting Started with Pre-trained I3D Models on Kinetics400

Kinetics400 is an action recognition dataset of realistic action videos collected from YouTube. With 306,245 short trimmed videos from 400 action categories, it is one of the largest and most widely used datasets in the research community for benchmarking state-of-the-art video action recognition models.

I3D (Inflated 3D Networks) is a widely adopted 3D video classification network. It uses 3D convolutions to learn spatiotemporal information directly from videos. I3D was proposed to improve over C3D (Convolutional 3D Networks) by inflating 2D models into 3D. This lets us not only reuse the architectures of 2D models (e.g., ResNet, Inception), but also bootstrap the 3D model weights from 2D pretrained models. In this manner, training 3D networks for video classification becomes feasible and achieves much better results.
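To make the inflation idea concrete, here is a minimal NumPy sketch (our own illustration, not GluonCV's internal code): a pretrained 2D kernel is tiled along a new temporal axis and rescaled, so that, ignoring boundary effects, the inflated 3D network initially produces the same response on a video of identical repeated frames as the 2D network did on a single frame.

import numpy as np

# a hypothetical pretrained 2D conv kernel: (out_channels, in_channels, k, k)
w2d = np.random.randn(64, 3, 7, 7).astype('float32')

# inflate to 3D: tile along a new temporal axis of size t, then divide by t,
# so that summing over time recovers the original 2D response
t = 7
w3d = np.repeat(w2d[:, :, np.newaxis, :, :], t, axis=2) / t
print(w3d.shape)  # (64, 3, 7, 7, 7)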

In this tutorial, we will demonstrate how to load a pre-trained I3D model from gluoncv-model-zoo and classify a video clip from the Internet or your local disk into one of the 400 action classes.

Step by Step

We will try out a pre-trained I3D model on a single video clip.

First, please follow the installation guide to install MXNet and GluonCV if you haven’t done so yet.
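If you need a starting point, a CPU-only setup can typically be installed from PyPI with something like the following (pick the MXNet build that matches your platform and CUDA version):

pip install mxnet gluoncv decord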

import matplotlib.pyplot as plt
import numpy as np
import mxnet as mx
from mxnet import gluon, nd, image
from mxnet.gluon.data.vision import transforms
from gluoncv.data.transforms import video
from gluoncv import utils
from gluoncv.model_zoo import get_model

Then, we download the video and extract a 32-frame clip from it.

from gluoncv.utils.filesystem import try_import_decord
decord = try_import_decord()

url = 'https://github.com/bryanyzhu/tiny-ucf101/raw/master/abseiling_k400.mp4'
video_fname = utils.download(url)
vr = decord.VideoReader(video_fname)
# sample 32 frames: every other frame from the first 64 (stride 2)
frame_id_list = range(0, 64, 2)
video_data = vr.get_batch(frame_id_list).asnumpy()
# split the batch into a list of per-frame arrays for the transform
clip_input = [video_data[vid, :, :, :] for vid, _ in enumerate(frame_id_list)]
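At this point video_data is a NumPy array in frames x height x width x channels order, and clip_input is a list of 32 per-frame arrays; a quick check (the spatial size depends on the source video) could be:

print(video_data.shape)  # (32, height, width, 3)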

Now we define transformations for the video clip. This transformation function does three things: center crop each frame to 224x224 in size, transpose the clip to num_channels x num_frames x height x width, and normalize it with the mean and standard deviation calculated across all ImageNet images.

transform_fn = video.VideoGroupValTransform(size=224, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
clip_input = transform_fn(clip_input)
# stack the 32 transformed frames: (num_frames, channels, height, width)
clip_input = np.stack(clip_input, axis=0)
# add a batch dimension: (batch, num_frames, channels, height, width)
clip_input = clip_input.reshape((-1,) + (32, 3, 224, 224))
# swap to the layout the network expects: (batch, channels, num_frames, height, width)
clip_input = np.transpose(clip_input, (0, 2, 1, 3, 4))
print('Video data is downloaded and preprocessed.')

Out:

Video data is downloaded and preprocessed.
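It is worth verifying the tensor layout at this point; given the reshape and transpose above, the clip should now be batch x channels x frames x height x width:

print(clip_input.shape)  # expected: (1, 3, 32, 224, 224)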

Next, we load a pre-trained I3D model.

model_name = 'i3d_inceptionv1_kinetics400'
net = get_model(model_name, nclass=400, pretrained=True)
print('%s model is successfully loaded.' % model_name)

Out:

Downloading /root/.mxnet/models/i3d_inceptionv1_kinetics400-81e0be10.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/i3d_inceptionv1_kinetics400-81e0be10.zip...

51278KB [00:01, 48589.43KB/s]
i3d_inceptionv1_kinetics400 model is successfully loaded.
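If you are unsure which model names are available, you can enumerate the model zoo; a minimal sketch, assuming get_model_list is exposed by your GluonCV version:

from gluoncv.model_zoo import get_model_list

# print every I3D variant registered in the model zoo
print([name for name in get_model_list() if 'i3d' in name])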

Note that if you want to use an InceptionV3 series model (e.g., i3d_inceptionv3_kinetics400), please resize the frames so that both dimensions are larger than 299 (e.g., 340x450) and change the input size from 224 to 299 in the transform function, as sketched below.
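Here is a hedged sketch of that adjustment (it assumes decord's VideoReader width/height arguments for on-the-fly resizing; 450x340 is just one choice with both sides above 299):

vr = decord.VideoReader(video_fname, width=450, height=340)
video_data = vr.get_batch(frame_id_list).asnumpy()
clip_input = [video_data[vid, :, :, :] for vid, _ in enumerate(frame_id_list)]

# crop at 299 instead of 224 for the InceptionV3 backbone
transform_fn = video.VideoGroupValTransform(size=299, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
clip_input = transform_fn(clip_input)
clip_input = np.stack(clip_input, axis=0)
clip_input = clip_input.reshape((-1,) + (32, 3, 299, 299))
clip_input = np.transpose(clip_input, (0, 2, 1, 3, 4))

Finally, we prepare the video clip and feed it to the model.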

# run the clip through the network; pred holds raw class scores (logits)
pred = net(nd.array(clip_input))

classes = net.classes
topK = 5
# indices of the top-5 scoring classes for the first (and only) clip in the batch
ind = nd.topk(pred, k=topK)[0].astype('int')
print('The input video clip is classified to be')
for i in range(topK):
    # softmax converts the raw scores into probabilities
    print('\t[%s], with probability %.3f.'%
          (classes[ind[i].asscalar()], nd.softmax(pred)[0][ind[i]].asscalar()))

Out:

The input video clip is classified to be
        [abseiling], with probability 0.991.
        [rock_climbing], with probability 0.009.
        [ice_climbing], with probability 0.000.
        [paragliding], with probability 0.000.
        [skydiving], with probability 0.000.

We can see that our pre-trained model predicts this video clip to be the abseiling action with high confidence.
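The same pipeline works for a clip on your local disk, as promised in the introduction: point decord at your own file (the path below is a placeholder) and rerun the preprocessing and prediction steps above.

vr = decord.VideoReader('/path/to/your_video.mp4')  # placeholder: substitute your own video file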

Next Step

If you would like to dive deeper into training I3D models on Kinetics400, feel free to read the next tutorial.
