Action Recognition

Table of pre-trained models for video action recognition and their performance.

Hint

The training commands listed in the tables below work with the script train_recognizer.py.

A model can have several sets of trained parameters, each identified by a hashtag. Parameters with a grey name can be downloaded by passing the corresponding hashtag.

  • Download default pretrained weights: net = get_model('inceptionv3_ucf101', pretrained=True)

  • Download weights given a hashtag: net = get_model('inceptionv3_ucf101', pretrained='0c453da8')

The test script test_recognizer.py can be used to evaluate the models.
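The two download options above can be wrapped in a small helper. This is a minimal sketch, not part of the official scripts: the names UCF101_HASHTAGS, pretrained_argument and load_recognizer are hypothetical, the hashtags are copied from the UCF101 table on this page, and the import path gluoncv.model_zoo.get_model is assumed because the page does not show it.

```python
# Hashtags copied from the UCF101 table; keys are (model name, segments).
# The dictionary and both helpers are hypothetical names used for illustration.
UCF101_HASHTAGS = {
    ("vgg16_ucf101", 3): "d6dc1bba",
    ("vgg16_ucf101", 1): "05e319d4",
    ("inceptionv3_ucf101", 3): "13ef5c3b",
    ("inceptionv3_ucf101", 1): "0c453da8",
}


def pretrained_argument(name, segments=None):
    """Return the value to pass as `pretrained`.

    With no segment count, return True to request the default weights.
    Otherwise return the hashtag of the matching table row.
    """
    if segments is None:
        return True
    return UCF101_HASHTAGS[(name, segments)]


def load_recognizer(name, segments=None):
    """Load a pre-trained recognizer through get_model."""
    # Imported lazily so pretrained_argument works without GluonCV installed.
    from gluoncv.model_zoo import get_model  # assumed import path
    return get_model(name, pretrained=pretrained_argument(name, segments))
```

Calling load_recognizer('inceptionv3_ucf101', segments=1) would then request the weights tagged 0c453da8, while omitting segments requests the default weights.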

UCF101 Dataset

The following table lists pre-trained models trained on UCF101.

Note

Our pre-trained models reproduce results from “Temporal Segment Networks” [2] and “Inflated 3D Networks (I3D)” [3]. See the referenced papers for further information.

The top-1 accuracy shown below is for official split 1 of the UCF101 dataset, not the average over the 3 splits.

InceptionV3 is trained and evaluated with an input size of 299x299.

K400 denotes the Kinetics400 dataset: the model is initialized with weights pretrained on Kinetics400.
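For readers unfamiliar with the metric in the Top-1 column, the following sketch shows how top-1 accuracy is commonly computed: a video counts as correct when its highest-scoring class equals the ground-truth label. The function name top1_accuracy and the toy scores are illustrative assumptions, not code from test_recognizer.py.

```python
def top1_accuracy(scores, labels):
    """Return top-1 accuracy in percent.

    scores: one list of per-class scores for each video.
    labels: the ground-truth class index for each video.
    """
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the highest score is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


# Toy example with three videos and two classes (made-up numbers).
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
toy_labels = [1, 0, 0]
accuracy = top1_accuracy(toy_scores, toy_labels)  # two of three are correct
```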

| Name | Pretrained | Segments | Clip Length | Top-1 | Hashtag | Train Command | Train Log |
|---|---|---|---|---|---|---|---|
| vgg16_ucf101 [2] | ImageNet | 3 | 1 | 83.4 | d6dc1bba | shell script | log |
| vgg16_ucf101 [1] | ImageNet | 1 | 1 | 81.5 | 05e319d4 | shell script | log |
| inceptionv3_ucf101 [2] | ImageNet | 3 | 1 | 88.1 | 13ef5c3b | shell script | log |
| inceptionv3_ucf101 [1] | ImageNet | 1 | 1 | 85.6 | 0c453da8 | shell script | log |
| i3d_resnet50_v1_ucf101 [3] | ImageNet | 1 | 32 (64/2) | 83.9 | 7afc7286 | shell script | log |
| i3d_resnet50_v1_ucf101 [3] | ImageNet, K400 | 1 | 32 (64/2) | 95.4 | 760d0981 | shell script | log |

HMDB51 Dataset

The following table lists pre-trained models trained on HMDB51.

Note

Our pre-trained models reproduce results from “Temporal Segment Networks” [2] and “Inflated 3D Networks (I3D)” [3]. See the referenced papers for further information.

The top-1 accuracy shown below is for official split 1 of the HMDB51 dataset, not the average over the 3 splits.

| Name | Pretrained | Segments | Clip Length | Top-1 | Hashtag | Train Command | Train Log |
|---|---|---|---|---|---|---|---|
| resnet50_v1b_hmdb51 [2] | ImageNet | 3 | 1 | 55.2 | 682591e2 | shell script | log |
| resnet50_v1b_hmdb51 [1] | ImageNet | 1 | 1 | 52.2 | ba66ee4b | shell script | log |
| i3d_resnet50_v1_hmdb51 [3] | ImageNet | 1 | 32 (64/2) | 48.5 | 0d0ad559 | shell script | log |
| i3d_resnet50_v1_hmdb51 [3] | ImageNet, K400 | 1 | 32 (64/2) | 70.9 | 2ec6bf01 | shell script | log |

Kinetics400 Dataset

The following table lists pre-trained models trained on Kinetics400.

Note

Our pre-trained models reproduce results from “Temporal Segment Networks (TSN)” [2], “Inflated 3D Networks (I3D)” [3] and “Non-local Neural Networks” [4]. See the referenced papers for further information.

InceptionV3 is trained and evaluated with an input size of 299x299.

Clip Length is the number of frames within an input clip. 32 (64/2) means the clip contains 32 frames, obtained by randomly selecting 64 consecutive frames from the video and keeping every other frame. This strategy is widely adopted to reduce computation and memory cost.
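The 32 (64/2) sampling described above can be sketched as follows. This is an illustration under stated assumptions, not the data loader used by train_recognizer.py: the function name sample_clip_indices and its parameters are hypothetical, and videos shorter than the required span are simply rejected here, whereas a real loader may loop or pad instead.

```python
import random


def sample_clip_indices(num_frames, clip_length=32, stride=2, rng=random):
    """Pick frame indices for one clip.

    A window of clip_length * stride consecutive frames is chosen at a
    random position, then every stride-th frame inside it is kept.
    With the defaults this is the "32 (64/2)" setting.
    """
    span = clip_length * stride  # 64 consecutive frames by default
    if num_frames < span:
        raise ValueError("video is shorter than the sampling window")
    # randint is inclusive on both ends, so the window always fits.
    start = rng.randint(0, num_frames - span)
    return list(range(start, start + span, stride))


# Example: a 300-frame video yields 32 indices spaced 2 frames apart.
indices = sample_clip_indices(300, rng=random.Random(0))
```

Passing clip_length=16 gives the 16 (32/2) setting listed for Something-Something-V2.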

| Name | Pretrained | Segments | Clip Length | Top-1 | Hashtag | Train Command | Train Log |
|---|---|---|---|---|---|---|---|
| inceptionv3_kinetics400 [2] | ImageNet | 3 | 1 | 72.5 | 8a4a6946 | shell script | log |
| resnet18_v1b_kinetics400 [2] | ImageNet | 7 | 1 | 66.4 | 9d5cf9ec | shell script | log |
| resnet34_v1b_kinetics400 [2] | ImageNet | 7 | 1 | 69.5 | b91fcb2f | shell script | log |
| resnet50_v1b_kinetics400 [2] | ImageNet | 7 | 1 | 70.6 | e3ad0758 | shell script | log |
| resnet101_v1b_kinetics400 [2] | ImageNet | 7 | 1 | 71.5 | f0a8dcb0 | shell script | log |
| resnet152_v1b_kinetics400 [2] | ImageNet | 7 | 1 | 72.3 | 1968220d | shell script | log |
| i3d_inceptionv1_kinetics400 [3] | ImageNet | 1 | 32 (64/2) | 71.7 | f36bdeed | shell script | log |
| i3d_inceptionv3_kinetics400 [3] | ImageNet | 1 | 32 (64/2) | 73.3 | bbd4185a | shell script | log |
| i3d_resnet50_v1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 73.6 | 254ae7d9 | shell script | log |
| i3d_resnet101_v1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 74.8 | c5721407 | shell script | log |
| i3d_nl5_resnet50_v1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 73.9 | 382433ba | shell script | log |
| i3d_nl10_resnet50_v1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 74.5 | 26b41dd6 | shell script | log |
| i3d_nl5_resnet101_v1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 75.2 | 8b25d02f | shell script | log |
| i3d_nl10_resnet101_v1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 75.3 | 77d7ed77 | shell script | log |

Something-Something-V2 Dataset

The following table lists pre-trained models trained on Something-Something-V2.

Note

Our pre-trained models reproduce results from “Temporal Segment Networks (TSN)” [2] and “Inflated 3D Networks (I3D)” [3]. See the referenced papers for further information.

| Name | Pretrained | Segments | Clip Length | Top-1 | Hashtag | Train Command | Train Log |
|---|---|---|---|---|---|---|---|
| resnet50_v1b_sthsthv2 [2] | ImageNet | 8 | 1 | 35.5 | 80ee0c6b | shell script | log |
| i3d_resnet50_v1_sthsthv2 [3] | ImageNet | 1 | 16 (32/2) | 50.6 | 01961e4c | shell script | log |

[1] Limin Wang, Yuanjun Xiong, Zhe Wang and Yu Qiao. “Towards Good Practices for Very Deep Two-Stream ConvNets.” arXiv preprint arXiv:1507.02159, 2015.

[2] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang and Luc Van Gool. “Temporal Segment Networks: Towards Good Practices for Deep Action Recognition.” In European Conference on Computer Vision (ECCV), 2016.

[3] Joao Carreira and Andrew Zisserman. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” In Computer Vision and Pattern Recognition (CVPR), 2017.

[4] Xiaolong Wang, Ross Girshick, Abhinav Gupta and Kaiming He. “Non-local Neural Networks.” In Computer Vision and Pattern Recognition (CVPR), 2018.