Action Recognition
MXNet
Here is the model zoo for the video action recognition task. We first show a visualization in the graph below, describing inference throughput vs. validation accuracy of Kinetics400 pre-trained models.
Hint
Training commands work with this script: Download train_recognizer.py
A model can have differently trained parameters with different hashtags. Parameters with a grey name can be downloaded by passing the corresponding hashtag.
Download default pretrained weights:
net = get_model('i3d_resnet50_v1_kinetics400', pretrained=True)
Download weights given a hashtag:
net = get_model('i3d_resnet50_v1_kinetics400', pretrained='568a722e')
The test script Download test_recognizer.py can be used for evaluating the models on various datasets.
The inference script Download inference.py can be used for running inference on a list of videos (for demo purposes).
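As a quick sanity check, here is a minimal sketch (assuming GluonCV and MXNet are installed) that loads a pretrained model and runs a forward pass on a random clip:

```python
import mxnet as mx
from gluoncv.model_zoo import get_model

# Load the default pretrained weights for the Kinetics400 I3D model.
net = get_model('i3d_resnet50_v1_kinetics400', pretrained=True)

# I3D-style models expect input shaped (batch, channels, frames, height, width).
clip = mx.nd.random.uniform(shape=(1, 3, 32, 224, 224))
pred = net(clip)
print(pred.shape)  # (1, 400): one score per Kinetics400 class
```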
Kinetics400 Dataset
The following table lists pre-trained models trained on Kinetics400.
Note
Our pre-trained models reproduce results from recent state-of-the-art approaches. Please check the reference paper for further information.
All models are trained with input size 224x224, except InceptionV3, which is trained and evaluated with input size 299x299, and the C3D and R2+1D models, which are trained and evaluated with input size 112x112.
Clip Length is the number of frames within an input clip. 32 (64/2) means the clip contains 32 frames, formed by randomly selecting 64 consecutive frames from the video and then keeping every other frame (see the sketch after this note). This strategy is widely adopted to reduce computation and memory cost.
Segments is the number of segments used during training. For testing (reporting these numbers), we use 250 views for 2D networks (25 frames and 10-crop) and 30 views for 3D networks (10 clips and 3-crop), following the convention.
For the SlowFast family of networks, our performance has a small gap to the numbers reported in the paper. This is because the official SlowFast implementation forces re-encoding of every video to a fixed frame rate of 30. For fair comparison to other methods, we do not adopt that strategy, which leads to the small gap.
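The 32 (64/2) sampling above can be illustrated with a short sketch; sample_clip_indices is a hypothetical helper for illustration, not part of the GluonCV API:

```python
import random

def sample_clip_indices(num_video_frames, window=64, stride=2):
    """Randomly pick a 64-frame window, then keep every other frame.

    Assumes the video has at least `window` frames.
    """
    start = random.randint(0, num_video_frames - window)
    return list(range(start, start + window, stride))

indices = sample_clip_indices(300)
print(len(indices))  # 32 frame indices form one input clip
```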
| Name | Pretrained | Segments | Clip Length | Top-1 | Hashtag | Train Command | Train Log |
|---|---|---|---|---|---|---|---|
| inceptionv1_kinetics400 [3] | ImageNet | 7 | 1 | 69.1 | 6dcdafb1 | | |
| inceptionv3_kinetics400 [3] | ImageNet | 7 | 1 | 72.5 | 8a4a6946 | | |
| resnet18_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 65.5 | 46d5a985 | | |
| resnet34_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 69.1 | 8a8d0d8d | | |
| resnet50_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 69.9 | cc757e5c | | |
| resnet101_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 71.3 | 5bb6098e | | |
| resnet152_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 71.5 | 9bc70c66 | | |
| c3d_kinetics400 [2] | Scratch | 1 | 16 (32/2) | 59.5 | a007b5fa | | |
| p3d_resnet50_kinetics400 [5] | Scratch | 1 | 16 (32/2) | 71.6 | 671ba81c | | |
| p3d_resnet101_kinetics400 [5] | Scratch | 1 | 16 (32/2) | 72.6 | b30e3a63 | | |
| r2plus1d_resnet18_kinetics400 [6] | Scratch | 1 | 16 (32/2) | 70.8 | 5a14d1f9 | | |
| r2plus1d_resnet34_kinetics400 [6] | Scratch | 1 | 16 (32/2) | 71.6 | de2e592b | | |
| r2plus1d_resnet50_kinetics400 [6] | Scratch | 1 | 16 (32/2) | 73.9 | deaefb14 | | |
| i3d_inceptionv1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 71.8 | 81e0be10 | | |
| i3d_inceptionv3_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 73.6 | f14f8a99 | | |
| i3d_resnet50_v1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 74.0 | 568a722e | | |
| i3d_resnet101_v1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 75.1 | 6b69f655 | | |
| i3d_nl5_resnet50_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 75.2 | 3c0e47ea | | |
| i3d_nl10_resnet50_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 75.3 | bfb58c41 | | |
| i3d_nl5_resnet101_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 76.0 | fbfc1d30 | | |
| i3d_nl10_resnet101_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 76.1 | 59186c31 | | |
| slowfast_4x16_resnet50_kinetics400 [8] | Scratch | 1 | 36 (64/1) | 75.3 | 9d650f51 | | |
| slowfast_8x8_resnet50_kinetics400 [8] | Scratch | 1 | 40 (64/1) | 76.6 | d6b25339 | | |
| slowfast_8x8_resnet101_kinetics400 [8] | Scratch | 1 | 40 (64/1) | 77.2 | fbde1a7c | | |
Kinetics700 Dataset
The following table lists our trained models on Kinetics700.
| Name | Pretrained | Segments | Clip Length | Top-1 | Hashtag | Train Command | Train Log |
|---|---|---|---|---|---|---|---|
| i3d_slow_resnet101_f16s4_kinetics700 [8] | Scratch | 1 | 16 (64/4) | 67.65 | 299b1d9d | NA | NA |
UCF101 Dataset
The following table lists pre-trained models trained on UCF101.
Note
Our pre-trained models reproduce results from recent state-of-the-art approaches. Please check the reference paper for further information.
The top-1 accuracy numbers shown below are for official split 1 of the UCF101 dataset, not the average of 3 splits.
InceptionV3 is trained and evaluated with input size 299x299.
K400 denotes the Kinetics400 dataset; it means we use a model pretrained on Kinetics400 as weight initialization.
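As an example, the hashtag mechanism from the hint above distinguishes the two i3d_resnet50_v1_ucf101 entries below; a hedged sketch that selects the Kinetics400-initialized variant by its hashtag:

```python
from gluoncv.model_zoo import get_model

# '760d0981' is the hashtag of the ImageNet + K400 initialized variant listed
# below; pretrained=True would fetch the default weights instead.
net = get_model('i3d_resnet50_v1_ucf101', pretrained='760d0981')
```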
| Name | Pretrained | Segments | Clip Length | Top-1 | Hashtag | Train Command | Train Log |
|---|---|---|---|---|---|---|---|
| vgg16_ucf101 [3] | ImageNet | 3 | 1 | 83.4 | d6dc1bba | | |
| vgg16_ucf101 [1] | ImageNet | 1 | 1 | 81.5 | 05e319d4 | | |
| inceptionv3_ucf101 [3] | ImageNet | 3 | 1 | 88.1 | 13ef5c3b | | |
| inceptionv3_ucf101 [1] | ImageNet | 1 | 1 | 85.6 | 0c453da8 | | |
| i3d_resnet50_v1_ucf101 [4] | ImageNet | 1 | 32 (64/2) | 83.9 | 7afc7286 | | |
| i3d_resnet50_v1_ucf101 [4] | ImageNet, K400 | 1 | 32 (64/2) | 95.4 | 760d0981 | | |
HMDB51 Dataset
The following table lists pre-trained models trained on HMDB51.
Note
Our pre-trained models reproduce results from recent state-of-the-art approaches. Please check the reference paper for further information.
The top-1 accuracy numbers shown below are for official split 1 of the HMDB51 dataset, not the average of 3 splits.
| Name | Pretrained | Segments | Clip Length | Top-1 | Hashtag | Train Command | Train Log |
|---|---|---|---|---|---|---|---|
| resnet50_v1b_hmdb51 [3] | ImageNet | 3 | 1 | 55.2 | 682591e2 | | |
| resnet50_v1b_hmdb51 [1] | ImageNet | 1 | 1 | 52.2 | ba66ee4b | | |
| i3d_resnet50_v1_hmdb51 [4] | ImageNet | 1 | 32 (64/2) | 48.5 | 0d0ad559 | | |
| i3d_resnet50_v1_hmdb51 [4] | ImageNet, K400 | 1 | 32 (64/2) | 70.9 | 2ec6bf01 | | |
Something-Something-V2 Dataset
The following table lists pre-trained models trained on Something-Something-V2.
Note
Our pre-trained models reproduce results from recent state-of-the-art approaches. Please check the reference paper for further information.
| Name | Pretrained | Segments | Clip Length | Top-1 | Hashtag | Train Command | Train Log |
|---|---|---|---|---|---|---|---|
| resnet50_v1b_sthsthv2 [3] | ImageNet | 8 | 1 | 35.5 | 80ee0c6b | | |
| i3d_resnet50_v1_sthsthv2 [4] | ImageNet | 1 | 16 (32/2) | 50.6 | 01961e4c | | |
PyTorch
Here is the PyTorch model zoo for the video action recognition task.
Hint
Training commands work with this script: Download train_ddp_pytorch.py
python train_ddp_pytorch.py --config-file CONFIG
The test script Download test_ddp_pytorch.py can be used for performance evaluation on various datasets. Please set MODEL.PRETRAINED = True in the configuration file if you would like to use the trained models from our model zoo.
python test_ddp_pytorch.py --config-file CONFIG
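To use a trained model directly in Python rather than through the scripts, here is a hedged sketch assuming the gluoncv.torch API; 'CONFIG' is a placeholder path for one of the YAML files shipped with the model zoo:

```python
import torch
from gluoncv.torch.engine.config import get_cfg_defaults
from gluoncv.torch.model_zoo import get_model

cfg = get_cfg_defaults()
cfg.merge_from_file('CONFIG')         # placeholder path to a model's YAML config
cfg.CONFIG.MODEL.PRETRAINED = True    # use trained weights from the model zoo
net = get_model(cfg).eval()

# Dummy clip shaped (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 32, 224, 224)
with torch.no_grad():
    print(net(clip).shape)  # e.g. (1, 400) for Kinetics400 models
```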
Kinetics400 Dataset
The following table lists our trained models on Kinetics400.
Note
Our pre-trained models reproduce results from recent state-of-the-art approaches. Please check the reference paper for further information.
All models are trained with input size 224x224, except the R2+1D models, which are trained and evaluated with input size 112x112.
Clip Length is the number of frames within an input clip. 32 (64/2) means the clip contains 32 frames, formed by randomly selecting 64 consecutive frames from the video and then keeping every other frame. This strategy is widely adopted to reduce computation and memory cost.
Segment is the number of segments used during training. For testing (reporting these numbers), we use 250 views for 2D networks (25 frames and 10-crop) and 30 views for 3D networks (10 clips and 3-crop), following the convention.
The model weights of r2plus1d_v2_resnet152_kinetics400, ircsn_v2_resnet152_f32s2_kinetics400 and the TPN family are ported from the VMZ and TPN repositories. You may ignore the training configs of these models for now.
| Name | Pretrained | Segment | Clip Length | Top-1 | Hashtag | Config |
|---|---|---|---|---|---|---|
| resnet18_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 66.73 | 854b23e4 | |
| resnet34_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 69.85 | 124a2fa4 | |
| resnet50_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 70.88 | 9939dbdf | |
| resnet101_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 72.25 | 172afa3b | |
| resnet152_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 72.45 | 3dedb835 | |
| r2plus1d_v1_resnet18_kinetics400 [6] | Scratch | 1 | 16 (32/2) | 71.72 | 340a5952 | |
| r2plus1d_v1_resnet34_kinetics400 [6] | Scratch | 1 | 16 (32/2) | 72.63 | 5102fd17 | |
| r2plus1d_v1_resnet50_kinetics400 [6] | Scratch | 1 | 16 (32/2) | 74.92 | 9a3b665c | |
| r2plus1d_v2_resnet152_kinetics400 [6] | IG65M | 1 | 16 (32/2) | 81.34 | 42707ffc | |
| ircsn_v2_resnet152_f32s2_kinetics400 [10] | IG65M | 1 | 32 (64/2) | 83.18 | 82855d2c | |
| i3d_resnet50_v1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 74.87 | 18545497 | |
| i3d_resnet101_v1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 75.1 | a9bb4f89 | |
| i3d_nl5_resnet50_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 75.17 | 9df1e103 | |
| i3d_nl10_resnet50_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 75.93 | 281e1e8a | |
| i3d_nl5_resnet101_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 75.81 | 2cea8edd | |
| i3d_nl10_resnet101_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 75.93 | 526a2ed0 | |
| slowfast_4x16_resnet50_kinetics400 [8] | Scratch | 1 | 32 (64/2) | 75.25 | 1d1eadb2 | |
| slowfast_8x8_resnet50_kinetics400 [8] | Scratch | 1 | 32 (64/2) | 76.66 | e94e9a57 | |
| slowfast_8x8_resnet101_kinetics400 [8] | Scratch | 1 | 32 (64/2) | 76.95 | db5e9fef | |
| i3d_slow_resnet50_f32s2_kinetics400 [8] | Scratch | 1 | 32 (64/2) | 77.89 | 078c817b | |
| i3d_slow_resnet50_f16s4_kinetics400 [8] | Scratch | 1 | 16 (64/4) | 76.36 | a3e419f1 | |
| i3d_slow_resnet50_f8s8_kinetics400 [8] | Scratch | 1 | 8 (64/8) | 74.41 | 1c3d98a1 | |
| i3d_slow_resnet101_f32s2_kinetics400 [8] | Scratch | 1 | 32 (64/2) | 78.57 | db37cd51 | |
| i3d_slow_resnet101_f16s4_kinetics400 [8] | Scratch | 1 | 16 (64/4) | 77.11 | cb6b78d9 | |
| i3d_slow_resnet101_f8s8_kinetics400 [8] | Scratch | 1 | 8 (64/8) | 76.15 | 82e399c1 | |
| tpn_resnet50_f8s8_kinetics400 [9] | Scratch | 1 | 8 (64/8) | 77.04 | 368108eb | |
| tpn_resnet50_f16s4_kinetics400 [9] | Scratch | 1 | 16 (64/4) | 77.33 | 6bf899df | |
| tpn_resnet50_f32s2_kinetics400 [9] | Scratch | 1 | 32 (64/2) | 78.9 | 27710ce8 | |
| tpn_resnet101_f8s8_kinetics400 [9] | Scratch | 1 | 8 (64/8) | 78.1 | 092c2f7f | |
| tpn_resnet101_f16s4_kinetics400 [9] | Scratch | 1 | 16 (64/4) | 79.39 | 647080df | |
| tpn_resnet101_f32s2_kinetics400 [9] | Scratch | 1 | 32 (64/2) | 79.7 | a94422a9 | |
Kinetics700 Dataset
The following table lists our trained models on Kinetics700.
| Name | Pretrained | Segment | Clip Length | Top-1 | Hashtag | Config |
|---|---|---|---|---|---|---|
| i3d_slow_resnet101_f16s4_kinetics700 [8] | Scratch | 1 | 16 (64/4) | 67.65 | b5be1a2e | |
Something-Something-V2 Dataset
The following table lists our trained models on Something-Something-V2.
Note
Our pre-trained models reproduce results from recent state-of-the-art approaches. Please check the reference paper for further information.
| Name | Pretrained | Segment | Clip Length | Top-1 | Hashtag | Config |
|---|---|---|---|---|---|---|
| resnet50_v1b_sthsthv2 [3] | ImageNet | 8 | 1 | 35.16 | cbb9167b | |
| i3d_resnet50_v1_sthsthv2 [4] | ImageNet | 1 | 16 (32/2) | 49.61 | e975d989 | |
Reference
- [1] Limin Wang, Yuanjun Xiong, Zhe Wang and Yu Qiao. “Towards Good Practices for Very Deep Two-Stream ConvNets.” arXiv preprint arXiv:1507.02159, 2015.
- [2] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani and Manohar Paluri. “Learning Spatiotemporal Features with 3D Convolutional Networks.” In International Conference on Computer Vision (ICCV), 2015.
- [3] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang and Luc Van Gool. “Temporal Segment Networks: Towards Good Practices for Deep Action Recognition.” In European Conference on Computer Vision (ECCV), 2016.
- [4] Joao Carreira and Andrew Zisserman. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” In Computer Vision and Pattern Recognition (CVPR), 2017.
- [5] Zhaofan Qiu, Ting Yao and Tao Mei. “Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks.” In International Conference on Computer Vision (ICCV), 2017.
- [6] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun and Manohar Paluri. “A Closer Look at Spatiotemporal Convolutions for Action Recognition.” In Computer Vision and Pattern Recognition (CVPR), 2018.
- [7] Xiaolong Wang, Ross Girshick, Abhinav Gupta and Kaiming He. “Non-local Neural Networks.” In Computer Vision and Pattern Recognition (CVPR), 2018.
- [8] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik and Kaiming He. “SlowFast Networks for Video Recognition.” In International Conference on Computer Vision (ICCV), 2019.
- [9] Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai and Bolei Zhou. “Temporal Pyramid Network for Action Recognition.” In Computer Vision and Pattern Recognition (CVPR), 2020.
- [10] Du Tran, Heng Wang, Lorenzo Torresani and Matt Feiszli. “Video Classification with Channel-Separated Convolutional Networks.” In International Conference on Computer Vision (ICCV), 2019.