Video Action Understanding: Tasks
- Action Recognition (AR) is the process of classifying a complete input (either an entire video or a specified segment) by the action occurring in the input.
- Action Prediction (AP) is the process of classifying an incomplete input by an action that has not yet been fully observed.
- Temporal Action Proposal (TAP) is the process of partitioning an input video into segments (consecutive series of frames) of action and inaction by indicating start and end markers of each action instance.
- Temporal Action Localization/Detection (TAL/D) is the process of creating temporal action proposals and classifying each action.
- Spatiotemporal Action Proposal (SAP) is the process of partitioning an input video by both space (bounding boxes) and time (per-frame OR start and end markers of a segment) between regions of action and inaction.
- Spatiotemporal Action Localization/Detection (SAL/D) is the process of creating spatiotemporal action proposals and classifying each frame’s bounding boxes (or action tubes when a linking strategy is applied).
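To make these distinctions concrete, the sketch below shows the kind of annotation each task family produces or consumes. It is a minimal Python illustration; the class and field names are assumptions made for this post, not any dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ClipLabel:
    """AR / AP: a class label for a complete (or, for AP, partially observed) clip."""
    video_id: str
    action_class: str


@dataclass
class TemporalSegment:
    """TAP / TAL/D: start and end markers of one action instance."""
    video_id: str
    start_sec: float
    end_sec: float
    action_class: Optional[str] = None  # None for a class-agnostic proposal (TAP)


@dataclass
class SpatiotemporalTube:
    """SAP / SAL/D: per-frame bounding boxes, optionally linked into an action tube."""
    video_id: str
    start_frame: int
    end_frame: int
    boxes: List[Tuple[float, float, float, float]]  # one (x1, y1, x2, y2) box per frame
    action_class: Optional[str] = None  # None for a class-agnostic proposal (SAP)
```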
Video Action Datasets
Action Recognition Datasets
Sports-1M [1]
https://cs.stanford.edu/people/karpathy/deepvideo/
Sports-1M was produced in 2014 as a large-scale video classification benchmark for comparing CNNs. Examples of the 487 sports action classes include “cycling”, “snowboarding”, and “american football”. Note that some inter-class variation is low (e.g. classes include 23 types of billiards, 6 types of bowling, and 7 types of American football). Videos were collected from YouTube and weakly annotated using text metadata. The dataset consists of one million videos with a 70/20/10 training/validation/test split. On average, videos are ∼5.5 minutes long, and approximately 5% are annotated with > 1 class. As one of the first large-scale datasets, Sports-1M was critical for demonstrating the effectiveness of CNN architectures for feature learning.
Something-Something [2, 3]
https://www.twentybn.com/datasets/something-something
Something-Something [2] (a.k.a. 20BN-SOMETHING-SOMETHING) was produced in 2017 as a human-object interaction benchmark. Examples of the 174 classes include “holding something”, “turning something upside down”, and “folding something”. Video creation was crowd-sourced through Amazon Mechanical Turk (AMT). The dataset consists of 108,499 videos with an 80/10/10 training/validation/test split. Each single-instance video lasts for 2-6 seconds. The dataset was expanded to Something-Something-v2 [3] in 2018 by increasing the size to 220,847 videos, adding object annotations, reducing label noise, and improving video resolution. These datasets are important benchmarks for human-object interaction due to their scale and quality.
Kinetics-400, Kinetics-600, and Kinetics-700 [4, 5, 6]
https://deepmind.com/research/open-source/kinetics
The Kinetics dataset family was produced as “a large-scale, high quality dataset of URL links” to human action video clips focusing on human-object interactions and human-human interactions. Kinetics-400 [4] was released in 2017, and examples of the 400 human action classes include “hugging”, “mowing lawn”, and “washing dishes”. Video clips were collected from YouTube and annotated by AMT crowd-workers. The dataset consists of 306,245 videos; within each class, 50 videos are reserved for validation and 100 for testing. Each single-instance video lasts for ∼10 seconds. The dataset was expanded to Kinetics-600 [5] in 2018 by increasing the number of classes to 600 and the number of videos to 495,547. It was expanded again to Kinetics-700 [6] in 2019 by increasing to 700 classes and 650,317 videos. These are among the most cited human action datasets in the field and continue to serve as a standard benchmark and pretraining source.
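As a small illustration of the per-class split rule above (50 validation and 100 test clips per class), one could partition a class's clip IDs as follows. The helper name and input format are assumptions for this sketch, not part of any official Kinetics tooling.

```python
import random

def split_kinetics_class(clip_ids, n_val=50, n_test=100, seed=0):
    """Hold out n_val clips for validation and n_test for testing within one class."""
    ids = list(clip_ids)
    random.Random(seed).shuffle(ids)
    return {
        "val": ids[:n_val],
        "test": ids[n_val:n_val + n_test],
        "train": ids[n_val + n_test:],
    }

# Example with hypothetical clip IDs for a single class:
splits = split_kinetics_class([f"clip_{i:05d}" for i in range(763)])
assert len(splits["val"]) == 50 and len(splits["test"]) == 100
```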
NTU RGB+D [7, 8]
http://rose1.ntu.edu.sg/datasets/actionrecognition.asp
NTU RGB+D [7] was produced in 2016 as “a large-scale dataset for RGB-D human action recognition.” Its multi-modal nature provides depth maps, 3D skeletons, and infrared in addition to RGB video. Examples of the 60 human action classes include “put on headphone”, “toss a coin”, and “eat meal”. Videos were captured with a Microsoft Kinect v2 in a variety of settings. The dataset consists of 56,880 single-instance video clips from 40 different subjects in 80 different views. Training and validation splits are not specified. The dataset was expanded to NTU RGB+D 120 [8] in 2019 by increasing the number of classes to 120, videos to 114,480, subjects to 106, and views to 155. It serves as a state-of-the-art benchmark for human AR with non-RGB modalities.
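As a rough sketch of the skeleton modality, a clip's 3D joint stream can be handled as a (frames, joints, 3) array. The 25-joint count below reflects the Kinect v2 sensor mentioned above, and the normalization helper is an illustrative assumption rather than the dataset's own tooling.

```python
import numpy as np

N_JOINTS = 25  # Kinect v2 joint count (an assumption about the released skeleton format)

def center_skeleton(seq: np.ndarray, root_joint: int = 0) -> np.ndarray:
    """Center each frame of a (frames, joints, 3) skeleton sequence on a root joint."""
    assert seq.ndim == 3 and seq.shape[1] == N_JOINTS and seq.shape[2] == 3
    return seq - seq[:, root_joint:root_joint + 1, :]

# Toy example: a fake 90-frame clip of random 3D joint positions.
clip = np.random.randn(90, N_JOINTS, 3).astype(np.float32)
centered = center_skeleton(clip)
```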
Moments in Time (MiT) [9, 10]
Moments in Time (MiT) [9] was produced in 2018 with a focus on broadening action understanding to include people, objects, animals, and natural phenomena. Examples of the 339 diverse action classes include “running”, “opening”, and “picking”. Video clips were collected from a variety of internet sources and annotated by AMT crowd-workers. The dataset consists of 903,964 videos with a roughly 89/4/7 training/validation/test split. Each single-instance video lasts for 3 seconds. The dataset was improved to Multi-Moments in Time (M-MiT) [10] in 2019 by increasing the number of videos to 1.02 million, pruning vague classes, and increasing the number of labels per video (2.01 million total labels). MiT and M-MiT are interesting benchmarks because of their focus on inter-class and intra-class variation.
Jester [11]
https://20bn.com/datasets/jester
Jester [11] (a.k.a. 20BN-JESTER) was produced in 2019 as “a large collection of densely labeled video clips that show humans performing pre-defined hand gestures in front of laptop camera or webcam.” Examples of the 27 human hand gestures include “drumming fingers”, “shaking hand”, and “swiping down”. Data creation was crowd-sourced through AMT. The dataset consists of 148,092 videos with an 80/10/10 training/validation/test split. Each single-instance video lasts for ∼3 seconds. The Jester dataset is the first large-scale, semantically low-level human AR dataset.
AViD [12]
https://github.com/piergiaj/AViD
Anonymized Videos from Diverse countries (AViD) [12] was produced in 2020 with the intent of (1) avoiding the western bias of many datasets by providing human actions (and some non-human actions) from a diverse set of people and cultures, (2) anonymizing all human faces to protect the privacy of the individuals, and (3) keeping the dataset static over time by including only videos with a Creative Commons license. Most of the 887 classes are drawn from Kinetics, Charades, and MiT after removing duplicates and any actions that involve the face (e.g. “smiling”); 159 actions not found in any of those datasets are also added. Web videos in 22 different languages were annotated by AMT crowd-workers. The dataset consists of approximately 450,000 videos with a 90/10 training/validation split. Each single-instance video lasts between 3 and 15 seconds. We believe AViD will quickly become a foundational benchmark because of its emphasis on diversity of actors and privacy standards.
UCF101 [13]
https://www.crcv.ucf.edu/data/UCF101.php
UCF101 is an action recognition dataset of realistic action videos collected from YouTube, spanning 101 action categories. It is an extension of the UCF50 dataset, which has 50 action categories.
With 13,320 videos from 101 action categories, UCF101 offers large diversity in actions together with wide variation in camera motion, object appearance and pose, object scale, viewpoint, background clutter, and illumination, making it one of the most challenging datasets of its time. Because most earlier action recognition datasets were unrealistic and staged by actors, UCF101 aims to encourage further research into recognition of realistic action categories.
The videos in the 101 action categories are grouped into 25 groups, where each group can consist of 4-7 videos of an action. Videos from the same group may share common features, such as a similar background or viewpoint, which matters when constructing train/test splits (see the sketch below).
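Because clips from the same group share background and viewpoint, splits are normally drawn so that no group spans both train and test. The sketch below uses scikit-learn's GroupShuffleSplit as one possible way to do this; the record layout and IDs are hypothetical, and the official UCF101 protocol instead provides predefined train/test lists.

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical records: (clip_id, class_label, group_id).
videos = [
    ("v_ApplyEyeMakeup_g01_c01", "ApplyEyeMakeup", "ApplyEyeMakeup_g01"),
    ("v_ApplyEyeMakeup_g01_c02", "ApplyEyeMakeup", "ApplyEyeMakeup_g01"),
    ("v_ApplyEyeMakeup_g02_c01", "ApplyEyeMakeup", "ApplyEyeMakeup_g02"),
    ("v_Basketball_g01_c01", "Basketball", "Basketball_g01"),
    ("v_Basketball_g02_c01", "Basketball", "Basketball_g02"),
]
clip_ids, labels, groups = zip(*videos)

# Hold out ~25% of the groups; each group lands entirely in train or entirely in test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(clip_ids, labels, groups=groups))
```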
ActivityNet [14, 15]
http://activity-net.org/index.html
The ActivityNet dataset family [14, 15] was produced “to compare algorithms for human activity understanding: global video classification, trimmed activity recognition and activity detection.” Example human action classes include “Drinking coffee”, “Getting a tattoo”, and “Ironing clothes”. ActivityNet 100 (v1.2) was released in 2015. The 100-class dataset consists of 9,682 videos divided into a training set of 4,819 videos (7,151 instances), a validation set of 2,383 videos (3,582 instances), and a test set of 2,480 videos. ActivityNet 200 (v1.3) was released in 2016. The 200-class dataset consists of 19,994 videos divided into a training set of 10,024 videos (15,410 instances), a validation set of 4,926 videos (7,654 instances), and a test set of 5,044 videos. On average, action instances are 51.4 seconds long. Web videos were temporally annotated by AMT crowd-workers. ActivityNet has remained a foundational benchmark for TAP and TAL/D because of its scope and size. It is also commonly applied as an untrimmed multi-label AR benchmark.
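Because ActivityNet instances are annotated with start and end markers, TAP and TAL/D results on it are conventionally scored by temporal overlap with the ground truth. The snippet below computes the standard temporal intersection-over-union for a pair of segments; it is a general-purpose measure, not code taken from ActivityNet's evaluation kit.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start_sec, end_sec) segments."""
    (ps, pe), (gs, ge) = pred, gt
    intersection = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - intersection
    return intersection / union if union > 0 else 0.0

# Example: a prediction against a 51.4-second ground-truth instance.
print(temporal_iou((10.0, 55.0), (12.0, 63.4)))  # ~0.81
```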
Charades [16]
https://prior.allenai.org/projects/charades
Charades was produced in 2016 as a crowd-sourced dataset of daily human activities. Examples of the 157 classes include “pouring into cup”, “running”, and “folding towel”. The dataset consists of 9,848 videos (66,500 temporal action annotations) with a roughly 80/20 training/validation split. Videos were filmed in 267 homes with an average length of 30.1 seconds and an average of 6.8 actions per video. Action instances are 12.8 seconds long on average. Charades-Ego was released in 2018 using similar methodologies and the same 157 classes; however, in this dataset, an egocentric (first-person) view and a third-person view are available for each video. The dataset consists of 7,860 videos (68.8 hours) capturing 68,536 temporally annotated action instances. Charades has served as a TAL/D benchmark along with ActivityNet, but it has also found use as a multi-label AR benchmark because of the high average number of actions per video. Charades-Ego offers a multi-view quality unique among large-scale daily human action datasets.
MultiTHUMOS [17]
https://ai.stanford.edu/~syyeung/everymoment.html
MultiTHUMOS [17] was produced in 2017 as an extension of the dataset used in the 2014 THUMOS Challenge. Examples of the 65 human action classes include “throw”, “hug”, and “talkToCamera”. The dataset consists of 413 videos (30 hours) with 38,690 multi-label, frame-level annotations (an average of 1.5 per frame). The total number of action instances (an instance being a set of sequential frames with the same action annotation) is not reported. The number of action instances per class is extremely variable, ranging from “VolleyballSet” with 15 to “Run” with 3,500. Each action instance lasts on average for 3.3 seconds, with some lasting only 66 milliseconds (2 frames). Like Charades, MultiTHUMOS offers a benchmark for multi-label TAP and TAL/D, and it stands out due to its dense multi-labeling scheme.
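Dense frame-level multi-labels like these are naturally stored as a frames-by-classes binary matrix, and the average of 1.5 labels per frame quoted above is then just a mean row sum. A small sketch under that assumed layout:

```python
import numpy as np

def label_density(frame_labels: np.ndarray) -> float:
    """Average number of simultaneously active labels per frame.

    Assumed layout: frame_labels[t, c] == 1 iff class c is active at frame t.
    """
    return float(frame_labels.sum(axis=1).mean())

# Toy example: 4 frames, 3 classes (say "Run", "Throw", "Hug" in some assumed order).
toy = np.array([[1, 0, 0],
                [1, 1, 0],
                [1, 1, 1],
                [0, 1, 0]])
print(label_density(toy))  # 1.75 labels per frame on average
```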
VIRAT [18]
VIRAT [18] was created in 2011 as “a new large-scale surveillance video dataset designed to assess the performance of event recognition algorithms in realistic scenes.” It includes both ground and aerial surveillance videos. Examples of the 23 classes include “picking up”, “getting in a vehicle”, and “exiting a facility”. The dataset consists of 17 videos (29 hours) with between 10 and 1,500 action instances per class. Because the camera-to-subject distance varies widely across views, a person's height occupies between 2% and 20% of the video frame height. Crowd-workers created bounding boxes around moving objects along with temporal event annotations. While this is a smaller dataset, VIRAT is the highest-quality surveillance-based spatiotemporal dataset and is used in the latest SAL/D competitions.
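The 2%-20% figure is simply the height of a person's bounding box divided by the frame height; a one-line illustration follows, where the (x, y, width, height) box format is an assumption for this sketch rather than VIRAT's documented schema.

```python
def person_height_ratio(box, frame_height_px):
    """Fraction of the frame height occupied by a bounding box (box = (x, y, w, h) in pixels)."""
    _, _, _, box_h = box
    return box_h / frame_height_px

# Example: a 54 px tall person box in a 1080 px tall frame occupies 5% of the frame height.
print(person_height_ratio((400, 300, 20, 54), 1080))  # 0.05
```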
Video Action Challenges
References
[1] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale Video Classification with Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. 2017. The “Something Something” Video Database for Learning and Evaluating Visual Common Sense. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
[3] Farzaneh Mahdisoltani, Guillaume Berger, Waseem Gharbieh, David Fleet, and Roland Memisevic. 2018. On the effectiveness of task granularity for transfer learning. arXiv:1804.09235 [cs.CV]
[4] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. 2017. The Kinetics Human Action Video Dataset. arXiv:1705.06950 [cs.CV]
[5] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. 2018. A Short Note about Kinetics-600. arXiv:1808.01340 [cs.CV]
[6] Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. 2019. A Short Note on the Kinetics-700 Human Action Dataset. arXiv:1907.06987 [cs.CV]
[7] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. 2016. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. arXiv:1604.02808 [cs.CV]
[8] J. Liu, A. Shahroudy, M. L. Perez, G. Wang, L. Duan, and A. Kot Chichung. 2019. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019), 1–1.
[9] Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfruend, Carl Vondrick, and Aude Oliva. 2018. Moments in Time Dataset: one million videos for event understanding. arXiv:1801.03150 [cs.CV]
[10] Mathew Monfort, Kandan Ramakrishnan, Alex Andonian, Barry A McNamara, Alex Lascelles, Bowen Pan, Quanfu Fan, Dan Gutfreund, Rogerio Feris, and Aude Oliva. 2019. Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding. arXiv:1911.00232 [cs.CV]
[11] Joanna Materzynska, Guillaume Berger, Ingo Bax, and Roland Memisevic. 2019. The Jester Dataset: A Large-Scale Video Dataset of Human Gestures. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops.
[12] AJ Piergiovanni and Michael S. Ryoo. 2020. AViD Dataset: Anonymized Videos from Diverse Countries. arXiv:2007.05515 [cs.CV]
[13] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. arXiv:1212.0402 [cs.CV]
[14] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Fabian Caba Heilbron and Juan Carlos Niebles. 2014. Collecting and Annotating Human Activities in Web Videos. In Proceedings of International Conference on Multimedia Retrieval. ACM, 377.
[16] Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. In Proceedings of the European Conference on Computer Vision (ECCV).
[17] Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, and Li Fei-Fei. 2018. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. International Journal of Computer Vision 126, 2 (2018), 375–389.
[18] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. Chen, J. T. Lee, S. Mukherjee, J. K. Aggarwal, H. Lee, L. Davis, E. Swears, X. Roy-Chowdhury, and M. Desai. 2011. A large-scale benchmark dataset for event recognition in surveillance video. In CVPR 2011. 3153–3160.
This post is mainly based on the paper: https://arxiv.org/abs/2010.06647