MPOSE2021: a Dataset for Short-Time Pose-Based Human Action Recognition / Mazzia, Vittorio; Angarano, Simone; Salvetti, Francesco; Angelini, Federico; Chiaberge, Marcello. - (2021). [10.5281/zenodo.5506688]
MPOSE2021: a Dataset for Short-Time Pose-Based Human Action Recognition
Vittorio Mazzia;Simone Angarano;Francesco Salvetti;Marcello Chiaberge
2021
Abstract
This repository contains the MPOSE2021 dataset, specifically designed for short-time, pose-based Human Action Recognition (HAR). MPOSE2021 was developed as an evolution of the MPOSE dataset [1-3]. It consists of human pose data extracted by OpenPose [4] and PoseNet [11] from popular HAR datasets, i.e. Weizmann [5], i3DPost [6], IXMAS [7], KTH [8], UTKinect-Action3D (RGB only) [9] and UTD-MHAD (RGB only) [10], alongside the original ISLD and ISLD-Additional-Sequences video datasets [1]. Since these datasets have heterogeneous action labels, the labels of each dataset are remapped to a common, homogeneous list of actions. Sequences are obtained by cutting the so-called precursor videos (the videos from the above-mentioned datasets) with non-overlapping sliding windows, and each generated sequence is between 20 and 30 frames long. Frames in which OpenPose/PoseNet cannot detect any subject are automatically discarded. Each resulting sample contains one subject at a time, performing a fraction of a single action. Overall, MPOSE2021 contains 15429 samples, divided into 20 actions and performed by 100 subjects. More information can be found in the MPOSE2021 repository, which also provides a user-friendly Python package for importing and using the dataset, installable by running pip install mpose (a usage sketch follows the Data Structure section below).

Data Structure
For each pose extractor, the repository contains 3 datasets (namely 1, 2 and 3), which consist of the same data divided into different train/test splits. Each dataset contains X and y numpy arrays for both training and testing. X has the following shape: (B, T, K, C), where:
- B is the number of samples (the batch dimension);
- T (= 30) is the duration of the sequences in frames (shorter sequences are zero-padded);
- K (= 17 for PoseNet, 25 for OpenPose) is the number of pose keypoints;
- C (= 3) is the number of channels, comprising the 2D keypoint coordinates (x, y) in the original video reference frame and the keypoint confidence p (<= 1).
The .txt files specifying the metadata associated with the split samples are also included.
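The following Python sketch illustrates how the dataset could be loaded through the mpose package and how the array shapes described above map onto the data. The MPOSE class name, its pose_extractor and split arguments, and the get_data() method are assumptions based on the package's documented usage and may differ across versions; consult the MPOSE2021 repository for the authoritative API.

    # Minimal usage sketch, assuming the mpose package API noted above.
    from mpose import MPOSE

    # Pick a pose extractor ('openpose' or 'posenet') and one of the
    # three train/test splits (1, 2 or 3).
    dataset = MPOSE(pose_extractor='openpose', split=1)
    X_train, y_train, X_test, y_test = dataset.get_data()

    # X has shape (B, T, K, C): B samples, T = 30 frames (zero-padded),
    # K = 25 keypoints for OpenPose (17 for PoseNet), and C = 3 channels
    # holding (x, y, confidence) for each keypoint.
    print(X_train.shape)  # e.g. (B, 30, 25, 3)
    print(y_train.shape)  # (B,) integer labels over the 20 actions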
References
MPOSE2021 is part of a paper published in the Pattern Recognition journal (Elsevier) and is intended for scientific research purposes. If you use MPOSE2021 in your research work, please also cite [1-11].

@article{mazzia2021action, title={Action Transformer: A Self-Attention Model for Short-Time...

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2978466