Action Transformer: A Self-Attention Model for Short-Time Pose-Based Human Action Recognition / Mazzia, Vittorio; Angarano, Simone; Salvetti, Francesco; Angelini, Federico; Chiaberge, Marcello. - In: PATTERN RECOGNITION. - ISSN 0031-3203. - ELECTRONIC. - 124:(2022), p. 108487. [10.1016/j.patcog.2021.108487]
Action Transformer: A Self-Attention Model for Short-Time Pose-Based Human Action Recognition
Mazzia, Vittorio; Angarano, Simone; Salvetti, Francesco; Angelini, Federico; Chiaberge, Marcello
2022
Abstract
Deep neural networks based purely on attention have been successful across several domains, relying on minimal architectural priors from the designer. In Human Action Recognition (HAR), attention mechanisms have been primarily adopted on top of standard convolutional or recurrent layers, improving the overall generalization capability. In this work, we introduce Action Transformer (AcT), a simple, fully self-attentional architecture that consistently outperforms more elaborate networks that mix convolutional, recurrent, and attentive layers. To limit computational and energy requirements, and building on previous human action recognition research, the proposed approach exploits 2D pose representations over small temporal windows, providing a low-latency solution for accurate and effective real-time performance. Moreover, we open-source MPOSE2021, a new large-scale dataset, as an attempt to build a formal training and evaluation benchmark for real-time, short-time HAR. The proposed methodology was extensively tested on MPOSE2021 and compared to several state-of-the-art architectures, demonstrating the effectiveness of the AcT model and laying the foundations for future work on HAR.

| File | Access | Description | Type | License | Size | Format |
|---|---|---|---|---|---|---|
| AcT___Pattern_Recognition.pdf | Open Access since 16/12/2023 | Post-print | 2. Post-print / Author's Accepted Manuscript | Creative Commons | 7.19 MB | Adobe PDF |
| 1-s2.0-S0031320321006634-main.pdf | Restricted access | Post-print, publisher's version | 2a. Post-print, publisher's version / Version of Record | Non-public: private/restricted access | 2.88 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2946512