Leveraging over depth in egocentric activity recognition / Planamente, Mirco; Russo, Paolo; Caputo, Barbara. - (2019). (Paper presented at the conference 1a Conferenza Italiana di Robotica e Macchine Intelligenti, held in Rome in 2019) [10.5281/zenodo.4782200].

Leveraging over depth in egocentric activity recognition

Planamente, Mirco; Russo, Paolo; Caputo, Barbara
2019

Abstract

Activity recognition from first-person videos is a growing research area. The increasing diffusion of egocentric sensors in various devices makes it timely to develop approaches able to recognize fine-grained first-person actions such as picking up, putting down, and pouring. While most previous work has focused on RGB data, some authors have pointed out the importance of leveraging depth information in this domain. In this paper we follow this trend and propose the first deep architecture that uses depth maps as an attention mechanism for first-person activity recognition. Specifically, we blend the RGB and depth data so as to obtain an enriched input for the network. This blending puts more or less emphasis on different parts of the image based on their distance from the observer, hence acting as an attention mechanism. To further strengthen the proposed activity recognition protocol, we opt for a self-labeling approach. This, combined with a Conv-LSTM block for extracting temporal information from the various frames, leads to a new state of the art on two publicly available benchmark databases. An ablation study completes our experimental findings, confirming the effectiveness of our approach.
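
The abstract describes the depth-as-attention mechanism only at a high level. As a rough Python sketch of the idea, the snippet below blends an RGB frame with its aligned depth map so that pixels closer to the observer receive more weight; the normalization, the inverted-depth weighting, and the mixing parameter alpha are assumptions for illustration, not the formula used in the paper.

    import numpy as np

    def depth_attention_blend(rgb, depth, alpha=0.5):
        # rgb: (H, W, 3) uint8 frame; depth: (H, W) depth map aligned with it.
        # Normalize depth to [0, 1] and invert it, so that near pixels
        # (typically the hands and the manipulated object) get high weight.
        d = depth.astype(np.float32)
        d = (d - d.min()) / (d.max() - d.min() + 1e-8)
        weight = (1.0 - d)[..., None]  # (H, W, 1), broadcasts over channels
        # Enriched input: convex combination of the original frame and the
        # depth-weighted frame, controlled by the hypothetical alpha.
        blended = (1.0 - alpha) * rgb.astype(np.float32) + alpha * weight * rgb
        return np.clip(blended, 0.0, 255.0).astype(np.uint8)

Per the abstract, frames enriched in this way feed the network, with the Conv-LSTM block then aggregating temporal information across the frame sequence.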
Files in this record:

File: Conferenza_I-RIM_2019_extabs_103.pdf
Access: open access
Type: 2. Post-print / Author's Accepted Manuscript
License: PUBLIC - All rights reserved
Size: 260.21 kB
Format: Adobe PDF
File: Leveraging over depth in egocentric activity recognition.pdf
Access: open access
Type: 2a. Post-print, publisher's version / Version of Record
License: PUBLIC - All rights reserved
Size: 267.76 kB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2846440