Leveraging over depth in egocentric activity recognition / Planamente, Mirco; Russo, Paolo; Caputo, Barbara. - (2019). (Paper presented at the conference 1a Conferenza Italiana di Robotica e Macchine Intelligenti, held in Rome in 2019) [10.5281/zenodo.4782200].
Leveraging over depth in egocentric activity recognition
Planamente, Mirco; Russo, Paolo; Caputo, Barbara
2019
Abstract
Activity recognition from first-person videos is a growing research area. The increasing diffusion of egocentric sensors in various devices makes it timely to develop approaches able to recognize fine-grained first-person actions such as picking up, putting down, and pouring. While most previous work focused on RGB data, some authors have pointed out the importance of leveraging depth information in this domain. In this paper we follow this trend and propose the first deep architecture that uses depth maps as an attention mechanism for first-person activity recognition. Specifically, we blend the RGB and depth data together, so as to obtain an enriched input for the network. This blending puts more or less emphasis on different parts of the image based on their distance from the observer, hence acting as an attention mechanism. To further strengthen the proposed activity recognition protocol, we opt for a self-labeling approach. This, combined with a Conv-LSTM block for extracting temporal information from the various frames, leads to a new state of the art on two publicly available benchmark databases. An ablation study completes our experimental findings, confirming the effectiveness of our approach.
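The abstract does not give the exact blending formula, so the following is only a minimal sketch of the general idea: re-weighting RGB pixels by their depth so that regions nearer to the egocentric camera receive more emphasis. The function `depth_blend`, the [0, 1] depth normalization, and the `alpha` parameter are all hypothetical illustrations, not the authors' actual method.

```python
import numpy as np

def depth_blend(rgb, depth, alpha=0.5):
    """Hypothetical depth-weighted blending (not the paper's exact formula).

    rgb:   H x W x 3 float array in [0, 1]
    depth: H x W float array, raw depth values (larger = farther)
    alpha: strength of the depth attention term (assumed hyperparameter)
    """
    # Normalize depth to [0, 1] and invert it, so that pixels closer
    # to the observer receive larger attention weights.
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    attention = 1.0 - d
    # Blend: keep the RGB appearance but re-weight each pixel by its
    # proximity to the camera, acting as a soft spatial attention map.
    return (1.0 - alpha) * rgb + alpha * rgb * attention[..., None]
```

Under this reading, the blended frames would replace plain RGB as the enriched input to the network (e.g. the Conv-LSTM block mentioned above); the actual weighting scheme used by the authors may differ.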
File | Access | Type | License | Size | Format
---|---|---|---|---|---
Conferenza_I-RIM_2019_extabs_103.pdf | Open access | 2. Post-print / Author's Accepted Manuscript | PUBLIC - All rights reserved | 260.21 kB | Adobe PDF
Leveraging over depth in egocentric activity recognition.pdf | Open access | 2a Post-print, publisher's version / Version of Record | PUBLIC - All rights reserved | 267.76 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2846440