Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition