Action Representation and Recognition

Daniel Weinland

Abstract

Recognizing human actions is an important and challenging topic in computer vision, with many important applications, including video surveillance, video indexing, and the understanding of social interaction. From a computational perspective, actions can be defined as four-dimensional patterns in space and time. Such patterns can be modeled using several representations, which differ from each other with respect to, among other things, the visual information used (e.g. shape or appearance), the representation of dynamics (e.g. implicit or explicit), and the amount of invariance that the representation exhibits (e.g. viewpoint invariance, which allows learning and recognition with different camera configurations). Our goal in this thesis is to develop a set of new techniques for action recognition. In the first part, we present "Motion History Volumes", a free-viewpoint representation for human actions based on 3D visual-hull reconstructions computed from multiple calibrated and background-subtracted video cameras. Results indicate that this representation can be used to learn and recognize basic human action classes independently of gender, body size, and viewpoint. In the second part, we present an approach based on a 3D exemplar-based HMM, which addresses the problem of recognizing actions from arbitrary views, even from a single camera. We thus no longer require a 3D reconstruction during the recognition phase; instead, we use learned 3D models to produce 2D image information, which is compared to the observations. In the third and last part, we present a compact and efficient exemplar-based representation which, in particular, does not attempt to encode the dynamics of an action through temporal dependencies. In experimental results, we demonstrate that such a representation can recognize actions accurately, even in cluttered sequences without background segmentation.

The recognition of human actions and activities is an ambitious research topic in computer vision, with many important applications, notably in video surveillance and in interactive, intelligent environments. From a computational point of view, an action can be defined as a four-dimensional entity in space and time. Several representations can then be considered, which differ in the information they use, for example: shape or appearance; the explicit or implicit representation of how an action unfolds (its dynamics); the invariance of the model to gender, height, and build; and the invariance to viewpoint, which makes it possible to learn and recognize an action with different camera configurations. In this thesis, we study these representations and their impact on action recognition. We are particularly interested in the invariance of the representations, in modeling the dynamics of an action, and in how to segment an action. Our results demonstrate that simple actions, for example standing up or running, can be recognized independently of viewpoint, of the characteristics of the observed body, and of the dynamics of the action.
