Summary: | <p>Understanding and generating people in images and videos is a long-standing goal in computer vision. Over the last decades, the research community has devoted significant effort to these tasks, motivated by a large number of potential applications, such as surveillance, human-machine interaction, action and behaviour recognition, motion capture, video reenactment, and computer graphics animation. This endeavour is also driven by the difficulty of the problems themselves, arising for instance from the endless combinations of environments, visual appearances, and postures in which humans can appear in images. Beyond that, the high dimensionality of the human body, the inherent noise of visual data, and the ill-posed nature of the problems are also relevant issues. Nonetheless, meaningful advances in the field have recently been achieved using deep learning. </p> <p>This thesis pursues further advances towards understanding and generating people in visual data through the development of new discriminative and generative deep learning methods. The main contributions are: </p> <p>i) A deep learning framework for 2D human pose estimation that allows for mean-field inference over part-based models; </p> <p>ii) A conditional deep generative model that achieves state-of-the-art results in generating images of humans conditioned on body posture; and </p> <p>iii) A structured semi-supervised deep generative model that jointly performs pose estimation and image generation, <em>understanding</em> and <em>generating</em> people in images in a single framework.</p>