Rapid Visual Object Learning in Humans is Explainable by Low-Dimensional Image Representations
Main Author:
Other Authors:
Format: Thesis
Published: Massachusetts Institute of Technology, 2023
Online Access: https://hdl.handle.net/1721.1/147557 https://orcid.org/0000-0002-2576-6059
Summary: How humans learn to recognize new objects is an open problem. In this thesis, we consider one class of theories for how this is accomplished: humans re-represent incoming retinal images in a stable, multidimensional Euclidean space, and build linear decoders in this space for new object categories from image exemplars.
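As a minimal sketch of this class of theories, the snippet below fixes an embedding function and learns a new binary object category by fitting a linear decoder on a handful of exemplars. The `embed` function, its 128-dimensional output, and the random "images" are illustrative placeholders, not the thesis's actual models or data.

```python
# Sketch: a new object category is learned by fitting a linear decoder
# on top of a fixed Euclidean image representation, using only a few
# labeled exemplars. embed() is a stand-in for a stable, image-computable
# representation (e.g., a deep network's penultimate layer).
import numpy as np

rng = np.random.default_rng(0)

def embed(images: np.ndarray) -> np.ndarray:
    """Placeholder embedding: flatten and keep 128 dimensions."""
    return images.reshape(len(images), -1)[:, :128]

def fit_linear_decoder(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares linear decoder w such that sign(X @ w) ~ y."""
    w, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
    return w

# Few-shot learning: 5 exemplars per category, labels in {-1, +1}.
train_images = rng.normal(size=(10, 16, 16))
train_labels = np.repeat([-1, 1], 5)
w = fit_linear_decoder(embed(train_images), train_labels)

test_images = rng.normal(size=(4, 16, 16))
predictions = np.sign(embed(test_images) @ w)
```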
In Part I, we empirically characterize human learning behavior over a battery of different learning subtasks, and find that humans rapidly learn new objects from a small number of examples. We then build neurally mechanistic, end-to-end models of object learning based on recent advances in image-computable models of ventral stream representations. We point to shortcomings of these models, including the fact that none of them matches the human ability to learn new objects from few examples.
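The kind of measurement described in Part I can be illustrated by a learning curve: accuracy on a binary object task as a function of the number of training exemplars. In the sketch below, the Gaussian "representations", the prototype decoder, and all parameters are synthetic placeholders standing in for model (or inferred human) image embeddings.

```python
# Sketch: few-shot learning curve for a linear (prototype) decoder on
# synthetic Gaussian class representations. Accuracy rises with the
# number of exemplars; the setup is illustrative only.
import numpy as np

rng = np.random.default_rng(1)
dim, n_test, n_trials = 64, 200, 50
mu = np.zeros(dim); mu[0] = 1.0          # class means at +/- mu

def sample(n, sign):
    return sign * mu + rng.normal(size=(n, dim))

for n_train in (1, 2, 5, 10, 50):
    accs = []
    for _ in range(n_trials):
        Xp, Xn = sample(n_train, +1), sample(n_train, -1)
        w = Xp.mean(0) - Xn.mean(0)      # prototype (linear) decoder
        Xt = np.vstack([sample(n_test, +1), sample(n_test, -1)])
        yt = np.repeat([1, -1], n_test)
        accs.append(np.mean(np.sign(Xt @ w) == yt))
    print(f"n={n_train:3d}  accuracy={np.mean(accs):.3f}")
```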
In Part II, we analyze this few-shot learning failure from a theoretical perspective, and show that a geometric property of image representations, namely variation in directions orthogonal to the one needed to linearly solve the task, slows learning. Given this observation, we motivate the hypothesis that current models of visual processing represent images along far more dimensions than humans do.
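The geometric observation in Part II can be simulated directly: hold the class-separating direction fixed, scale up variance in the orthogonal directions, and watch few-shot accuracy degrade even though the task remains linearly solvable. The Gaussian setup and parameters below are illustrative assumptions, not the thesis's actual analysis.

```python
# Sketch: with a prototype decoder learned from few exemplars, variance
# orthogonal to the task-relevant direction degrades accuracy, because
# the estimated decoder picks up noise along those excess dimensions.
import numpy as np

rng = np.random.default_rng(2)
dim, n_train, n_test, n_trials = 64, 5, 200, 200
mu = np.zeros(dim); mu[0] = 1.0          # task-relevant direction

def accuracy(orth_scale):
    scale = np.full(dim, orth_scale); scale[0] = 1.0
    accs = []
    for _ in range(n_trials):
        Xp = +mu + scale * rng.normal(size=(n_train, dim))
        Xn = -mu + scale * rng.normal(size=(n_train, dim))
        w = Xp.mean(0) - Xn.mean(0)      # few-shot prototype decoder
        Xt = np.vstack([+mu + scale * rng.normal(size=(n_test, dim)),
                        -mu + scale * rng.normal(size=(n_test, dim))])
        yt = np.repeat([1, -1], n_test)
        accs.append(np.mean(np.sign(Xt @ w) == yt))
    return np.mean(accs)

for s in (0.5, 1.0, 2.0, 4.0):
    print(f"orthogonal std={s:.1f}  few-shot accuracy={accuracy(s):.3f}")
```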
In Part III, we identify (and remove) these hypothesized excess dimensions by developing the "perceptual alignment" method, which combines a classical approach from experimental psychology (inferring internal stimulus representations from measurements of human similarity judgements) with deep learning methods to create new, lower-dimensional, image-computable representations that capture patterns of human similarity judgements. Finally, we show that models based on these new representations predict the ability of humans to few-shot learn across a variety of object domains. They also successfully predict the inability of humans to learn tasks based on representational dimensions that are present in baseline models but absent in perceptually aligned ones. Taken together, this thesis shows that specific, neurally mechanistic models based on a simple theory of learning are strong accounts of how humans rapidly learn new objects.
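A heavily simplified sketch of the perceptual-alignment idea: recover a low-dimensional stimulus embedding from a human dissimilarity matrix via classical MDS (the experimental-psychology half), then fit a linear map from model features to that embedding so the aligned representation remains image-computable. The random "human" dissimilarities, the toy ground truth, and the purely linear readout are placeholder assumptions, not the thesis's method.

```python
# Sketch: classical MDS on a human dissimilarity matrix, followed by a
# linear readout from model features into the recovered low-dim space.
import numpy as np

rng = np.random.default_rng(3)
n_stimuli, feat_dim, k = 100, 512, 8     # k = hypothesized human dims

features = rng.normal(size=(n_stimuli, feat_dim))         # model features
latent = features[:, :k]                                  # toy "human" space
D = np.linalg.norm(latent[:, None] - latent[None], axis=-1)  # dissimilarities

# Classical MDS: double-center squared dissimilarities, take top-k eigenpairs.
J = np.eye(n_stimuli) - 1.0 / n_stimuli
B = -0.5 * J @ (D ** 2) @ J
vals, vecs = np.linalg.eigh(B)
coords = vecs[:, -k:] * np.sqrt(np.clip(vals[-k:], 0, None))

# Linear readout from model features to the low-dim perceptual space,
# keeping the aligned representation image-computable.
W, *_ = np.linalg.lstsq(features, coords, rcond=None)
aligned = features @ W                   # lower-dimensional representation
```

In this toy setup the alignment step is exact because the "human" space is by construction a linear slice of the model features; the interesting empirical question the thesis addresses is how well such an alignment holds for real human judgements.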