3D Object-Oriented Learning: An End-to-end Transformation-Disentangled 3D Representation

Bibliographic Details
Main Authors: Liao, Qianli; Poggio, Tomaso
Format: Technical Report
Language: English
Published: 2018
Online Access: http://hdl.handle.net/1721.1/113002
Description
Summary: We provide a more detailed explanation of the ideas behind a recent paper on “Object-Oriented Deep Learning” [1] and extend it to handle 3D inputs/outputs. As in [1], every layer of the system takes in a list of “objects/symbols”, processes it, and outputs another list of objects/symbols. In this report, the properties of the objects/symbols are extended to carry 3D information, including a 3D orientation (i.e., a rotation quaternion, or yaw, pitch, and roll) and one extra coordinate dimension (the z-axis, or depth). The resulting model is a novel, end-to-end interpretable 3D representation that systematically factors out common 3D transformations such as translation and 3D rotation. As first proposed in [1] and discussed in more detail in [2], it offers a “symbolic disentanglement” solution to the problem of transformation invariance/equivariance. To demonstrate the effectiveness of the model, we show that it achieves perfect performance on 3D invariant recognition: trained on a single rotation of a 3D object, it recognizes that object under arbitrary 3D rotations (i.e., at arbitrary angles of yaw, pitch, and roll). Furthermore, in the more realistic case where depth information is not given (similar to viewpoint-invariant object recognition from 2D vision), our model generalizes reasonably well to novel viewpoints, while ConvNets fail to generalize.
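
To make the objects/symbols idea concrete, below is a minimal Python sketch of what such a transformation-disentangled representation could look like. It is an illustration under our own assumptions, not the authors' implementation; all names (Object3D, canonicalize, quat_rotate, etc.) are hypothetical. Each object carries appearance features plus an explicit 3D pose (a position with a z/depth axis and an orientation stored as a unit quaternion), and because the pose is represented symbolically, a global rotation/translation of the scene can be cancelled in closed form rather than learned.

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Object3D:
    # Hypothetical object/symbol: appearance is kept separate from pose,
    # which is what allows 3D transformations to be factored out.
    features: np.ndarray     # appearance/identity information
    position: np.ndarray     # (x, y, z); z is the extra depth coordinate
    orientation: np.ndarray  # unit quaternion (w, x, y, z)

def quat_multiply(q1: np.ndarray, q2: np.ndarray) -> np.ndarray:
    """Hamilton product of two quaternions (w, x, y, z)."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_rotate(q: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Rotate 3-vector v by unit quaternion q = (w, x, y, z)."""
    w, u = q[0], q[1:]
    return v + 2.0 * np.cross(u, np.cross(u, v) + w * v)

def canonicalize(objects: List[Object3D], ref: Object3D) -> List[Object3D]:
    """Re-express every object's pose relative to a reference object.

    Any global 3D rotation/translation applied to the whole scene cancels
    here, so downstream layers see a pose-invariant list of objects --
    one way to read the report's "symbolic disentanglement"."""
    inv = ref.orientation * np.array([1.0, -1.0, -1.0, -1.0])  # conjugate
    out = []
    for obj in objects:
        rel_pos = quat_rotate(inv, obj.position - ref.position)
        rel_ori = quat_multiply(inv, obj.orientation)
        out.append(Object3D(obj.features, rel_pos, rel_ori))
    return out

In this reading, a recognizer trained on a single pose generalizes to arbitrary yaw/pitch/roll because the canonicalized list is identical under every rigid transformation of the scene; when depth (z) is unobserved, the cancellation can only be approximate, which is consistent with the partial generalization to novel viewpoints the summary reports.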