Summary: | Although the exponential growth of visual data in various forms, such as images and videos, offers unprecedented opportunities to interpret the surrounding environment, natural language remains the primary means of conveying knowledge and information. There is therefore an increasing demand for frameworks that enable effective interaction between information from different modalities.
In this thesis, I investigate three directions for achieving an effective interaction between multi-modal information. The first direction focuses on building a consistent representation for information with similar semantic meaning. More specifically, in a high-dimensional semantic space, the representations of semantically similar information should lie close to each other within an appropriate range, regardless of their modalities. The second direction is to achieve an effective correlation between image visual attributes and the corresponding semantic words, which first requires the network to recognize the semantic information in both the image and the text, and then allows it to relate the two. The third direction is to construct a lightweight architecture for models with inputs from multiple domains. When a network involves multi-modal information, the number of trainable parameters often needs to increase considerably so that the network can comprehensively capture the correlations between pieces of information separated by a domain gap; the resulting demand for significant computing resources can greatly hinder deployment and makes such a framework impractical for real-world applications. The contributions in these directions are as follows.
First, to obtain a consistent representation, both contrastive and clustering learning are adopted in the generative network: contrastive learning maximizes the mutual information between paired instances provided by a given dataset, while clustering learning groups instances with similar semantic meaning into the same cluster and pushes dissimilar ones away from each other. In this way, a structured joint semantic space can be built, in which instances with similar semantic meaning are closely grouped within an appropriate range, ensuring a consistent representation regardless of their modalities.
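To make the combination of the two objectives concrete, the following is a minimal sketch in PyTorch, assuming paired image and text embeddings and learnable cluster centroids (the names `image_feat`, `text_feat`, and `centroids` are illustrative and not taken from the thesis): an InfoNCE-style contrastive term pulls matched pairs together, and a soft-assignment clustering term encourages semantically similar embeddings to commit to shared centroids.

```python
# Minimal sketch of a joint contrastive + clustering objective.
# This is an illustrative assumption, not the thesis implementation.
import torch
import torch.nn.functional as F

def contrastive_loss(image_feat, text_feat, temperature=0.07):
    """InfoNCE-style loss: matched image/text pairs are pulled together,
    while mismatched pairs within the batch act as negatives."""
    image_feat = F.normalize(image_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    logits = image_feat @ text_feat.t() / temperature          # (B, B) similarities
    targets = torch.arange(image_feat.size(0), device=image_feat.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def clustering_loss(features, centroids, temperature=0.1):
    """Soft-assignment clustering term: each embedding is encouraged to commit
    confidently to one of the learnable centroids."""
    features = F.normalize(features, dim=-1)
    centroids = F.normalize(centroids, dim=-1)
    probs = (features @ centroids.t() / temperature).softmax(dim=-1)  # (B, K)
    # Minimise assignment entropy so instances collapse onto distinct clusters.
    return -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()

# Illustrative usage with random embeddings and learnable centroids.
B, D, K = 8, 256, 32
image_feat, text_feat = torch.randn(B, D), torch.randn(B, D)
centroids = torch.nn.Parameter(torch.randn(K, D))
loss = (contrastive_loss(image_feat, text_feat)
        + clustering_loss(torch.cat([image_feat, text_feat]), centroids))
```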
Second, to achieve an effective correlation between multi-modal information, three approaches are proposed, each of which correlates image visual attributes with the corresponding semantic text descriptions, allowing the network to learn the semantic meaning of both text and image information and thus to achieve an effective interaction. More specifically, I first investigate a word-level attention-based connection aided by a complementary word-level discriminator: the attention allows the network to identify specific image visual attributes aligned with the corresponding semantic words, and the complementary word-level discriminator provides fine-grained training feedback so that the network correctly captures this correlation. Then, a text-image affine combination is introduced, which adopts an affine transformation to combine text and image features in the generation process, giving the network a regional selection effect that selectively fuses text-required image attributes into the generation pipeline while preserving text-irrelevant contents. Moreover, a semi-parametric memory-driven approach is proposed, which takes advantage of both parametric and non-parametric techniques: the non-parametric component is a memory bank of pre-processed information constructed from the training dataset, and the parametric component is a neural network. In this way, the parametric component retains the benefits of end-to-end training of highly expressive models, while the non-parametric component allows the network to make full use of large datasets at inference time.
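As an illustration of the affine combination idea, the sketch below is a simplified assumption rather than the exact architecture proposed in the thesis: a sentence embedding predicts channel-wise scale and shift parameters that modulate the image feature map, so text-relevant content can be rewritten while text-irrelevant content is largely carried through. The module name `TextImageAffine` and the dimensions are illustrative.

```python
# Sketch of an affine text-image combination (illustrative assumption).
import torch
import torch.nn as nn

class TextImageAffine(nn.Module):
    def __init__(self, text_dim, img_channels):
        super().__init__()
        # Two linear layers map the sentence embedding to per-channel
        # scale (gamma) and shift (beta) parameters.
        self.to_gamma = nn.Linear(text_dim, img_channels)
        self.to_beta = nn.Linear(text_dim, img_channels)

    def forward(self, img_feat, text_feat):
        # img_feat: (B, C, H, W), text_feat: (B, text_dim)
        gamma = self.to_gamma(text_feat).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(text_feat).unsqueeze(-1).unsqueeze(-1)
        # Affine modulation of the image features conditioned on the text.
        return img_feat * (1 + gamma) + beta

# Illustrative usage.
fusion = TextImageAffine(text_dim=256, img_channels=64)
fused = fusion(torch.randn(2, 64, 32, 32), torch.randn(2, 256))
```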
Third, two solutions are presented to reduce the computing resources required by a network with inputs from different modalities, allowing the network to be easily deployed in various areas. More specifically, I improve the capabilities of both the generator and the discriminator in conditional GANs to avoid blindly increasing the number of trainable parameters, and I construct a single-directional discriminator that combines two training goals (i.e., better image quality and text-image semantic alignment) into a single direction (i.e., improving the quality of the fusion features), reducing redundancy in conditional GANs.
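The single-directional idea can be sketched as follows; this is one possible reading under stated assumptions rather than the thesis implementation: the discriminator fuses the image feature with the text feature once and scores the fused feature with a single head, so that improving image quality and improving text-image alignment are driven by the same output.

```python
# Sketch of a single-directional text-image discriminator (illustrative assumption).
import torch
import torch.nn as nn

class SingleDirectionalDiscriminator(nn.Module):
    def __init__(self, text_dim=256, channels=64):
        super().__init__()
        # Convolutional backbone extracting a global image feature.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
        )
        # A single head judges the fused text-image feature, so realism and
        # semantic alignment are optimised in one direction.
        self.head = nn.Sequential(
            nn.Linear(channels * 2 + text_dim, channels * 2), nn.LeakyReLU(0.2),
            nn.Linear(channels * 2, 1),
        )

    def forward(self, image, text_feat):
        img_feat = self.backbone(image).flatten(1)        # (B, 2*channels)
        fused = torch.cat([img_feat, text_feat], dim=1)   # fuse image and text
        return self.head(fused)                           # one score per pair

# Illustrative usage.
disc = SingleDirectionalDiscriminator()
score = disc(torch.randn(2, 3, 64, 64), torch.randn(2, 256))
```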
This work paves the way for building lightweight frameworks that achieve an effective interaction between multi-modal information and can be easily deployed in various real-world applications.
|