Redesigning Embedding Layers for Queries, Keys, and Values in Cross-Covariance Image Transformers
Several approaches in vision transformers attempt to reduce the quadratic time complexity in the number of tokens to linear. Cross-covariance image transformers (XCiT) are one such technique. However, despite these efforts, increasing the token dimension still causes quadratic growth in time complexity...
Main Authors: Jaesin Ahn, Jiuk Hong, Jeongwoo Ju, Heechul Jung
Format: Article
Language: English
Published: MDPI AG, 2023-04-01
Series: Mathematics
Subjects: vision transformer; Q/K/V embedding; shared embedding; non-linear embedding; image classification
Online Access: https://www.mdpi.com/2227-7390/11/8/1933
author | Jaesin Ahn; Jiuk Hong; Jeongwoo Ju; Heechul Jung
collection | DOAJ |
description | Several approaches in vision transformers attempt to reduce the quadratic time complexity in the number of tokens to linear. Cross-covariance image transformers (XCiT) are one such technique. However, despite these efforts, increasing the token dimension still causes quadratic growth in time complexity, and the dimension is a key parameter for achieving strong generalization performance. In this paper, a novel method is proposed to improve the generalization performance of XCiT models without increasing the token dimension. We redesign the embedding layers of queries, keys, and values as separate non-linear embedding (SNE), partially-shared non-linear embedding (P-SNE), and fully-shared non-linear embedding (F-SNE). The proposed structure, at different model-size settings, achieved 71.4%, 77.8%, and 82.1% top-1 accuracy on ImageNet-1k, compared with 69.9%, 77.1%, and 82.0% for the original XCiT-N12, XCiT-T12, and XCiT-S12 models, respectively. Additionally, the proposed model achieved 94.8% on average in transfer learning experiments on CIFAR-10, CIFAR-100, Stanford Cars, and STL-10, surpassing the XCiT-S12 baseline (94.5%). In particular, the proposed models demonstrated considerable improvements on the out-of-distribution detection task compared to the original XCiT models.
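The record does not include the paper's implementation, so the following is a minimal PyTorch sketch of the two ideas the abstract names: cross-covariance attention (XCA), whose cost is linear in the number of tokens because attention is computed over the channel-covariance matrix rather than the token-similarity matrix, and the three Q/K/V embedding variants (SNE, P-SNE, F-SNE). The two-layer MLP embedding, the GELU activation, and the choice of which projections P-SNE shares (here Q and K) are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch (not the authors' released code) of XCA with non-linear
# Q/K/V embeddings under three sharing modes. Layer widths, GELU, and the
# P-SNE sharing pattern are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLinearQKV(nn.Module):
    """Produces Q, K, V via non-linear embeddings with selectable sharing."""
    def __init__(self, dim: int, mode: str = "sne"):
        super().__init__()
        def mlp():  # assumed two-layer non-linear embedding
            return nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        if mode == "sne":      # separate: three independent embeddings
            self.q, self.k, self.v = mlp(), mlp(), mlp()
        elif mode == "psne":   # partially shared: Q and K share one embedding (assumed)
            shared = mlp()
            self.q = self.k = shared
            self.v = mlp()
        elif mode == "fsne":   # fully shared: one embedding for Q, K, and V
            self.q = self.k = self.v = mlp()
        else:
            raise ValueError(f"unknown mode: {mode}")

    def forward(self, x):
        return self.q(x), self.k(x), self.v(x)

class XCA(nn.Module):
    """Cross-covariance attention: softmax over a per-head (d/h x d/h) channel
    covariance, so cost is O(N * d^2) in tokens N, not O(N^2 * d)."""
    def __init__(self, dim: int, num_heads: int = 4, mode: str = "psne"):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = NonLinearQKV(dim, mode)
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, d)
        B, N, d = x.shape
        h, hd = self.num_heads, d // self.num_heads
        q, k, v = self.qkv(x)
        # reshape to (B, h, hd, N) so attention runs across channels
        q = q.reshape(B, N, h, hd).permute(0, 2, 3, 1)
        k = k.reshape(B, N, h, hd).permute(0, 2, 3, 1)
        v = v.reshape(B, N, h, hd).permute(0, 2, 3, 1)
        q = F.normalize(q, dim=-1)              # L2-normalize along tokens
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, h, hd, hd)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).permute(0, 3, 1, 2).reshape(B, N, d)
        return self.proj(out)

x = torch.randn(2, 196, 64)                     # 14x14 patch tokens, dim 64
print(XCA(64, num_heads=4, mode="fsne")(x).shape)  # torch.Size([2, 196, 64])
```

Under these assumptions, sharing the embedding (P-SNE, F-SNE) keeps the parameter count down while the non-linearity adds expressiveness relative to a single linear qkv projection; the attention map stays (d/h x d/h) regardless of token count, which is what makes the complexity linear in N.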
format | Article |
id | doaj.art-a6ae042da3454302b43930d9ffe74ea7 |
institution | Directory Open Access Journal |
issn | 2227-7390 |
language | English |
publishDate | 2023-04-01 |
publisher | MDPI AG |
record_format | Article |
series | Mathematics |
spelling | Jaesin Ahn (Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea); Jiuk Hong (Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea); Jeongwoo Ju (Captos Co., Ltd., Yangsan 50652, Republic of Korea); Heechul Jung (Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea). Redesigning Embedding Layers for Queries, Keys, and Values in Cross-Covariance Image Transformers. Mathematics 11(8):1933, 2023-04-01. MDPI AG. ISSN 2227-7390. doi:10.3390/math11081933. https://www.mdpi.com/2227-7390/11/8/1933
title | Redesigning Embedding Layers for Queries, Keys, and Values in Cross-Covariance Image Transformers |
topic | vision transformer; Q/K/V embedding; shared embedding; non-linear embedding; image classification
url | https://www.mdpi.com/2227-7390/11/8/1933 |