Redesigning Embedding Layers for Queries, Keys, and Values in Cross-Covariance Image Transformers

Several approaches to vision transformers attempt to reduce the time complexity that grows quadratically with the number of tokens to linear. Cross-covariance image transformers (XCiT) are one such technique. However, despite these efforts, time complexity still grows quadratically as the token dimension increases, and that dimension is a key parameter for achieving strong generalization performance. In this paper, a novel method is proposed to improve the generalization performance of XCiT models without increasing the token dimension. We redesigned the embedding layers for the queries, keys, and values, proposing separate non-linear embedding (SNE), partially-shared non-linear embedding (P-SNE), and fully-shared non-linear embedding (F-SNE). The proposed structure, at different model sizes, achieved 71.4%, 77.8%, and 82.1% top-1 accuracy on ImageNet-1k, compared with the 69.9%, 77.1%, and 82.0% obtained by the original XCiT-N12, XCiT-T12, and XCiT-S12 models, respectively. Additionally, the proposed model achieved 94.8% on average in transfer learning experiments on CIFAR-10, CIFAR-100, Stanford Cars, and STL-10, surpassing the XCiT-S12 baseline (94.5%). In particular, the proposed models showed considerable improvements over the original XCiT models on out-of-distribution detection.
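To make the abstract's terminology concrete, the sketch below pairs single-head cross-covariance attention (XCA), whose attention map is dim x dim and therefore scales quadratically with the token dimension rather than the token count, with the three Q/K/V embedding designs named above. This is a minimal PyTorch reconstruction from the abstract alone, not the authors' implementation: the two-layer GELU MLPs, the point at which P-SNE shares parameters, and all names (NonLinearQKV, XCA, hidden, mode) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(dim: int, hidden: int) -> nn.Sequential:
    # A small non-linear embedding standing in for the usual linear Q/K/V projection.
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))


class NonLinearQKV(nn.Module):
    """Q/K/V embeddings with the three sharing schemes named in the abstract:
    "sne" = separate, "p-sne" = partially shared, "f-sne" = fully shared."""

    def __init__(self, dim: int, hidden: int, mode: str = "sne"):
        super().__init__()
        self.mode = mode
        if mode == "sne":          # three independent non-linear embeddings
            self.q, self.k, self.v = mlp(dim, hidden), mlp(dim, hidden), mlp(dim, hidden)
        elif mode == "p-sne":      # one shared trunk, separate lightweight heads
            self.trunk = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())
            self.q = nn.Linear(hidden, dim)
            self.k = nn.Linear(hidden, dim)
            self.v = nn.Linear(hidden, dim)
        elif mode == "f-sne":      # a single embedding reused for Q, K, and V
            self.shared = mlp(dim, hidden)
        else:
            raise ValueError(f"unknown mode: {mode}")

    def forward(self, x):          # x: (batch, tokens, dim)
        if self.mode == "sne":
            return self.q(x), self.k(x), self.v(x)
        if self.mode == "p-sne":
            h = self.trunk(x)
            return self.q(h), self.k(h), self.v(h)
        e = self.shared(x)
        return e, e, e


class XCA(nn.Module):
    """Single-head cross-covariance attention: the attention map is (dim x dim),
    so its cost grows quadratically with the token dimension, not the token count."""

    def __init__(self, dim: int, hidden: int, mode: str = "p-sne"):
        super().__init__()
        self.qkv = NonLinearQKV(dim, hidden, mode)
        self.temperature = nn.Parameter(torch.ones(1))  # learnable softmax scale
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                 # x: (batch, tokens, dim)
        q, k, v = self.qkv(x)
        q = F.normalize(q, dim=1)                         # L2-normalize along tokens
        k = F.normalize(k, dim=1)
        attn = (q.transpose(-2, -1) @ k) * self.temperature  # (batch, dim, dim)
        attn = attn.softmax(dim=-1)
        return self.proj(v @ attn.transpose(-2, -1))      # mix channels, keep tokens


# Toy usage: 196 tokens (14x14 patches) of dimension 128.
x = torch.randn(2, 196, 128)
for m in ("sne", "p-sne", "f-sne"):
    print(m, XCA(dim=128, hidden=256, mode=m)(x).shape)  # torch.Size([2, 196, 128])
```

Note how parameter count shrinks as sharing increases, from three MLPs in SNE down to one in F-SNE, which is consistent with the abstract's goal of improving generalization without growing the token dimension.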

Bibliographic Details
Main Authors: Jaesin Ahn, Jiuk Hong, Jeongwoo Ju, Heechul Jung
Affiliations: Jaesin Ahn, Jiuk Hong, and Heechul Jung: Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea; Jeongwoo Ju: Captos Co., Ltd., Yangsan 50652, Republic of Korea
Format: Article
Language: English
Published: MDPI AG, 2023-04-01
Series: Mathematics
ISSN: 2227-7390
DOI: 10.3390/math11081933
Subjects: vision transformer; Q/K/V embedding; shared embedding; non-linear embedding; image classification
Online Access: https://www.mdpi.com/2227-7390/11/8/1933