Redesigning Embedding Layers for Queries, Keys, and Values in Cross-Covariance Image Transformers
Several approaches in vision transformers attempt to reduce the quadratic time complexity in the number of tokens to linear. Cross-covariance image transformers (XCiT) are one such technique. However, despite these efforts, increasing the token dimension still causes quadratic growth in time complexity...
Main Authors: Jaesin Ahn, Jiuk Hong, Jeongwoo Ju, Heechul Jung
Format: Article
Language: English
Published: MDPI AG, 2023-04-01
Series: Mathematics
Subjects: vision transformer; Q/K/V embedding; shared embedding; non-linear embedding; image classification
Online Access: https://www.mdpi.com/2227-7390/11/8/1933
author | Jaesin Ahn; Jiuk Hong; Jeongwoo Ju; Heechul Jung
collection | DOAJ |
description | Several approaches in vision transformers attempt to reduce the quadratic time complexity in the number of tokens to linear. Cross-covariance image transformers (XCiT) are one such technique. However, despite these efforts, increasing the token dimension still causes quadratic growth in time complexity, and the dimension is a key parameter for achieving strong generalization performance. In this paper, a novel method is proposed to improve the generalization performance of XCiT models without increasing the token dimension. We redesign the embedding layers of queries, keys, and values as separate non-linear embedding (SNE), partially-shared non-linear embedding (P-SNE), and fully-shared non-linear embedding (F-SNE). The proposed structure, at different model-size settings, achieved 71.4%, 77.8%, and 82.1% top-1 accuracy on ImageNet-1k, compared with 69.9%, 77.1%, and 82.0% for the original XCiT-N12, XCiT-T12, and XCiT-S12 models, respectively. Additionally, the proposed model achieved 94.8% on average in transfer learning experiments on CIFAR-10, CIFAR-100, Stanford Cars, and STL-10, surpassing the XCiT-S12 baseline (94.5%). In particular, the proposed models demonstrated considerable improvements on the out-of-distribution detection task compared to the original XCiT models.
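The record does not include the paper's implementation, so the following is a minimal PyTorch sketch of the two ideas the abstract names: cross-covariance attention (XCA), whose cost is linear in the number of tokens because attention is computed over the channel-covariance matrix rather than the token-similarity matrix, and the three Q/K/V embedding variants (SNE, P-SNE, F-SNE). The two-layer MLP embedding, the GELU activation, and the choice of which projections P-SNE shares (here Q and K) are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch (not the authors' released code) of XCA with non-linear
# Q/K/V embeddings under three sharing modes. Layer widths, GELU, and the
# P-SNE sharing pattern are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLinearQKV(nn.Module):
    """Produces Q, K, V via non-linear embeddings with selectable sharing."""
    def __init__(self, dim: int, mode: str = "sne"):
        super().__init__()
        def mlp():  # assumed two-layer non-linear embedding
            return nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        if mode == "sne":      # separate: three independent embeddings
            self.q, self.k, self.v = mlp(), mlp(), mlp()
        elif mode == "psne":   # partially shared: Q and K share one embedding (assumed)
            shared = mlp()
            self.q = self.k = shared
            self.v = mlp()
        elif mode == "fsne":   # fully shared: one embedding for Q, K, and V
            self.q = self.k = self.v = mlp()
        else:
            raise ValueError(f"unknown mode: {mode}")

    def forward(self, x):
        return self.q(x), self.k(x), self.v(x)

class XCA(nn.Module):
    """Cross-covariance attention: softmax over a per-head (d/h x d/h) channel
    covariance, so cost is O(N * d^2) in tokens N, not O(N^2 * d)."""
    def __init__(self, dim: int, num_heads: int = 4, mode: str = "psne"):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = NonLinearQKV(dim, mode)
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, d)
        B, N, d = x.shape
        h, hd = self.num_heads, d // self.num_heads
        q, k, v = self.qkv(x)
        # reshape to (B, h, hd, N) so attention runs across channels
        q = q.reshape(B, N, h, hd).permute(0, 2, 3, 1)
        k = k.reshape(B, N, h, hd).permute(0, 2, 3, 1)
        v = v.reshape(B, N, h, hd).permute(0, 2, 3, 1)
        q = F.normalize(q, dim=-1)              # L2-normalize along tokens
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, h, hd, hd)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).permute(0, 3, 1, 2).reshape(B, N, d)
        return self.proj(out)

x = torch.randn(2, 196, 64)                     # 14x14 patch tokens, dim 64
print(XCA(64, num_heads=4, mode="fsne")(x).shape)  # torch.Size([2, 196, 64])
```

Under these assumptions, sharing the embedding (P-SNE, F-SNE) keeps the parameter count down while the non-linearity adds expressiveness relative to a single linear qkv projection; the attention map stays (d/h x d/h) regardless of token count, which is what makes the complexity linear in N.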
format | Article |
id | doaj.art-a6ae042da3454302b43930d9ffe74ea7 |
institution | Directory Open Access Journal |
issn | 2227-7390 |
language | English |
publishDate | 2023-04-01 |
publisher | MDPI AG |
record_format | Article |
series | Mathematics |
spelling | Jaesin Ahn (Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea); Jiuk Hong (Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea); Jeongwoo Ju (Captos Co., Ltd., Yangsan 50652, Republic of Korea); Heechul Jung (Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea). Redesigning Embedding Layers for Queries, Keys, and Values in Cross-Covariance Image Transformers. Mathematics 11(8):1933, 2023-04-01. MDPI AG. ISSN 2227-7390. doi:10.3390/math11081933. https://www.mdpi.com/2227-7390/11/8/1933
title | Redesigning Embedding Layers for Queries, Keys, and Values in Cross-Covariance Image Transformers |
topic | vision transformer; Q/K/V embedding; shared embedding; non-linear embedding; image classification
url | https://www.mdpi.com/2227-7390/11/8/1933 |