No Fine-Tuning, No Cry: Robust SVD for Compressing Deep Networks

A common technique for compressing a neural network is to compute the <i>k</i>-rank <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>ℓ</mi><mn>2</mn></msub></semantics></math></inline-formula> approximation <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>A</mi><mi>k</mi></msub></semantics></math></inline-formula> of the matrix <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>A</mi><mo>∈</mo><msup><mi mathvariant="double-struck">R</mi><mrow><mi>n</mi><mo>×</mo><mi>d</mi></mrow></msup></mrow></semantics></math></inline-formula> via SVD that corresponds to a fully connected layer (or embedding layer).

Full description

Bibliographic Details
Main Authors: Murad Tukan, Alaa Maalouf, Matan Weksler, Dan Feldman
Format: Article
Language: English
Published: MDPI AG 2021-08-01
Series: Sensors
Subjects:
Online Access: https://www.mdpi.com/1424-8220/21/16/5599
_version_ 1797522053206638592
author Murad Tukan
Alaa Maalouf
Matan Weksler
Dan Feldman
author_facet Murad Tukan
Alaa Maalouf
Matan Weksler
Dan Feldman
author_sort Murad Tukan
collection DOAJ
description A common technique for compressing a neural network is to compute the <i>k</i>-rank <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>ℓ</mi><mn>2</mn></msub></semantics></math></inline-formula> approximation <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>A</mi><mi>k</mi></msub></semantics></math></inline-formula> of the matrix <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>A</mi><mo>∈</mo><msup><mi mathvariant="double-struck">R</mi><mrow><mi>n</mi><mo>×</mo><mi>d</mi></mrow></msup></mrow></semantics></math></inline-formula> via SVD that corresponds to a fully connected layer (or embedding layer). Here, <i>d</i> is the number of input neurons in the layer, <i>n</i> is the number in the next one, and <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>A</mi><mi>k</mi></msub></semantics></math></inline-formula> is stored in <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>O</mi><mo>(</mo><mo>(</mo><mi>n</mi><mo>+</mo><mi>d</mi><mo>)</mo><mi>k</mi><mo>)</mo></mrow></semantics></math></inline-formula> memory instead of <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>O</mi><mo>(</mo><mi>n</mi><mi>d</mi><mo>)</mo></mrow></semantics></math></inline-formula>. Then, a fine-tuning step is used to improve this initial compression. However, end users may not have the required computation resources, time, or budget to run this fine-tuning stage. Furthermore, the original training set may not be available. In this paper, we provide an algorithm for compressing neural networks using a similar initial compression time (to common techniques) but without the fine-tuning step. 
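The storage claim in the abstract is easy to make concrete. Below is a minimal NumPy sketch of the standard rank-<i>k</i> SVD compression of a dense layer's weight matrix; the sizes and variable names are illustrative, not taken from the paper.

```python
import numpy as np

# Illustrative sizes: d = 768 input neurons, n = 512 neurons in the next layer.
n, d, k = 512, 768, 64
rng = np.random.default_rng(0)
A = rng.standard_normal((n, d))  # stand-in for a trained weight matrix

# Rank-k l2 (Frobenius-optimal) approximation A_k via truncated SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
L = U[:, :k] * s[:k]   # n x k factor (singular values folded in)
R = Vt[:k, :]          # k x d factor
A_k = L @ R            # need not be materialized at inference time

# Storing the two factors costs (n + d) * k numbers instead of n * d.
print((n + d) * k, "vs", n * d)  # 81920 vs 393216
```

At inference, the compressed layer computes x ↦ L(Rx), two thin matrix products, instead of one dense product with A.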
The main idea is replacing the <i>k</i>-rank <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>ℓ</mi><mn>2</mn></msub></semantics></math></inline-formula> approximation with <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>ℓ</mi><mi>p</mi></msub></semantics></math></inline-formula>, for <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>p</mi><mo>∈</mo><mo>[</mo><mn>1</mn><mo>,</mo><mn>2</mn><mo>]</mo></mrow></semantics></math></inline-formula>, which is known to be less sensitive to outliers but much harder to compute. Our main technical result is a practical and provable approximation algorithm to compute it for any <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>p</mi><mo>≥</mo><mn>1</mn></mrow></semantics></math></inline-formula>, based on modern techniques in computational geometry. Extensive experimental results on the GLUE benchmark for compressing the networks BERT, DistilBERT, XLNet, and RoBERTa confirm this theoretical advantage.
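The paper's contribution is a provable approximation algorithm for this <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>ℓ</mi><mi>p</mi></msub></semantics></math></inline-formula> objective, built on computational-geometry tools (Löwner ellipsoids); that algorithm is not reproduced here. As a rough illustration of the objective only, the sketch below uses the classic iteratively reweighted least squares (IRLS) heuristic with an SVD warm start; every name in it is mine, not the authors'.

```python
import numpy as np

def lp_low_rank_irls(A, k, p=1.0, iters=25, eps=1e-6):
    """Heuristic entrywise l_p rank-k approximation via iteratively
    reweighted least squares. NOT the paper's provable algorithm --
    just a common baseline for the same objective."""
    n, d = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    L, R = U[:, :k] * s[:k], Vt[:k, :].copy()          # SVD warm start

    def err(L, R):
        return (np.abs(A - L @ R) ** p).sum()

    best_L, best_R, best = L.copy(), R.copy(), err(L, R)
    for _ in range(iters):
        # IRLS weights: small residuals get large weight, outliers small.
        W = np.maximum(np.abs(A - L @ R), eps) ** (p - 2)
        for i in range(n):                 # weighted LS update, rows of L
            G = (R * W[i]) @ R.T
            L[i] = np.linalg.solve(G, (R * W[i]) @ A[i])
        for j in range(d):                 # weighted LS update, cols of R
            G = (L * W[:, j][:, None]).T @ L
            R[:, j] = np.linalg.solve(G, L.T @ (W[:, j] * A[:, j]))
        if err(L, R) < best:               # keep the best iterate seen
            best_L, best_R, best = L.copy(), R.copy(), err(L, R)
    return best_L, best_R

# Demo: a rank-2 matrix corrupted by one large outlier entry.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))
A[0, 0] += 50.0                                        # the outlier
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A2 = (U[:, :2] * s[:2]) @ Vt[:2, :]                    # plain l2 / SVD fit
L1, R1 = lp_low_rank_irls(A, k=2, p=1.0)
print("l1 error, SVD :", np.abs(A - A2).sum())
print("l1 error, IRLS:", np.abs(A - L1 @ R1).sum())
```

On such outlier-contaminated data the <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>ℓ</mi><mn>1</mn></msub></semantics></math></inline-formula> fit typically refuses to spend its rank budget on the corrupted entry, while the SVD fit compromises on all the clean entries to reduce the one squared outlier residual.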
first_indexed 2024-03-10T08:23:07Z
format Article
id doaj.art-18b56702a14f476393df979512e7125a
institution Directory Open Access Journal
issn 1424-8220
language English
last_indexed 2024-03-10T08:23:07Z
publishDate 2021-08-01
publisher MDPI AG
record_format Article
series Sensors
spelling doaj.art-18b56702a14f476393df979512e7125a 2023-11-22T09:42:32Z eng MDPI AG Sensors 1424-8220 2021-08-01 Vol. 21, No. 16, Art. 5599 doi:10.3390/s21165599 No Fine-Tuning, No Cry: Robust SVD for Compressing Deep Networks
Murad Tukan (The Robotics and Big Data Lab, Department of Computer Science, University of Haifa, Haifa 3498838, Israel)
Alaa Maalouf (The Robotics and Big Data Lab, Department of Computer Science, University of Haifa, Haifa 3498838, Israel)
Matan Weksler (Samsung Research Israel, Herzliya 4659071, Israel)
Dan Feldman (The Robotics and Big Data Lab, Department of Computer Science, University of Haifa, Haifa 3498838, Israel)
https://www.mdpi.com/1424-8220/21/16/5599
matrix factorization
neural networks compression
robust low rank approximation
Löwner ellipsoid
spellingShingle Murad Tukan
Alaa Maalouf
Matan Weksler
Dan Feldman
No Fine-Tuning, No Cry: Robust SVD for Compressing Deep Networks
Sensors
matrix factorization
neural networks compression
robust low rank approximation
Löwner ellipsoid
title No Fine-Tuning, No Cry: Robust SVD for Compressing Deep Networks
title_full No Fine-Tuning, No Cry: Robust SVD for Compressing Deep Networks
title_fullStr No Fine-Tuning, No Cry: Robust SVD for Compressing Deep Networks
title_full_unstemmed No Fine-Tuning, No Cry: Robust SVD for Compressing Deep Networks
title_short No Fine-Tuning, No Cry: Robust SVD for Compressing Deep Networks
title_sort no fine tuning no cry robust svd for compressing deep networks
topic matrix factorization
neural networks compression
robust low rank approximation
Löwner ellipsoid
url https://www.mdpi.com/1424-8220/21/16/5599
work_keys_str_mv AT muradtukan nofinetuningnocryrobustsvdforcompressingdeepnetworks
AT alaamaalouf nofinetuningnocryrobustsvdforcompressingdeepnetworks
AT matanweksler nofinetuningnocryrobustsvdforcompressingdeepnetworks
AT danfeldman nofinetuningnocryrobustsvdforcompressingdeepnetworks