Energy-latency manipulation of multi-modal large language models via verbose samples

Despite the exceptional performance of multi-modal large language models (MLLMs), their deployment requires substantial computational resources. If malicious users can induce high energy consumption and latency (an energy-latency cost), they can exhaust computational resources and harm service availability. In this paper, we investigate this vulnerability for MLLMs, particularly image-based and video-based ones, and aim to induce a high energy-latency cost during inference by crafting an imperceptible perturbation. We find that the energy-latency cost can be driven up by maximizing the length of generated sequences, which motivates us to propose verbose samples, including verbose images and videos. Concretely, we propose two modality-non-specific losses: a loss that delays the end-of-sequence (EOS) token and an uncertainty loss that increases the uncertainty over each generated token. In addition, improving diversity encourages longer responses by increasing complexity, which inspires the following modality-specific losses. For verbose images, a token diversity loss promotes diverse hidden states; for verbose videos, a frame feature diversity loss increases the feature diversity among frames. To balance these losses, we propose a temporal weight adjustment algorithm. Experiments demonstrate that our verbose samples can substantially extend the length of generated sequences.
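The loss components named in the abstract can be sketched in a toy, model-free form. This is a minimal illustration, not the paper's implementation: the function names, the use of plain NumPy arrays standing in for model logits and hidden states, and the exact formulations (mean EOS probability, negative mean entropy, mean pairwise cosine similarity) are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over a (positions, vocab) logit array."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def eos_delay_loss(logits, eos_id):
    """Mean EOS probability across generated positions.

    Minimizing this discourages the model from emitting the EOS token,
    which delays termination and lengthens the generated sequence.
    """
    probs = softmax(logits)
    return probs[:, eos_id].mean()

def uncertainty_loss(logits):
    """Negative mean entropy of the per-token distributions.

    Minimizing this *increases* the uncertainty over each generated
    token, making a confident early stop less likely.
    """
    probs = softmax(logits)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return -entropy.mean()

def token_diversity_loss(hidden):
    """Mean off-diagonal cosine similarity between token hidden states.

    Minimizing this pushes hidden states apart, promoting diverse
    tokens; the same idea applied to per-frame features would give a
    frame feature diversity loss for videos.
    """
    h = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    sim = h @ h.T
    n = h.shape[0]
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))
```

In an actual attack these losses would be combined (the abstract's temporal weight adjustment rebalances them over optimization steps) and differentiated with respect to the input perturbation; the NumPy version above only shows what each term measures.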

Bibliographic Details
Main Authors: Gao, K, Gu, J, Bai, Y, Xia, S-T, Torr, P, Liu, W, Li, Z
Format: Conference item
Language: English
Published: 2024
Institution: University of Oxford