Inducing high energy-latency of large vision-language models with verbose images

Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across various multi-modal tasks. However, deploying VLMs requires substantial energy consumption and computational resources. If attackers can maliciously induce high energy consumption and long latency (a high energy-latency cost) during inference, they can exhaust a service's computational resources. In this paper, we explore this attack surface against the availability of VLMs and aim to induce a high energy-latency cost during inference. We find that the energy-latency cost of VLM inference can be manipulated by maximizing the length of the generated sequence. To this end, we propose verbose images, imperceptible perturbations crafted to induce VLMs to generate long sentences during inference. Concretely, we design three loss objectives. First, a loss is designed to delay the occurrence of the end-of-sequence (EOS) token, the signal that tells a VLM to stop generating further tokens. Second, an uncertainty loss and a token diversity loss increase the uncertainty over each generated token and the diversity among all tokens of the generated sequence, respectively, which breaks output dependency at the token level and the sequence level. Finally, a temporal weight adjustment algorithm effectively balances these losses during optimization. Extensive experiments demonstrate that our verbose images increase the length of generated sequences by 7.87× and 8.56× over original images on the MS-COCO and ImageNet datasets, posing potential challenges for various applications. Our code is available at https://github.com/KuofengGao/Verbose_Images.

Bibliographic Details
Main Authors: Gao, K, Bai, Y, Gu, J, Xia, ST, Torr, P, Li, Z, Liu, W
Format: Conference item
Language: English
Published: OpenReview, 2024
Institution: University of Oxford
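
For illustration, the three loss objectives outlined in the abstract can be expressed compactly. The sketch below is a minimal, hypothetical rendering of those terms, not the authors' implementation: the function name `verbose_image_losses`, the entropy-based uncertainty term, and the cosine-similarity diversity term are assumptions, and the temporal weight adjustment that balances the losses is omitted. The authors' code at https://github.com/KuofengGao/Verbose_Images is authoritative.

```python
# Illustrative sketch, assuming a PyTorch VLM decoder that exposes
# per-step logits for a generated sequence. Formulations are assumptions.
import torch
import torch.nn.functional as F

def verbose_image_losses(logits: torch.Tensor, eos_token_id: int):
    """logits: (T, V) decoder logits for a generated sequence of T tokens."""
    probs = F.softmax(logits, dim=-1)                         # (T, V)

    # 1) Delayed-EOS loss: suppress the EOS probability at every step,
    #    so generation runs longer.
    loss_eos = probs[:, eos_token_id].mean()

    # 2) Uncertainty loss: negated mean entropy, so minimizing the loss
    #    increases the uncertainty over each generated token.
    entropy = -(probs * (probs + 1e-12).log()).sum(dim=-1)    # (T,)
    loss_uncertainty = -entropy.mean()

    # 3) Token diversity loss: mean pairwise cosine similarity between
    #    per-step distributions; minimizing it pushes the steps of the
    #    sequence apart (assumes T > 1).
    normed = F.normalize(probs, dim=-1)
    sim = normed @ normed.T                                   # (T, T)
    t = sim.size(0)
    loss_diversity = (sim.sum() - sim.diagonal().sum()) / (t * (t - 1))

    return loss_eos, loss_uncertainty, loss_diversity

# Toy usage with random logits standing in for real decoder outputs;
# the paper balances the terms with time-varying weights, fixed here.
logits = torch.randn(12, 32000, requires_grad=True)
l_eos, l_unc, l_div = verbose_image_losses(logits, eos_token_id=2)
(l_eos + l_unc + l_div).backward()
```

In the attack the paper describes, gradients like these would flow back to an imperceptible image perturbation (e.g., under an L-infinity budget) rather than to the logits themselves; that optimization loop is left to the linked repository.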