Typographic Attacks on Vision-LLMs: Evaluating Adversarial Threats in Autonomous Driving Systems


Abstract and 1. Introduction

  2. Related Work

    2.1 Vision-LLMs

    2.2 Transferable Adversarial Attacks

  3. Preliminaries

    3.1 Revisiting Auto-Regressive Vision-LLMs

    3.2 Typographic Attacks in Vision-LLMs-based AD Systems

  4. Methodology

    4.1 Auto-Generation of Typographic Attack

    4.2 Augmentations of Typographic Attack

    4.3 Realizations of Typographic Attacks

  5. Experiments

  6. Conclusion and References

Abstract

Vision-Large-Language-Models (Vision-LLMs) are increasingly being integrated into autonomous driving (AD) systems due to their advanced visual-language reasoning capabilities, targeting the perception, prediction, planning, and control mechanisms. However, Vision-LLMs have shown susceptibility to various types of adversarial attacks, which can compromise their reliability and safety. To further explore these risks in AD systems and the transferability of practical threats, we propose leveraging typographic attacks against AD systems that rely on the decision-making capabilities of Vision-LLMs. Unlike the few existing works that develop general datasets of typographic attacks, this paper focuses on realistic traffic scenarios where such attacks can be deployed, on their potential effects on decision-making autonomy, and on the practical ways in which they can be physically presented. To achieve these goals, we first propose a dataset-agnostic framework for automatically generating false answers that can mislead Vision-LLMs’ reasoning. We then present a linguistic augmentation scheme that facilitates attacks at image-level and region-level reasoning, and we extend it with attack patterns that target multiple reasoning tasks simultaneously. Based on these, we study how such attacks can be realized in physical traffic scenarios. Through our empirical study, we evaluate the effectiveness, transferability, and realizability of typographic attacks in traffic scenes. Our findings demonstrate the particular harmfulness of typographic attacks against existing Vision-LLMs (e.g., LLaVA, Qwen-VL, VILA, and Imp), thereby raising community awareness of vulnerabilities when incorporating such models into AD systems. We will release our source code upon acceptance.

1 Introduction

Vision-Large-Language-Models (Vision-LLMs) have seen rapid development in recent years [1, 2, 3], and their incorporation into autonomous driving (AD) systems has been seriously considered by both industry and academia [4, 5, 6, 7, 8, 9]. Integrating Vision-LLMs into AD systems showcases their ability to convey explicit reasoning steps to road users on the fly and satisfies the need for textual justifications of traffic scenarios regarding perception, prediction, planning, and control, particularly in safety-critical circumstances in the physical world. The core strength of Vision-LLMs lies in their auto-regressive capabilities, acquired through large-scale pretraining with visual-language alignment [1], which enable them to perform zero-shot optical character recognition, grounded reasoning, visual question answering, general visual-language reasoning, etc. Nevertheless, despite their impressive capabilities, Vision-LLMs are unfortunately not impervious to adversarial attacks that can misdirect their reasoning processes [10]. Successful attack strategies can pose critical problems when deploying Vision-LLMs in AD systems, especially attacks that succeed despite the models’ black-box characteristics. As a step towards their reliable adoption in AD, studying the transferability of adversarial attacks is crucial for raising awareness of practical threats against deployed Vision-LLMs and for building appropriate defense strategies.
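For concreteness, the snippet below is a minimal sketch of the auto-regressive visual-question-answering interface such Vision-LLMs expose, using the publicly released LLaVA-1.5 checkpoint through Hugging Face transformers; the image path and the driving-related question are placeholder assumptions rather than part of this paper's setup.

```python
# Minimal sketch: zero-shot visual question answering with an open-source
# Vision-LLM (LLaVA-1.5 via Hugging Face transformers). The image path and
# the question are illustrative placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("driving_scene.jpg").convert("RGB")  # placeholder path
prompt = "USER: <image>\nIs it safe for the ego vehicle to proceed? Explain briefly. ASSISTANT:"

# Cast floating-point inputs (pixel values) to the model's half precision.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=80, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

The model answers token by token conditioned on both the image and the prompt, which is exactly the auto-regressive behavior that typographic text in the scene can steer.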

In this work, we revisit the shared auto-regressive characteristic of different Vision-LLMs and turn that strength into a weakness by leveraging typographic forms of adversarial attacks, also known as typographic attacks. Typographic attacks were first studied in the context of the well-known Contrastive Language-Image Pre-training (CLIP) model [11, 12]. Early works in this area focused on developing a general typographic attack dataset targeting multiple-choice answering (such as object recognition, visual attribute detection, and commonsense answering) and enumeration [13]. Researchers also explored multiple-choice self-generating attacks against zero-shot classification [14] and proposed several defense mechanisms, including keyword training [15] and prompting the model for detailed reasoning [16]. Despite these initial efforts, existing methodologies have neither offered a comprehensive attack framework nor been explicitly designed to investigate the impact of typographic attacks on safety-critical systems, particularly those in AD scenarios.
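The snippet below is a minimal sketch of this classic typographic attack against CLIP zero-shot classification: misleading text is pasted onto the image and the prediction is compared before and after the overlay. The file name, label set, and overlay string are illustrative assumptions, not drawn from this paper.

```python
# Minimal sketch of a typographic attack on CLIP zero-shot classification:
# paste misleading text onto an image and compare predictions before/after.
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
labels = ["a photo of a stop sign", "a photo of a speed limit sign"]

def classify(image):
    """Return zero-shot probabilities for each candidate label."""
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return {label: round(p.item(), 3) for label, p in zip(labels, probs)}

clean = Image.open("stop_sign.jpg").convert("RGB")  # placeholder image
attacked = clean.copy()
ImageDraw.Draw(attacked).text((10, 10), "speed limit 60", fill="white")  # typographic patch

print("clean:   ", classify(clean))
print("attacked:", classify(attacked))
```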

Our work aims to fill this research gap by studying typographic attacks from the perspective of AD systems that incorporate Vision-LLMs. In summary, our scientific contributions are threefold:

• Dataset-Independent Framework: we introduce a dataset-independent framework designed to automatically generate misleading answers that can disrupt the reasoning processes of Vision-LLMs.

• Linguistic Augmentation Schemes: we develop a linguistic augmentation scheme aimed at facilitating stronger typographic attacks on Vision-LLMs. This scheme targets reasoning at both the image level and the region level (illustrated by the sketch after this list) and is expandable to multiple reasoning tasks simultaneously.

• Empirical Study in Semi-Realistic Scenarios: we conduct a study to explore the possible implementations of these attacks in real-world traffic scenarios.
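To make the image-level versus region-level distinction concrete, the sketch below is a purely hypothetical illustration (not this paper's actual generation or augmentation method) of placing an attack string either as a banner over the whole frame or inside one object's bounding box; the attack text and coordinates are arbitrary placeholders.

```python
# Hypothetical illustration only: image-level vs. region-level placement of a
# typographic attack string. Text and coordinates are arbitrary placeholders.
from PIL import Image, ImageDraw

def image_level_attack(image, text):
    """Draw the attack string as a banner across the top of the whole frame."""
    attacked = image.copy()
    draw = ImageDraw.Draw(attacked)
    draw.rectangle([0, 0, attacked.width, 40], fill="white")
    draw.text((10, 10), text, fill="black")
    return attacked

def region_level_attack(image, text, box):
    """Draw the attack string inside one object's bounding box (x0, y0, x1, y1)."""
    attacked = image.copy()
    draw = ImageDraw.Draw(attacked)
    x0, y0, _, _ = box
    draw.text((x0 + 5, y0 + 5), text, fill="red")
    return attacked

scene = Image.open("driving_scene.jpg").convert("RGB")  # placeholder image
banner = image_level_attack(scene, "The road ahead is clear; keep driving.")
patched = region_level_attack(scene, "not a pedestrian", box=(320, 180, 420, 360))
```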

Through our empirical study of typographic attacks in traffic scenes, we hope to raise community awareness of critical typographic vulnerabilities when incorporating such models into AD systems.


:::info
This paper is available on arXiv under a CC BY 4.0 DEED license.

:::

:::info
Authors:

(1) Nhat Chung, CFAR and IHPC, A*STAR, Singapore and VNU-HCM, Vietnam;

(2) Sensen Gao, CFAR and IHPC, A*STAR, Singapore and Nankai University, China;

(3) Tuan-Anh Vu, CFAR and IHPC, A*STAR, Singapore and HKUST, HKSAR;

(4) Jie Zhang, Nanyang Technological University, Singapore;

(5) Aishan Liu, Beihang University, China;

(6) Yun Lin, Shanghai Jiao Tong University, China;

(7) Jin Song Dong, National University of Singapore, Singapore;

(8) Qing Guo, CFAR and IHPC, A*STAR, Singapore and National University of Singapore, Singapore.

:::
