Publications

Research papers and ongoing work in adversarial ML, AI safety, and mechanistic interpretability.

Under Review · 2026

Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion LMs

Arth Singh

ACL TrustNLP 2026

We demonstrate that diffusion-based language models exhibit a fundamental vulnerability: the denoising process is irreversible, and strategic re-masking of intermediate tokens can redirect generation toward adversarial outputs.

Adversarial ML · Diffusion Models · NLP

Under Review · 2026

Mechanistically Interpreting Compression in Vision-Language Models

Veeraraju Elluru, Arth Singh, Roberto Aguero, Ajay Agarwal, Debojyoti Das, Hreetam Paul

ACL SRW 2026 (under review) · arXiv preprint

We apply mechanistic interpretability techniques to understand how vision-language models compress visual information, identifying key circuits and attention patterns.

Mechanistic Interpretability · VLMs · Compression

Ongoing · 2026

The Readability vs. Controllability Gap: Rethinking Where Safety Lives in LLMs

Arth Singh

ICML Mechanistic Interpretability Workshop 2026 (target)

Safety features in LLMs are most readable in final layers but most controllable much earlier — a gap of 16-47% of model depth. A dual-level attack exploiting this gap achieves 92-97% ASR across four model families.

Mechanistic Interpretability · AI Safety · LLM Internals

Ongoing · 2026

SENTINEL: A Game-Theoretic Framework for Measuring Meaningful Human Control in Autonomous Weapons

Arth Singh

ICML AI4Good Workshop 2026 (target)

A game-theoretic framework for quantifying meaningful human control in autonomous weapons. The optimal human review window is 5-7s; rushed review (<2s) performs worse than full autonomy. Adversarial manipulation cuts the share of compliant policies from 11.9% to 5.1%.

AI Governance · Game Theory · Autonomous Systems

Ongoing · 2026

GART: Mobile GUI Agent Red Teaming

Kyochul Jang, Arth Singh, et al.

NeurIPS 2026 (target)

A systematic red-teaming framework for evaluating the robustness and safety of mobile GUI agents against adversarial attacks.

Red Teaming · GUI Agents · Mobile

Preprint · 2026

Does Moral Reasoning Training Help or Hurt? Red-Teaming RL-Trained Ethical Agents

Arth Singh

Preprint

We red-team language models fine-tuned with reinforcement learning for moral reasoning, finding that ethical training can paradoxically create new attack surfaces.

Red Teaming · RL · AI Ethics

Ongoing · 2026

EGOx: A Novel VLM Benchmark for Physical AI Safety

Arth Singh, et al.

Physical AI Workshop @ IJCAI 2026 (target)

Building a novel VLM benchmark and automated annotation pipeline for evaluating physical AI safety — measuring how vision-language models understand and reason about hazards in real-world environments.

Physical AI · VLM Benchmark · Safety