Behavior Control of Generative AI
We study the high-dimensional latent space of LLMs to make model behavior causally interpretable and steerable. Our research on AI control and interpretability develops representation engineering, latent activation steering, and online machine unlearning algorithms, with the goal of aligning models with human intent in ways that can be verified.
AI Security and Privacy
We build defenses for AI systems operating in contested environments. To improve adversarial robustness, we design algorithms that mitigate training data poisoning and backdoor injection, protect interactive web agents from prompt injection, and secure continual learning streams against long-term model corruption.
AI Authenticity and Synthetic Media
We investigate the vulnerabilities of generative and classification models to help verify the authenticity of content. Our work includes benchmarking detectors of deepfake text and audio, studying the linguistic sensitivity of audio LMs, and stress-testing reasoning pathways, with the aim of building detection models that remain robust against adaptive, AI-driven manipulation.
Explainable AI (XAI)
We advance the transparency, stability, and reliability of machine learning models. Our work focuses on black-box explanation methods, counterfactual and alterfactual reasoning, and continuous verification of automated explanations. Together, these address the need for operational interpretability that remains resilient under adversarial diagnostics.