Secure & Trustworthy AI

When your model
becomes the attack surface

A running series on adversarial machine learning, privacy, and what it actually takes to deploy AI systems you can trust.

All Posts
Fairness Is Not One Number
Three incompatible fairness definitions. A court ruling. And why you can't satisfy them all at once — a mathematical impossibility, not an engineering problem.
Jailbreaking LLMs: The Attack Taxonomy Nobody Agrees On
From hand-crafted prompts to RL-driven token search and embedding space attacks. NeurIPS 2024 research shows the surface is bigger, and the defenses less set...
Certified Robustness: Proofs Instead of Benchmarks
Adversarial training only certifies what you tested. Certified defenses prove no adversarial example can exist within a given radius. The gap between the two...
PAC Privacy: A Better Way to Measure What Your Model Actually Leaks
Differential privacy has a painful problem: the privacy-utility tradeoff is severe, especially for large models. PAC Privacy is a newer framework that asks a...
Differential Privacy: What a Formal Guarantee Actually Looks Like
Most privacy mechanisms are heuristic. Differential privacy is different — it's a rigorous mathematical definition of what it means for a computation to leak...
Your Model Remembers Things It Shouldn't
A trained model isn't just a classifier — it's a compressed version of its training data. That compression is lossy, but not lossy enough. Privacy attacks ex...
Evasion Attacks and the Geometry of Adversarial Examples
Add an imperceptible noise pattern to an image, and a state-of-the-art classifier confidently misidentifies it. This isn't a bug in one model — it's somethin...
Explainability: Why Knowing Isn't the Same as Understanding
SHAP, saliency maps, counterfactuals. The tools exist. Whether they actually explain anything — or just make us feel better — is a different question entirely.
Poisoning Attacks: Corrupting a Model Before It Exists
The most elegant class of adversarial attacks doesn't touch the model at all — it corrupts the data the model learns from. By the time you notice, the attack is alre...
The Threat Model Nobody Talks About in Intro ML
You spend months learning how to train a model. Nobody tells you that a well-trained model is also a well-optimized attack surface. Here's what that actually...