Jan 2026
Fairness Is Not One Number
Three incompatible fairness definitions. A court ruling. And why you can't satisfy them all at once — a mathematical impossibility, not an engineering problem.
fairness
ethics
bias
NeurIPS-2024
Jan 2026
Jailbreaking LLMs: The Attack Taxonomy Nobody Agrees On
From hand-crafted prompts to RL-driven token search and embedding space attacks. NeurIPS 2024 research shows the surface is bigger, and the defenses less set...
attacks
LLMs
jailbreak
NeurIPS-2024
Dec 2025
Certified Robustness: Proofs Instead of Benchmarks
Adversarial training only certifies what you tested. Certified defenses prove no adversarial example can exist within a given radius. The gap between the two...
defenses
certification
randomized-smoothing
NeurIPS-2024
Dec 2025
PAC Privacy: A Better Way to Measure What Your Model Actually Leaks
Differential privacy has a painful problem: the privacy-utility tradeoff is severe, especially for large models. PAC Privacy is a newer framework that asks a...
privacy
PAC-privacy
theory
differential-privacy
Dec 2025
Differential Privacy: What a Formal Guarantee Actually Looks Like
Most privacy mechanisms are heuristic. Differential privacy is different — it's a rigorous mathematical definition of what it means for a computation to leak...
privacy
differential-privacy
DP-SGD
theory
Nov 2025
Your Model Remembers Things It Shouldn't
A trained model isn't just a classifier — it's a compressed version of its training data. That compression is lossy, but not lossy enough. Privacy attacks ex...
privacy
membership-inference
model-inversion
attacks
Nov 2025
Evasion Attacks and the Geometry of Adversarial Examples
Add an imperceptible noise pattern to an image, and a state-of-the-art classifier confidently misidentifies it. This isn't a bug in one model — it's something...
attacks
evasion
FGSM
gradient
Nov 2025
Explainability: Why Knowing Isn't the Same as Understanding
SHAP, saliency maps, counterfactuals. The tools exist. Whether they actually explain anything — or just make us feel better — is a different question entirely.
explainability
SHAP
XAI
saliency
Nov 2025
Poisoning Attacks: Corrupting a Model Before It Exists
The most elegant class of adversarial attacks doesn't touch the model at all — it corrupts the data the model learns from. By the time you notice, the attack is already...
attacks
poisoning
training-time
Nov 2025
The Threat Model Nobody Talks About in Intro ML
You spend months learning how to train a model. Nobody tells you that a well-trained model is also a well-optimized attack surface. Here's what that actually...
foundations
threat-models
attacks