Here’s the problem with adversarial training as a defense: you train on adversarial examples generated by attack A, but a different attacker using attack B might still succeed. You’ve raised the bar, but you haven’t proven anything. Certified defenses prove — mathematically — that no adversarial example can exist within a given perturbation radius, regardless of the attack.

Randomized Smoothing

The most scalable certified defense. Given a base classifier g, construct a smoothed classifier f by returning the most likely class under Gaussian noise:

Randomized Smoothing (Cohen et al., ICML 2019)
f(x) = argmax_c Pr_{z ~ N(0, σ²I)} [g(x + z) = c]

Certification radius: r_x = (σ/2) · (Φ⁻¹(p_A) − Φ⁻¹(p_B))

Guarantee: f(x + δ) = f(x) for all ‖δ‖₂ ≤ r_x

The radius r_x depends on the gap between the top and second class probabilities. Larger σ means larger certified radii but worse clean accuracy — the fundamental tradeoff.
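The procedure above can be sketched as a short Monte Carlo routine. This is a simplified sketch, not Cohen et al.'s full CERTIFY algorithm: empirical vote frequencies stand in for the Clopper-Pearson confidence bounds the real procedure uses, so the returned radius is an estimate rather than a rigorous certificate. The toy base classifier in the usage line is purely illustrative.

```python
import numpy as np
from scipy.stats import norm

def certify(base_classifier, x, sigma, n=1000, num_classes=10, rng=None):
    """Monte Carlo estimate of the smoothed prediction and its L2 radius.

    Simplified from Cohen et al. (2019): we plug empirical frequencies
    straight into r = (sigma/2) * (Phi^-1(p_A) - Phi^-1(p_B)) instead of
    using high-confidence bounds, so this radius is only an estimate.
    """
    rng = np.random.default_rng(rng)
    counts = np.zeros(num_classes, dtype=int)
    for _ in range(n):
        z = rng.normal(0.0, sigma, size=x.shape)   # Gaussian noise z ~ N(0, sigma^2 I)
        counts[base_classifier(x + z)] += 1        # vote of the base classifier g
    order = np.argsort(counts)[::-1]
    # Clip frequencies away from {0, 1} so the Gaussian quantile stays finite.
    p_a = min(counts[order[0]] / n, 1 - 1 / n)     # top-class probability estimate
    p_b = max(counts[order[1]] / n, 1 / n)         # runner-up probability estimate
    radius = (sigma / 2) * (norm.ppf(p_a) - norm.ppf(p_b))
    return int(order[0]), radius

# Toy usage: a linear base classifier on a 2-D input.
pred, r = certify(lambda v: int(v.sum() > 0), np.array([1.0, 1.0]),
                  sigma=0.5, num_classes=2, rng=0)
```

Because the input sits well inside the decision region, the vote is nearly unanimous and the estimated radius is comfortably positive; near the decision boundary p_A and p_B converge and the radius shrinks toward zero.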

NeurIPS 2024: Scaling with Diffusion Data

A NeurIPS 2024 paper (Müller et al.) showed that generating additional training data using state-of-the-art diffusion models substantially improves deterministic certified defenses. This mirrors how diffusion-generated data was previously shown to improve empirical adversarial training.

But the paper also reveals an important difference: certified robustness is considerably harder to scale than empirical robustness. Once data saturation is reached, further gains require better algorithms or larger models — you can’t just generate more data. This is a meaningful constraint for practitioners hoping to close the clean/certified accuracy gap.

Multi-Step Certified Defenses

A second NeurIPS 2024 paper (Certified Adversarial Robustness for Multi-Step Defences) addresses a limitation of standard Randomized Smoothing: it is static at test time even though attacks adapt. The proposed Adaptive Randomized Smoothing (ARS) uses a two-step defense: first, compute an input mask that focuses on task-relevant information; second, apply RS on the dimensionality-reduced input. The key connection: RS can be recast in terms of f-DP (a notion of differential privacy), and the f-DP composition of the two steps yields tighter certificates than treating them separately.
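The two-step structure can be sketched as follows. This is a hypothetical illustration of the mask-then-smooth shape only: the function names and the `compute_mask` helper are assumptions, not the paper's algorithm, and the f-DP accounting that turns the composition into an actual certificate is omitted.

```python
import numpy as np

def adaptive_smooth(base_classifier, compute_mask, x, sigma, n=500,
                    num_classes=10, rng=None):
    """Hypothetical sketch of a two-step defense in the spirit of ARS.

    Step 1: derive a mask keeping task-relevant input coordinates.
    Step 2: run randomized smoothing on the masked, reduced input.
    The f-DP certificate computation is deliberately omitted.
    """
    rng = np.random.default_rng(rng)
    mask = compute_mask(x)                              # step 1: input mask
    counts = np.zeros(num_classes, dtype=int)
    for _ in range(n):
        z = rng.normal(0.0, sigma, size=x.shape)
        counts[base_classifier((x + z) * mask)] += 1    # step 2: RS on masked input
    return int(np.argmax(counts))

# Toy usage: the mask zeroes out a distractor coordinate, so smoothing
# only has to average over noise on the informative one.
pred = adaptive_smooth(lambda v: int(v[0] > 0),          # base classifier
                       lambda x: np.array([1.0, 0.0]),   # keep dim 0, drop dim 1
                       np.array([2.0, -5.0]),
                       sigma=0.5, num_classes=2, rng=0)
```

Intuitively, masking shrinks the dimensionality the noise must cover, which is where the tighter certificates come from.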

The Clean/Certified Tradeoff

No certified defense avoids this: higher certified accuracy costs clean accuracy. The Pareto frontier — the best achievable (clean, certified) pairs — shifts with the defense method. Current state of the art on CIFAR-10 at ε = 8/255 achieves roughly 60–70% clean accuracy with ~35–40% certified accuracy. Undefended models achieve ~95% clean accuracy. That gap is the cost of the guarantee.
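The σ side of this tradeoff can be made concrete with the certified-radius formula from earlier. The probability values below are hypothetical, chosen only to illustrate the mechanism: at a fixed confidence gap the radius scales linearly with σ, but in practice larger σ degrades the base classifier under noise, shrinking p_A and eating the gain.

```python
from scipy.stats import norm

def certified_radius(sigma, p_a, p_b):
    # r_x = (sigma / 2) * (Phi^-1(p_A) - Phi^-1(p_B)), as in Cohen et al.
    return (sigma / 2) * (norm.ppf(p_a) - norm.ppf(p_b))

# With the confidence gap held fixed, the radius grows linearly in sigma:
for sigma in (0.25, 0.5, 1.0):
    print(f"sigma={sigma}: r={certified_radius(sigma, 0.9, 0.1):.3f}")

# But larger sigma hurts accuracy under noise. If (hypothetically) doubling
# sigma drops p_A from 0.9 to 0.6, the certified radius actually shrinks:
small_sigma_r = certified_radius(0.5, 0.9, 0.1)
large_sigma_r = certified_radius(1.0, 0.6, 0.4)
```

This is the same tension that shows up at the benchmark level as the clean/certified gap: the noise that buys the guarantee is the noise that costs accuracy.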


References: Cohen et al., ICML 2019; Müller et al., NeurIPS 2024; Zhang et al., NeurIPS 2024 (multi-step ARS); Madry et al., ICLR 2018.