Here’s the problem with adversarial training as a defense: you train on adversarial examples generated by attack A, but a different attacker using attack B might still succeed. You’ve raised the bar, but you haven’t proven anything. Certified defenses prove — mathematically — that no adversarial example can exist within a given perturbation radius, regardless of the attack.
Randomized Smoothing
The most scalable certified defense. Given a base classifier g, construct a smoothed classifier f that returns the class g is most likely to predict under Gaussian noise:

f(x) = argmax_c Pr[ g(x + ε) = c ],   ε ~ N(0, σ²I)
The certified L2 radius r_x = (σ/2)·(Φ⁻¹(p_A) − Φ⁻¹(p_B)) depends on the gap between the top-class probability p_A and the runner-up probability p_B: the more confidently the smoothed classifier prefers the top class, the larger the radius. Larger σ yields larger certified radii but worse clean accuracy, which is the fundamental tradeoff.
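A minimal sketch of the prediction-and-certification loop, in the spirit of Cohen et al. (2019). The base classifier here is a hypothetical toy model, and the probabilities are plain Monte Carlo estimates without the confidence-interval correction the paper uses, so this illustrates the mechanics rather than a deployable certifier:

```python
# Randomized smoothing sketch: estimate f(x) and the certified L2 radius
# r_x = sigma/2 * (Phi^-1(pA) - Phi^-1(pB)) by sampling Gaussian noise.
import numpy as np
from scipy.stats import norm

def base_classifier(x):
    # Hypothetical toy base classifier g: class 0 if the mean input is negative.
    return 0 if x.mean() < 0 else 1

def smoothed_predict_and_radius(x, sigma=0.5, n=1000, seed=0):
    """Monte Carlo estimate of the smoothed classifier f(x) and its radius.

    Note: real certification replaces the raw estimates pA, pB with
    statistical lower/upper confidence bounds before computing the radius.
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(2, dtype=int)
    for _ in range(n):
        noisy = x + sigma * rng.standard_normal(x.shape)  # epsilon ~ N(0, sigma^2 I)
        counts[base_classifier(noisy)] += 1
    probs = counts / n
    p_top, p_runner_up = np.sort(probs)[::-1][:2]
    radius = sigma / 2 * (norm.ppf(p_top) - norm.ppf(p_runner_up))
    return int(np.argmax(probs)), radius

x = -0.3 * np.ones(4)                 # toy 4-dim input leaning toward class 0
cls, r = smoothed_predict_and_radius(x)  # class 0, with a finite positive radius
```

The σ tradeoff is visible directly in the formula: scaling σ up scales the radius linearly, but the added noise also degrades the base classifier's accuracy, shrinking the probability gap.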
NeurIPS 2024: Scaling with Diffusion Data
A NeurIPS 2024 paper (Müller et al.) showed that generating additional training data using state-of-the-art diffusion models substantially improves deterministic certified defenses. This mirrors how diffusion-generated data was previously shown to improve empirical adversarial training.
But the paper also reveals important differences: certified robustness is considerably harder to scale than empirical robustness. Once data saturation is reached, further gains require better algorithms or larger models — you can’t just generate more data. This is a meaningful constraint for practitioners hoping to close the clean/certified accuracy gap.
Multi-Step Certified Defenses
A second NeurIPS 2024 paper (Certified Adversarial Robustness for Multi-Step Defences) addresses a limitation of standard Randomized Smoothing: it is static at test time even though attacks adapt. The proposed Adaptive Randomized Smoothing (ARS) uses a two-step defense: first, compute an input mask that focuses on task-relevant information; second, apply RS to the dimensionality-reduced input. The key connection is that RS can be analyzed through f-DP (a functional notion of differential privacy), which yields tighter certificates for the composed two-step procedure.
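The two-step structure can be sketched as follows. This is a schematic illustration only, not the paper's algorithm: the mask heuristic, the helper names (compute_mask, base_classifier), and all parameters are assumptions made for the sketch:

```python
# Schematic two-step defense in the spirit of Adaptive Randomized Smoothing:
# step 1 computes an input mask (here, a naive keep-the-largest-magnitudes
# heuristic standing in for the paper's learned, task-relevant mask),
# step 2 runs ordinary randomized smoothing on the reduced input.
import numpy as np

def compute_mask(x, keep_frac=0.5):
    # Hypothetical mask: keep the dimensions with the largest magnitude.
    k = max(1, int(keep_frac * x.size))
    keep = np.argsort(np.abs(x))[::-1][:k]
    mask = np.zeros_like(x)
    mask[keep] = 1.0
    return mask

def base_classifier(x):
    # Toy base classifier: sign of the sum of the (masked) input.
    return 0 if x.sum() < 0 else 1

def adaptive_smoothed_predict(x, sigma=0.5, n=500, seed=0):
    mask = compute_mask(x)                # step 1: focus on relevant dims
    rng = np.random.default_rng(seed)
    counts = np.zeros(2, dtype=int)
    for _ in range(n):
        noisy = mask * (x + sigma * rng.standard_normal(x.shape))
        counts[base_classifier(noisy)] += 1   # step 2: RS on the reduced input
    return int(np.argmax(counts))
```

Intuitively, masking out irrelevant dimensions reduces the amount of noise the certificate has to account for, which is where the tighter f-DP-based analysis pays off.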
The Clean/Certified Tradeoff
No certified defense avoids this: higher certified accuracy costs clean accuracy. The Pareto frontier, the set of best achievable (clean, certified) accuracy pairs, shifts with the defense method. Current state of the art on CIFAR-10 at ε = 8/255 achieves roughly 60-70% clean accuracy with ~35-40% certified accuracy, while standard, undefended models reach ~95% clean accuracy. That gap is the cost of the guarantee.
References: Cohen et al. (2019) ICML; Müller et al. (2024) NeurIPS; Zhang et al. (2024) NeurIPS (multi-step ARS); Madry et al. (2018) ICLR.