Behavioral cloning has a well-known failure mode: small errors compound, and the policy drifts into states the expert never visited. The standard analysis bounds this drift as quadratic in the horizon — bad, but manageable. That bound assumes discrete actions. When actions are continuous, we show the picture is fundamentally worse: error grows exponentially in the horizon. Yet methods like ACT and diffusion policy routinely solve long-horizon manipulation tasks. What are they actually doing to survive?
The Quadratic Bound
Behavioral cloning is supervised learning on expert demonstrations: collect state-action pairs from expert rollouts, fit a policy to predict the expert's action from the state, and deploy that policy in closed loop.
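Concretely, the recipe can be sketched in a few lines. This is a minimal illustration, assuming linear least-squares as the supervised learner and a toy linear system (the expert, dynamics, and dimensions are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert: a linear feedback controller u = K @ s.
K = np.array([[-0.5, 0.1], [0.0, -0.4]])
expert = lambda s: K @ s

# 1. Collect state-action pairs from expert rollouts.
def rollout_expert(T=50):
    s = rng.normal(size=2)
    states, actions = [], []
    for _ in range(T):
        a = expert(s)
        states.append(s.copy()); actions.append(a)
        s = 0.9 * s + a + 0.01 * rng.normal(size=2)  # toy dynamics
    return np.array(states), np.array(actions)

S, A = map(np.vstack, zip(*[rollout_expert() for _ in range(20)]))

# 2. Fit a policy by regression (here: ordinary least squares).
K_hat, *_ = np.linalg.lstsq(S, A, rcond=None)
policy = lambda s: s @ K_hat

# 3. Deploy in closed loop: the learner now generates its own states.
s = rng.normal(size=2)
for _ in range(50):
    s = 0.9 * s + policy(s)
```

Step 3 is where the trouble starts: at deployment the policy's inputs come from its own previous actions, not from the expert's distribution.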
Ross and Bagnell (2010) formalized how bad this compounding can get. Suppose we want to imitate an expert policy $\pi^\star$ over a horizon of $T$ steps, where every step incurs a cost in $[0, 1]$, and we have trained a policy $\hat\pi$ on the expert's demonstrations.
Theorem (Ross & Bagnell, 2010). The excess cost of behavioral cloning over the expert grows quadratically in the horizon:

$$J(\hat\pi) \;\le\; J(\pi^\star) + T^2\,\epsilon,$$

where $\epsilon$ is the per-step imitation error (the probability that $\hat\pi$ disagrees with $\pi^\star$) under the expert's state distribution.
Proof (informal). At each of the $T$ steps, the policy has roughly an $\epsilon$ chance of deviating from the expert. Once it deviates, it's in unfamiliar states, potentially paying cost $1$ for every remaining step. An error at step $t$ can cause damage for all $T - t$ remaining steps, so the total expected cost is approximately $\epsilon T^2$.
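Summing the damage over all possible error times makes the quadratic explicit: an error at step $t$ occurs with probability about $\epsilon$ and costs up to $T - t$:

```latex
\sum_{t=1}^{T} \epsilon\,(T - t)
  \;=\; \epsilon \cdot \frac{T(T-1)}{2}
  \;=\; O(\epsilon\,T^{2}).
```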
Proof. Setup. We work with deterministic policies. Define:

- $C(s, a) \in [0, 1]$: immediate cost of taking action $a$ in state $s$
- $e(s, a) = \mathbf{1}[\,a \ne \pi^\star(s)\,]$: 0-1 error indicator
- $d_t$: state distribution at step $t$ under $\pi^\star$
- $d = \frac{1}{T} \sum_{t=1}^{T} d_t$: averaged state distribution
- $\epsilon_t = \mathbb{E}_{s \sim d_t}[\,e(s, \hat\pi(s))\,]$: per-step error at time $t$
- $J(\pi) = \sum_{t=1}^{T} \mathbb{E}_{s \sim d_t^{\pi}}[\,C(s, \pi(s))\,]$: expected $T$-step cost
- $p_t$: probability that $\hat\pi(s) = \pi^\star(s)$ at every state $s$ visited by $\pi^\star$ through the first $t$ steps

Claim. Let $\hat\pi$ satisfy $\mathbb{E}_{s \sim d}[\,e(s, \hat\pi(s))\,] = \epsilon$. Then $J(\hat\pi) \le J(\pi^\star) + T^2 \epsilon$.

Step 1: Decompose trajectories. Define:

- $d_t'$: state distribution at step $t$ under $\pi^\star$, conditioned on $\hat\pi$ agreeing at steps $1, \dots, t-1$
- $d_t''$: state distribution at step $t$ under $\pi^\star$, conditioned on $\hat\pi$ disagreeing at some step in $1, \dots, t-1$

By the law of total probability:

$$d_t \;=\; p_{t-1}\, d_t' + (1 - p_{t-1})\, d_t''.$$

Two consequences follow by dropping non-negative terms. First, expanding $\epsilon_t = \mathbb{E}_{s \sim d_t}[\,e(s, \hat\pi(s))\,]$ and dropping the $d_t''$ term:

$$p_{t-1}\, \mathbb{E}_{s \sim d_t'}[\,e(s, \hat\pi(s))\,] \;\le\; \epsilon_t. \qquad \text{(i)}$$

Second, expanding $J(\pi^\star) = \sum_{t=1}^{T} \mathbb{E}_{s \sim d_t}[\,C(s, \pi^\star(s))\,]$ and dropping the $d_t''$ terms:

$$\sum_{t=1}^{T} p_{t-1}\, \mathbb{E}_{s \sim d_t'}[\,C(s, \pi^\star(s))\,] \;\le\; J(\pi^\star). \qquad \text{(ii)}$$

Step 2: Bound per-step cost. Conditioned on agreement through step $t-1$, both policies took identical actions into identical dynamics, so $d_t'$ is also the learner's state distribution at step $t$ given no prior mistakes. With probability $p_{t-1}$ the learner is on-track and faces states from $d_t'$; otherwise it faces arbitrary states and pays at most $1$:

$$\mathbb{E}_{s \sim d_t^{\hat\pi}}[\,C(s, \hat\pi(s))\,] \;\le\; p_{t-1}\, \mathbb{E}_{s \sim d_t'}[\,C(s, \hat\pi(s))\,] + (1 - p_{t-1}). \qquad \text{(iii)}$$

Step 3: Bound on-track cost. When the learner is on the expert's trajectory, at each state it either matches the expert and pays $C(s, \pi^\star(s))$, or deviates and pays at most $1$. So $C(s, \hat\pi(s)) \le C(s, \pi^\star(s)) + e(s, \hat\pi(s))$. Taking expectations weighted by $p_{t-1}$ under $d_t'$ and applying (i):

$$p_{t-1}\, \mathbb{E}_{s \sim d_t'}[\,C(s, \hat\pi(s))\,] \;\le\; p_{t-1}\, \mathbb{E}_{s \sim d_t'}[\,C(s, \pi^\star(s))\,] + \epsilon_t. \qquad \text{(iv)}$$

Step 4: Bound the drift probability. Staying on-track through step $t$ requires being on-track through $t-1$ and not erring at $t$. By (i):

$$p_{t-1} - p_t \;=\; p_{t-1}\, \mathbb{E}_{s \sim d_t'}[\,e(s, \hat\pi(s))\,] \;\le\; \epsilon_t.$$

Unrolling from $p_0 = 1$ gives $p_t \ge 1 - \sum_{\tau=1}^{t} \epsilon_\tau$, so:

$$1 - p_{t-1} \;\le\; \sum_{\tau=1}^{t-1} \epsilon_\tau. \qquad \text{(v)}$$

Step 5: Assemble. Substituting (iv) into (iii) gives

$$\mathbb{E}_{s \sim d_t^{\hat\pi}}[\,C(s, \hat\pi(s))\,] \;\le\; p_{t-1}\, \mathbb{E}_{s \sim d_t'}[\,C(s, \pi^\star(s))\,] + \epsilon_t + (1 - p_{t-1}).$$

Bounding $1 - p_{t-1}$ by (v) and collecting the $\epsilon$ terms:

$$\mathbb{E}_{s \sim d_t^{\hat\pi}}[\,C(s, \hat\pi(s))\,] \;\le\; p_{t-1}\, \mathbb{E}_{s \sim d_t'}[\,C(s, \pi^\star(s))\,] + \sum_{\tau=1}^{t} \epsilon_\tau.$$

Summing over $t$ and applying (ii) gives $J(\hat\pi) \le J(\pi^\star) + \sum_{t=1}^{T} \sum_{\tau=1}^{t} \epsilon_\tau$. Since $\sum_{\tau=1}^{t} \epsilon_\tau \le \sum_{\tau=1}^{T} \epsilon_\tau$ for all $t$, the double sum is at most $T \sum_{\tau=1}^{T} \epsilon_\tau$. And since $\frac{1}{T} \sum_{\tau=1}^{T} \epsilon_\tau = \epsilon$ by assumption, this is at most $T^2 \epsilon$:

$$J(\hat\pi) \;\le\; J(\pi^\star) + T^2 \epsilon. \qquad \blacksquare$$
This bound is tight. Consider three states $s_0, s_1, s_2$ and two actions. The trajectory starts at $s_0$. Action $a_1$ moves $s_0 \to s_1$ and keeps $s_1 \to s_1$; action $a_2$ moves any state to $s_2$, which is absorbing. Costs: $0$ in $s_0$ and $s_1$, $1$ in $s_2$. The expert always plays $a_1$, visiting $s_0$ once and $s_1$ for the remaining $T - 1$ steps, at total cost $0$. Now take a learner that matches the expert at $s_1$ but plays $a_2$ at $s_0$. Under the averaged expert distribution $d$, the state $s_0$ carries mass $1/T$, so $\epsilon = 1/T$. At deployment, though, the learner's very first action sends it to $s_2$, where it pays $1$ for the rest of the horizon.

So $J(\hat\pi) - J(\pi^\star) = T - 1 = \Theta(T^2 \epsilon)$: a single error at a state that training weights by $1/T$ forfeits the entire trajectory.
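This worst case is easy to check numerically. The toy MDP below (constructed for illustration) has a start state the expert visits exactly once; the learner errs only there, so its training error is $1/T$, yet that one error costs the whole horizon:

```python
T = 100

# States: 0 = start, 1 = expert's loop (cost 0), 2 = absorbing bad state (cost 1).
# Action 0 follows the expert (0 -> 1 -> 1 -> ...); action 1 leads to state 2.
def step(s, a):
    return 2 if (a == 1 or s == 2) else 1

def cost(s):
    return 1.0 if s == 2 else 0.0

def expert(s):
    return 0

def learner(s):
    return 1 if s == 0 else 0   # errs only at the start state

def J(policy):
    s, total = 0, 0.0
    for _ in range(T):
        total += cost(s)
        s = step(s, policy(s))
    return total

eps = 1.0 / T                   # state 0 has mass 1/T under the expert
gap = J(learner) - J(expert)    # = T - 1, saturating the T^2 * eps bound
```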
The Exponential Bound
Now suppose actions are continuous — the learner outputs a vector, not a discrete choice. The setup is the same: observe expert trajectories, train a policy, deploy it in closed loop. But error is no longer 0-1. The policy can't exactly match the expert's action; there is always some residual error $\epsilon > 0$, however small.
Simchowitz, Pfrommer, and Jadbabaie (2025) show that for a broad class of policies they call simple, error compounds exponentially in the horizon. For continuous states and actions, a policy is simple if it is:

- Smooth. The deterministic component of the policy is Lipschitz and twice differentiable with bounded second derivatives.
- Simply-stochastic. The noise shape doesn't depend on state — only the mean shifts with the input. Deterministic policies and Gaussians with fixed covariance both qualify.
- Markovian. The policy maps the current state to an action with no dependence on history or timestep.
This is not a contrived class — it is standard behavioral cloning. Any neural network trained with mean-squared-error regression and deployed deterministically, or with a fixed amount of Gaussian action noise, is simple.
Theorem (informal; Simchowitz et al., 2025). Even when the environment is very stable, for any simple algorithm there exists a task where its error grows exponentially in the horizon:

$$J(\hat\pi) - J(\pi^\star) \;\ge\; e^{\Omega(T)}\,\epsilon,$$

where $\epsilon$ is the per-step imitation error under the expert's state distribution.
Theorem (Simchowitz et al., 2025). Setup. States $x_t \in \mathbb{R}^d$, actions $u_t \in \mathbb{R}^m$, deterministic dynamics $x_{t+1} = f(x_t, u_t)$. The dynamics are exponentially incrementally input-to-state stable (E-IISS): there exist $c \ge 1$, $\rho \in (0, 1)$ such that for any two state-input trajectories,

$$\|x_t - x_t'\| \;\le\; c\,\rho^{t}\,\|x_0 - x_0'\| \;+\; c \sum_{\tau=0}^{t-1} \rho^{\,t-1-\tau}\, \|u_\tau - u_\tau'\|.$$

Initial state errors and past input perturbations both decay exponentially with age. The system actively contracts disturbances.
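To make the contraction concrete, here is a scalar example (constants chosen for illustration): the map $x_{t+1} = 0.5\,x_t + u_t$ satisfies the definition with $c = 1$, $\rho = 0.5$, so a one-step input kick decays geometrically with age:

```python
# Two trajectories of x_{t+1} = 0.5 x_t + u_t that share the input
# sequence except for a single kick at t = 0; the induced state gap
# then decays like 0.5**t.
rho = 0.5
x, x_kicked = 0.0, 0.0
gaps = []
for t in range(10):
    u = 0.3                        # shared nominal input
    kick = 1.0 if t == 0 else 0.0  # one-step input perturbation
    x        = rho * x        + u
    x_kicked = rho * x_kicked + u + kick
    gaps.append(abs(x_kicked - x))

# gaps[t] = 0.5**t : the disturbance contracts exponentially.
```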
The expert-distribution error $\varepsilon$ measures imitation quality on expert states:

$$\varepsilon(\hat\pi) \;=\; \frac{1}{T} \sum_{t=1}^{T}\, \mathbb{E}_{x \sim d_t^{\pi^\star}}\, \|\hat\pi(x) - \pi^\star(x)\|.$$

Here $d_t^{\pi^\star}$ is the state distribution at step $t$ under expert rollouts. $\varepsilon$ is exactly what behavioral cloning minimizes, and it says exactly nothing about closed-loop behavior.

Theorem. Fix a state dimension $d$ and a smoothness level $s$, and let $n$ be the number of expert demonstrations. There exists a family of E-IISS instances with uniformly bounded stability constants $(c, \rho)$, smooth dynamics, and $s$-smooth deterministic experts such that:

(a) A proper, simple algorithm achieves training error $\varepsilon \lesssim n^{-s/d}$.

(b) For any simple algorithm, there is an instance in the family on which:

$$J(\hat\pi) - J(\pi^\star) \;\ge\; e^{\Omega(T)} \cdot \big(\, n^{-s/d} \,\vee\, \varepsilon_{\mathrm{floor}} \,\big).$$

The pair $(d, s)$ parameterizes state dimension and expert smoothness, and $n^{-s/d}$ is the minimax rate for learning an $s$-smooth function on $[0,1]^d$ from $n$ noiseless samples. So (a) says training is as statistically tractable as possible, and (b) says it still blows up exponentially at deployment. The $\varepsilon_{\mathrm{floor}}$ term is a saturation floor for very smooth learners.
Proof (informal). Consider two 2D linear systems with coordinates $(x, y)$. In both, the expert drives the state to zero along the $x$-axis while holding $y = 0$. The demonstrations from the two systems are identical, and every demo has $y = 0$ throughout. The systems secretly differ off-axis: stabilizing a perturbation in $y$ requires a negative-sign feedback for system A and a positive-sign feedback for system B. Each expert applies the right sign, but since $y = 0$ on every demo, the data never reveals it. The learner must therefore extrapolate, and smoothness forces it to commit to a single sign: stabilizing one of the two systems and destabilizing the other.

At deployment, any slight drift off the $x$-axis on the bad system triggers a wrong-sign correction, which pushes further off, which triggers a larger wrong correction. The deviation grows geometrically through closed-loop feedback — that is the $e^{\Omega(T)}$.
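The geometric blowup is visible in a few lines of simulation (constants invented for illustration). Take off-axis dynamics $y_{t+1} = 0.5\,y_t + u_t$, a correct-sign feedback $u = -0.4\,y$ (closed-loop gain $0.1$), and a wrong-sign one $u = +0.8\,y$ (closed-loop gain $1.3$):

```python
a = 0.5                       # open-loop y-dynamics: stable on its own

def rollout(gain, T=80, y0=1e-9):
    y = y0                    # a tiny drift off the x-axis
    for _ in range(T):
        y = a * y + gain * y  # learned feedback u = gain * y
    return abs(y)

y_good = rollout(gain=-0.4)   # contracts: |y| ~ 1e-9 * 0.1**80
y_bad  = rollout(gain=+0.8)   # expands:  |y| ~ 1e-9 * 1.3**80, order one
```

The open-loop system is stable; only the closed loop with the wrong-sign policy is unstable, which is why the E-IISS assumption on the dynamics doesn't save the learner.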
The bound applies to any non-interactive algorithm — behavioral cloning, offline RL, inverse RL — provided the returned policy is simple: smooth, simply-stochastic, and Markovian. Read the other way, the assumptions are a blueprint: violate any one and the exponential blowup is no longer forced.
Breaking the Bound
The bound rests on four assumptions — non-interactivity, smoothness, simple stochasticity, and the Markov property — and each is a potential axis of attack. The three methods below predate the theory; they were discovered empirically, and the bound arrived later to explain why they work. Read forward, though, the framing is generative: the fourth axis, still unbroken, is where new algorithms might live.
DAgger — short for Dataset Aggregation — breaks the non-interactive assumption. Instead of training only on expert demonstrations, DAgger rolls out the learned policy, visits states the learner actually encounters, and queries the expert for the correct action at those states. The loss is now evaluated under the learner's own state distribution rather than the expert's, closing the mismatch that drives compounding and bringing the excess cost from quadratic down to linear in the horizon.
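A minimal sketch of the loop (the environment, expert, and `fit` routine here are stand-ins, not any particular library's API):

```python
import numpy as np

def dagger(env_step, expert_action, fit, iters=5, T=50, s0=None):
    """Roll out the CURRENT policy, but label every visited state with
    the EXPERT's action; refit on the aggregated dataset each round."""
    states, labels = [], []
    policy = expert_action                    # round 0 behaves like the expert
    for _ in range(iters):
        s = np.ones(2) if s0 is None else np.array(s0, dtype=float)
        for _ in range(T):
            states.append(s.copy())
            labels.append(expert_action(s))   # the expert supervises...
            s = env_step(s, policy(s))        # ...but the learner drives
        policy = fit(np.array(states), np.array(labels))
    return policy
```

The crucial pairing is inside the inner loop: actions come from `policy`, labels from `expert_action`, so the training distribution tracks the learner's own visitation.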
Action chunking breaks the Markov property by predicting a chunk of $k$ future actions from the current observation and executing them in sequence. The policy's output depends on where the chunk started, not only on the instantaneous state, and the learner makes only $T/k$ closed-loop decisions instead of $T$, giving feedback fewer chances to amplify error. This is a key ingredient in methods like ACT.
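The execution pattern is independent of the network that fills it in. A sketch, with `predict_chunk` as a hypothetical stand-in for the policy's sequence head:

```python
def rollout_chunked(s, env_step, predict_chunk, T=100, k=8):
    """One closed-loop decision per k steps: T/k chances for feedback
    to amplify error, instead of T."""
    t = 0
    while t < T:
        chunk = predict_chunk(s)            # k actions, committed at once
        for a in chunk[: min(k, T - t)]:    # executed open-loop
            s = env_step(s, a)
            t += 1
    return s
```

Receding-horizon variants re-plan before the chunk ends and blend the overlap, but the Markov violation is the same: the action at time $t$ depends on the state at the last re-planning time, not on the current state.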
Diffusion policies break simple stochasticity. A simply-stochastic policy can shift the mean of its output distribution with state but not reshape it. The lower bound exploits exactly this: two systems require opposite feedback signs, and a simply-stochastic policy must commit to one sign everywhere, destabilizing the other. A diffusion policy’s output distribution changes shape with the input state, so it can maintain both signs as separate modes and sample contextually in the ambiguous region.
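The distinction in miniature (toy distributions, invented for illustration): both policies below condition on the off-axis coordinate $y$, but only the mixture can hold two modes, one per feedback sign, in the ambiguous region:

```python
import numpy as np

rng = np.random.default_rng(0)

def simply_stochastic(y, n=4000):
    # state shifts the mean; the noise shape is fixed -> always one mode
    return -0.4 * y + 0.1 * rng.normal(size=n)

def mixture(y, n=4000):
    # state reshapes the distribution: both feedback signs survive as
    # separate modes (the capability diffusion sampling makes practical)
    sign = rng.choice([-1.0, 1.0], size=n)
    return sign * 0.4 * y + 0.01 * rng.normal(size=n)

uni = simply_stochastic(1.0)   # unimodal, centered near -0.4
bi  = mixture(1.0)             # bimodal, modes near -0.4 and +0.4
```

A real diffusion policy produces the state-conditioned multimodality by iterative denoising rather than an explicit mixture, but the closed-loop consequence is the same: the output distribution's shape is a function of the state.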
References
- Ross, S. & Bagnell, J. A. (2010). "Efficient Reductions for Imitation Learning." Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 9:661-668.
- Simchowitz, M., Pfrommer, T., & Jadbabaie, A. (2025). “The Pitfalls of Imitation Learning when Actions are Continuous.” arXiv:2503.09722