Why Continuous Behavioral Cloning Fails Exponentially (and How Robotics Gets Away With It)
robotics imitation-learning

Behavioral cloning has a well-known failure mode: small errors compound, and the policy drifts into states the expert never visited. The standard analysis bounds this drift as quadratic in the horizon — bad, but manageable. That bound assumes discrete actions. When actions are continuous, recent work shows the picture is fundamentally worse: error grows exponentially in the horizon. Yet methods like ACT and diffusion policy routinely solve long-horizon manipulation tasks. What are they actually doing to survive?

The Quadratic Bound

Behavioral cloning is supervised learning on expert demonstrations: collect state–action pairs $(s, a)$ and train a policy to predict the action from the state. The trouble is that at test time the policy's predictions determine which states it sees next. A small mistake pushes it into unfamiliar states, causing further mistakes. This is covariate shift, and the drift is self-reinforcing.

Ross and Bagnell (2010) formalized how bad this compounding can get. Suppose we want to imitate an expert policy $\pi^*$ over a $T$-step task. We train a policy $\hat\pi$ by behavioral cloning, and it achieves a small per-step error rate: on states the expert actually visits, $\hat\pi$ disagrees with the expert with probability at most $\epsilon$. Small $\epsilon$ means $\hat\pi$ is good at imitating the expert. But is it good at the actual task? Write $J(\pi)$ for the expected total cost of running a policy $\pi$ for $T$ steps on its own — no expert to correct it. How much can $J(\hat\pi)$ exceed $J(\pi^*)$?

Theorem (Ross & Bagnell, 2010)

The excess cost of behavioral cloning over the expert grows quadratically in the horizon:

$$J(\hat\pi) \;\le\; J(\pi^*) + T^2 \epsilon$$

where $\epsilon = \mathbb{E}_{s \sim \bar d_{\pi^*}}\big[\mathbf{1}\{\hat\pi(s) \neq \pi^*(s)\}\big]$ is the per-step imitation error under the expert's state distribution.

Proof (informal)

At each of the $T$ steps, the policy has roughly an $\epsilon$ chance of deviating from the expert. Once it deviates, it's in unfamiliar states, potentially paying cost $1$ for every remaining step. An error at step $t$ can cause damage for all $T - t$ remaining steps, so the total expected cost is approximately $\sum_{t=1}^{T} \epsilon\,(T - t) \approx T^2 \epsilon$.

Proof

Setup. We work with deterministic policies. Define:

  • $C(s, a) \in [0, 1]$: immediate cost of taking action $a$ in state $s$
  • $e(s) = \mathbf{1}\{\hat\pi(s) \neq \pi^*(s)\}$: 0-1 error indicator
  • $d_t^{\pi}$: state distribution at step $t$ under $\pi$
  • $\bar d_{\pi} = \frac{1}{T} \sum_{t=1}^{T} d_t^{\pi}$: averaged state distribution
  • $\epsilon_t = \mathbb{E}_{s \sim d_t^{\pi^*}}[e(s)]$: per-step error at time $t$
  • $J(\pi) = \sum_{t=1}^{T} \mathbb{E}_{s \sim d_t^{\pi}}[C(s, \pi(s))]$: expected $T$-step cost
  • $p_t$: probability that $\hat\pi(s) = \pi^*(s)$ at every state visited by $\pi^*$ through steps $1, \dots, t$

Claim. Let $\hat\pi$ satisfy $\mathbb{E}_{s \sim \bar d_{\pi^*}}[e(s)] = \epsilon$. Then $J(\hat\pi) \le J(\pi^*) + T^2 \epsilon$.

Step 1: Decompose trajectories. Define:

  • $d_t$: state distribution at step $t$ under $\pi^*$, conditioned on $\hat\pi$ agreeing with the expert at steps $1, \dots, t-1$
  • $d_t'$: state distribution at step $t$ under $\pi^*$, conditioned on $\hat\pi$ disagreeing at some step in $1, \dots, t-1$

By the law of total probability:

$$d_t^{\pi^*} \;=\; p_{t-1}\, d_t \;+\; (1 - p_{t-1})\, d_t'$$

Dropping the non-negative disagreement term gives a comparison we will use repeatedly:

$$p_{t-1}\, d_t(s) \;\le\; d_t^{\pi^*}(s) \quad \text{for every state } s \tag{i}$$

Step 2: Bound per-step cost. Conditioned on agreement through step $t-1$, both policies took identical actions into identical dynamics, so $d_t$ is also the learner's state distribution at step $t$ given no prior mistakes. With probability $p_{t-1}$ the learner is on-track and faces states from $d_t$; otherwise it faces arbitrary states and pays at most $1$:

$$\mathbb{E}_{s \sim d_t^{\hat\pi}}\big[C(s, \hat\pi(s))\big] \;\le\; p_{t-1}\, \mathbb{E}_{s \sim d_t}\big[C(s, \hat\pi(s))\big] + (1 - p_{t-1}) \tag{ii}$$

Step 3: Bound on-track cost. When the learner is on the expert's trajectory, at each state it either matches the expert and pays $C(s, \pi^*(s))$, or deviates and pays at most $1$. So $C(s, \hat\pi(s)) \le C(s, \pi^*(s)) + e(s)$. Taking expectations weighted by $p_{t-1}$ under $d_t$ and applying (i) to both terms:

$$p_{t-1}\, \mathbb{E}_{s \sim d_t}\big[C(s, \hat\pi(s))\big] \;\le\; \mathbb{E}_{s \sim d_t^{\pi^*}}\big[C(s, \pi^*(s))\big] + \epsilon_t \tag{iii}$$

Step 4: Bound the drift probability. Staying on-track through step $t$ requires being on-track through $t-1$ and not erring at $t$, so $p_t = p_{t-1}\,(1 - \mathbb{E}_{s \sim d_t}[e(s)])$. By (i), $p_{t-1}\, \mathbb{E}_{s \sim d_t}[e(s)] \le \epsilon_t$, hence:

$$p_t \;\ge\; p_{t-1} - \epsilon_t$$

Unrolling from $p_0 = 1$ gives $p_t \ge 1 - \sum_{\tau=1}^{t} \epsilon_\tau$, so:

$$1 - p_{t-1} \;\le\; \sum_{\tau=1}^{t-1} \epsilon_\tau \tag{iv}$$

Step 5: Assemble. Substituting (iii) into (ii) gives $\mathbb{E}_{s \sim d_t^{\hat\pi}}[C(s, \hat\pi(s))] \le \mathbb{E}_{s \sim d_t^{\pi^*}}[C(s, \pi^*(s))] + \epsilon_t + (1 - p_{t-1})$. Bounding $1 - p_{t-1}$ by (iv) and collecting the $\epsilon$ terms:

$$\mathbb{E}_{s \sim d_t^{\hat\pi}}\big[C(s, \hat\pi(s))\big] \;\le\; \mathbb{E}_{s \sim d_t^{\pi^*}}\big[C(s, \pi^*(s))\big] + \sum_{\tau=1}^{t} \epsilon_\tau$$

Summing over $t$ gives $J(\hat\pi) \le J(\pi^*) + \sum_{t=1}^{T} \sum_{\tau=1}^{t} \epsilon_\tau$. Since $\sum_{\tau=1}^{t} \epsilon_\tau \le \sum_{\tau=1}^{T} \epsilon_\tau$ for all $t$, the double sum is at most $T \sum_{\tau=1}^{T} \epsilon_\tau$. And since $\sum_{\tau=1}^{T} \epsilon_\tau = T\epsilon$ by assumption, this is at most $T^2 \epsilon$:

$$J(\hat\pi) \;\le\; J(\pi^*) + T^2 \epsilon \qquad \blacksquare$$

This bound is tight. Consider three states $s_0, s_1, s_2$. The expert takes action $a^*$ at $s_0$, transitions to $s_1$, and stays there forever at zero cost. A wrong action at $s_0$ sends the agent to a trap state $s_2$, where it has never seen the correct action and incurs cost $1$ every step. The expert visits $s_0$ only at step $1$, so it is a $1/T$ fraction of the average state distribution. A policy that errs at $s_0$ with probability $T\epsilon$ achieves average error $\epsilon$ under the expert distribution. But at deployment: with probability $T\epsilon$, it falls into $s_2$ and pays cost $1$ for all $T$ steps, giving expected cost $T\epsilon \cdot T = T^2 \epsilon$.
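
The construction is easy to check numerically. Below is a minimal Monte-Carlo sketch of it; the state names match the paragraph above, but the code itself (function names, sample counts) is mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def deployment_cost(T, eps, n_rollouts=100_000):
    """Monte-Carlo cost of the learned policy in the three-state MDP."""
    # With probability T*eps the policy errs at s0 and falls into the trap
    # s2, paying cost 1 at each of the T steps; otherwise it reaches the
    # absorbing state s1 and pays nothing.
    trapped = rng.random(n_rollouts) < T * eps
    return np.mean(np.where(trapped, T, 0.0))

T, eps = 100, 1e-3                # a valid probability needs T*eps <= 1
print("error under expert distribution:", eps)
print("deployment cost:", deployment_cost(T, eps))   # ~ 10.0
print("T^2 * eps:      ", T**2 * eps)                # = 10.0
```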

So $T^2 \epsilon$ is the true price of covariate shift with discrete actions. But the bound relies on a hidden assumption: that exact imitation is possible. With discrete actions, the policy either picks the expert's action or it doesn't. With continuous actions, it can't — no matter how small the error gets, some residual always remains, and as we'll see, that residual can compound exponentially.

The Exponential Bound

Now suppose actions are continuous — the learner outputs a vector, not a discrete choice. The setup is the same: observe expert trajectories, train a policy, deploy it in closed loop. But error is no longer 0-1. The policy can't exactly match the expert's action; there's always some residual. The question is the same: how much can $J(\hat\pi)$ exceed $J(\pi^*)$?

Simchowitz, Pfrommer, and Jadbabaie (2025) show that for a broad class of policies they call simple, error compounds exponentially in the horizon. For continuous states $s \in \mathbb{R}^{d_s}$ and continuous actions $a \in \mathbb{R}^{d_a}$, a simple policy is one satisfying three properties:

  1. Smooth. The deterministic component $\bar\pi(s)$ is Lipschitz and twice differentiable with bounded second derivatives.
  2. Simply-stochastic. The noise shape doesn't depend on state: the policy takes the form $\pi(s) = \bar\pi(s) + w$ with $w$ drawn from a fixed distribution, so only the mean shifts with $s$. Deterministic policies and Gaussians with fixed covariance $\Sigma$ both qualify.
  3. Markovian. The policy maps the current state to an action with no dependence on history or timestep.

This is not a contrived class — it is standard behavioral cloning. Any neural network trained with an $\ell_2$ loss to predict the expert's action from the current state qualifies, whether deterministic or with fixed-variance Gaussian noise.
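
A minimal sketch of such a policy, assuming a tiny tanh network for the mean (the class definition is from the paper; the architecture and names here are illustrative):

```python
import numpy as np

class SimplePolicy:
    """Smooth deterministic mean + state-independent Gaussian noise."""

    def __init__(self, W1, b1, W2, b2, sigma):
        self.W1, self.b1, self.W2, self.b2 = W1, b1, W2, b2
        self.sigma = sigma  # fixed noise scale: simply-stochastic

    def mean(self, s):
        # tanh networks are smooth with bounded first and second
        # derivatives, so the mean satisfies the smoothness property.
        return self.W2 @ np.tanh(self.W1 @ s + self.b1) + self.b2

    def act(self, s, rng):
        # Markovian: depends only on the current state, not history or t.
        # Only the mean shifts with s; the noise shape never changes.
        return self.mean(s) + rng.normal(0.0, self.sigma, size=self.b2.shape)
```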

Theorem (informal; Simchowitz et al., 2025)

Even when the environment is very stable, for any simple algorithm there exists a task where the error grows exponentially in the horizon:

$$J(\hat\pi) - J(\pi^*) \;\gtrsim\; \min\big\{1,\; e^{\Omega(T)}\, \epsilon\big\}$$

where $\epsilon$ is the per-step imitation error under the expert's state distribution.

Theorem (Simchowitz et al., 2025)

Setup. States $s \in \mathbb{R}^{d_s}$, actions $a \in \mathbb{R}^{d_a}$, deterministic dynamics $s_{t+1} = f(s_t, a_t)$.

The dynamics are exponentially incrementally input-to-state stable (E-IISS): there exist $C \ge 1$ and $\rho \in (0, 1)$ such that for any two state-input trajectories $(s_t, a_t)$ and $(s_t', a_t')$,

$$\|s_t - s_t'\| \;\le\; C \rho^{t}\, \|s_0 - s_0'\| \;+\; C \sum_{\tau=0}^{t-1} \rho^{\,t-1-\tau}\, \|a_\tau - a_\tau'\|$$

Initial-state errors and past input perturbations both decay exponentially with age. The system actively contracts disturbances.
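
The definition is concrete enough to check numerically. Here is a sanity check on an assumed scalar linear system $s_{t+1} = \rho s_t + a_t$ with $\rho = 0.9$, which satisfies E-IISS with $C = 1$ (the code and numbers are mine, not from the paper):

```python
import numpy as np

rho = 0.9  # contraction rate of the assumed scalar system

def rollout(s0, actions):
    s, traj = s0, [s0]
    for a in actions:
        s = rho * s + a
        traj.append(s)
    return np.array(traj)

T = 30
nominal = np.zeros(T)
bumped = nominal.copy()
bumped[5] += 1.0  # a one-step input perturbation at t = 5

# The state gap decays like rho^(t-6): the system contracts disturbances.
gap = np.abs(rollout(0.0, bumped) - rollout(0.0, nominal))
for t in (6, 10, 20, 30):
    print(f"t={t:2d}  gap={gap[t]:.5f}  rho^(t-6)={rho**(t-6):.5f}")
```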

The expert-distribution error measures imitation quality on expert states:

$$\epsilon \;=\; \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{s \sim d_t^{\pi^*}}\big[\, \|\hat\pi(s) - \pi^*(s)\|\, \big]$$

Here $d_t^{\pi^*}$ is the state distribution at step $t$ under expert rollouts. This $\epsilon$ is exactly what behavioral cloning minimizes, and exactly the quantity that says nothing about closed-loop behavior.

Theorem. Fix a state dimension $d$ and a smoothness level $\alpha \ge 1$, and let $n$ be the number of expert demonstrations. There exists a family of E-IISS instances with fixed $(C, \rho)$, smooth dynamics, and $\alpha$-smooth deterministic experts such that:

(a) A proper, simple algorithm achieves training error $\epsilon \lesssim n^{-\alpha/d}$.

(b) For any simple algorithm, there is an instance in the family on which:

$$J(\hat\pi) - J(\pi^*) \;\gtrsim\; \min\Big\{1,\; e^{\Omega(T)} \big(n^{-\alpha/d} + \epsilon_0\big)\Big\}$$

The pair $(d, \alpha)$ parameterizes state dimension and expert smoothness, and $n^{-\alpha/d}$ is the minimax rate for learning an $\alpha$-smooth function on $[0,1]^d$ from $n$ noiseless samples. So (a) says training is as statistically tractable as possible, and (b) says it still blows up exponentially at deployment. The floor $\epsilon_0$ is a small saturation term that matters only for very smooth learners.

Proof (informal)

Consider two 2D linear systems with coordinates $(x, y)$. In both, the expert drives $x$ to zero along the $x$-axis while holding $y = 0$. The demonstrations from the two systems are identical, and every demo has $y = 0$ throughout.

The systems secretly differ off-axis: stabilizing a perturbation in $y$ requires a negative-sign feedback for system A and a positive-sign feedback for system B. Each expert applies the right sign, but since $y = 0$ on every demo, the data never reveals it.

The learner must therefore extrapolate, and smoothness forces it to commit to a single sign: stabilizing one of the two systems and destabilizing the other. At deployment, any slight drift off the $x$-axis on the bad system triggers a wrong-sign correction, which pushes $y$ further off, which triggers a larger wrong correction. The deviation grows geometrically through closed-loop feedback — that is the $e^{\Omega(T)}$.
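
The feedback loop is easy to reproduce in one dimension. The sketch below uses assumed off-axis dynamics $y_{t+1} = 0.9\, y_t + b\, u_t$ with $b = +1$ for system A and $b = -1$ for system B, and an assumed expert gain of $0.5$; the numbers are illustrative, not from the paper:

```python
def final_y(b, gain, y0=1e-3, T=60, a=0.9):
    """Closed-loop |y| after T steps when the policy plays u = gain * y."""
    y = y0
    for _ in range(T):
        y = a * y + b * gain * y
    return abs(y)

# Expert: the correct sign for each system contracts y by 0.4 per step.
print(final_y(b=+1, gain=-0.5))  # system A with A's expert: -> 0
print(final_y(b=-1, gain=+0.5))  # system B with B's expert: -> 0

# Learner: demos never leave y = 0, so smoothness forces one sign everywhere.
# On system B the correction has the wrong sign: y' = (0.9 + 0.5) y = 1.4 y.
print(final_y(b=-1, gain=-0.5))  # geometric blowup: 1.4^60 * y0 ~ 5.9e5
```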

The bound applies to any non-interactive algorithm — behavioral cloning, offline RL, inverse RL — provided the returned policy is simple: smooth, simply-stochastic, and Markovian. Read the other way, the assumptions are a blueprint: violate any one and the exponential blowup is no longer forced.

Breaking the Bound

The bound rests on four assumptions — non-interactivity, smoothness, simple stochasticity, and the Markov property — and each is a potential axis of attack. The three methods below predate the theory; they were discovered empirically, and the bound arrived later to explain why they work. Read forward, though, the framing is generative: the fourth axis, still unbroken, is where new algorithms might live.

DAgger — short for Dataset Aggregation — breaks the non-interactivity assumption. Instead of training only on expert demonstrations, DAgger rolls out the learned policy, visits the states the learner actually encounters, and queries the expert for the correct action at those states. The loss is now evaluated under $d_{\hat\pi}$ rather than $d_{\pi^*}$, which eliminates the covariate shift that drives both the quadratic and exponential bounds. The catch is that it requires expert access during training, which in robotics means a human teleoperating on demand at every iteration.
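
A minimal sketch of the loop, with `env_reset`, `env_step`, `expert_action`, and `fit` as assumed interfaces (the original algorithm also mixes expert and learner actions with a decaying schedule, which this sketch collapses to rolling out the learner directly):

```python
def dagger(env_reset, env_step, expert_action, fit, T, n_iters):
    """Roll out the current learner, relabel the states it actually visits
    with expert actions, aggregate, and refit -- so the loss is evaluated
    under the learner's own state distribution."""
    data = []                   # aggregated (state, expert action) pairs
    policy = expert_action      # iteration 0: roll out the expert itself
    for _ in range(n_iters):
        s = env_reset()
        for _ in range(T):
            data.append((s, expert_action(s)))  # query expert at visited s
            s = env_step(s, policy(s))          # but follow the learner
        policy = fit(data)                      # supervised regression
    return policy
```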

Action chunking breaks the Markov property by predicting a chunk of actions from the current state and executing them open-loop, so mid-chunk actions depend on the state at the start of the chunk, not the current state. The exponential blowup in the lower bound comes from a destabilizing feedback loop: the policy observes its own drift, applies the wrong correction, drifts further, repeat. Within a chunk there is no feedback, and under E-IISS open-loop perturbations decay exponentially. The same stability assumption that powers the lower bound is what makes chunking work.
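
A sketch of chunked execution, with `policy_chunk` and `env_step` as assumed interfaces:

```python
def rollout_chunked(env_step, policy_chunk, s0, T, k):
    """Execute k actions open-loop per replanning point. Mid-chunk there
    is no feedback, so the policy cannot react to (or amplify) its own
    drift; under E-IISS dynamics that drift decays on its own."""
    s, t = s0, 0
    while t < T:
        actions = policy_chunk(s)            # commit to k actions at once
        for a in actions[: min(k, T - t)]:
            s = env_step(s, a)               # no state feedback mid-chunk
        t += k                               # replan only every k steps
    return s
```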

Diffusion policies break simple stochasticity. A simply-stochastic policy can shift the mean of its output distribution with state but not reshape it. The lower bound exploits exactly this: two systems require opposite feedback signs, and a simply-stochastic policy must commit to one sign everywhere, destabilizing the other. A diffusion policy’s output distribution changes shape with the input state, so it can maintain both signs as separate modes and sample contextually in the ambiguous region.
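
A full diffusion model is overkill to illustrate the point; a toy two-mode policy already violates simple stochasticity in the relevant way. The gains and the weighting function below are illustrative, not from the paper:

```python
import numpy as np

def two_mode_policy(y, rng, gain=0.5, scale=1e-3):
    """Output distribution reshapes with the state: probability mass sits
    on both feedback signs, with state-dependent weights. A simply-
    stochastic policy could only shift one fixed shape around."""
    p_pos = 1.0 / (1.0 + np.exp(-y / scale))    # mode weight varies with y
    sign = 1.0 if rng.random() < p_pos else -1.0
    return sign * gain * y + rng.normal(0.0, scale / 10)
```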

References

  • Ross, S. & Bagnell, J. A. (2010). “Efficient Reductions for Imitation Learning.” Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 9:661–668.
  • Simchowitz, M., Pfrommer, D., & Jadbabaie, A. (2025). “The Pitfalls of Imitation Learning when Actions are Continuous.” arXiv:2503.09722.