Critical Interval MSE: Toward Reliable Offline Validation for Robot Policy

Anonymous

Critical Interval MSE
Toward Reliable Offline Validation for Robot Policy

Anonymous Author(s)
Anonymous Institution

Evaluating robot policies with real rollouts is reliable but costly, while raw offline MSE is cheap but often misleading. (Left) CI-MSE focuses validation on task-critical intervals; it yields offline rankings that better match rollout performance. (Right)

TL;DR

Real-world robot evaluation is costly, while raw MSE often misses the actions that determine success. Critical Interval MSE (CI-MSE) validates only task-critical segments and matches rollout-time action alignment, improving simulation rank correlation with rollout success from raw MSE's -0.61 to -0.87 across 27 policy checkpoints, a 0.26 absolute gain toward the ideal -1 ordering.

Abstract

Real-world evaluation is the gold standard for robot manipulation policies because it tests them against the physical conditions and deployment challenges they are ultimately designed to handle. However, real-world evaluation is also the bottleneck for iterating on robot policies: it is costly, difficult to reproduce, and often too sparse to reliably compare nearby model variants. A straightforward proxy for performance is validation loss on expert demonstrations, but this proxy is often poorly correlated with real-world performance. In this paper, we introduce Critical Interval MSE (CI-MSE), an intuitively simple yet effective offline validation metric. CI-MSE restricts error computation to task-critical segments and pairs it with simple action-alignment procedures that better match rollout-time behavior. Across simulation and real-world experiments, CI-MSE yields a stronger correlation between validation error and rollout performance than raw MSE. Across a wide range of policy checkpoints, CI-MSE achieves a Spearman's rank correlation of -0.87, much closer to the ideal value of -1 than raw MSE's -0.61, demonstrating a significant improvement. We show through sensitivity analysis that our metric is robust to a wide range of hyperparameters. We further study the effectiveness of CI-MSE under evaluation distribution shifts and suggest design boundaries when using this metric. In summary, this paper provides a simple and reliable offline validation tool for accelerating policy iteration.

Execution Records

Intuition

Critical intervals in a robot manipulation trajectory — Aggregated MSE is dominated by uncritical move actions, while CI-MSE focuses on the critical grasp stage.

Offline validation can be misleading because many trajectory timesteps are not decisive for task completion, yet they can contribute large action errors. In the bottle-transfer example, transition motions may vary safely across demonstrations, while the grasp stage is brief, contact-rich, and tightly tied to success. CI-MSE therefore measures error on critical intervals instead of averaging across the whole trajectory, so the validation signal is less obscured by uninformative motion.

Pour Water demo: the robot first grasps a randomly placed drinking bottle, then pours water into a mug, and finally places the bottle on a red coaster.

Approach

STEP 1: Critical interval filter. Identify contiguous task-critical segments such as contact, grasping, insertion, or fine alignment, then compute error only within those intervals. This removes long transit or idle phases that can dominate raw MSE without determining success.

STEP 2: Rollout-aware action alignment. Apply the same deployment-time inference method, such as temporal ensembling or RTC, during offline validation. The measured error then reflects executed action chunks rather than raw one-step predictions.

STEP 3: Local temporal matching. Use DTW with a small warping window to compare predicted and expert action sequences under minor timing shifts. This discounts harmless phase offsets while keeping large reorderings or failures visible.

Simulation Performance

In simulation, CI-MSE is evaluated on a large-scale manipulation benchmark with many policy variants. The comparison includes changes in model architecture, training data scale, training steps, parameter-efficient finetuning, action-head size, and VLM backbone. The key question is whether an offline validation metric can preserve the same ordering as rollout success rate.

Demo videos from the data scale variant family.

Success Rate vs Validation Error

Interactive chart data is unavailable because JavaScript is disabled. The simulation results compare CI-MSE and Raw MSE validation errors against rollout success and rank ordering.

The simulation results show that raw MSE can be reliable for some controlled checkpoint families, but can mis-rank policies when behavior changes through data scale, architecture, or backbone choice. CI-MSE is designed to be more robust in exactly these cases because it focuses on the trajectory segments most responsible for task completion.

Real-World Performance

Real-world evaluation tests whether the simulation trend survives physical deployment. The paper studies diffusion policies on a Franka arm across pour-water, pick-place-mouse, fold-towel, and unplug tasks, using cross-environment and cross-object validation shifts.

Because real-world trial budgets are small and partial success scores can be noisy, the paper uses Elo-style pairwise ranking as the main rollout target. This makes model ordering more stable and gives offline validation a cleaner target for correlation analysis.

The results also show an important practical caveat: validation demonstrations should be collected under conventions similar to the training data. When operators have different action styles, validation error may partly measure style mismatch rather than policy quality.

Validation metric	Collectors matched				Collectors mismatched
	Pour water		Arrange mouse		Fold towel		Unplug
	Env. r↓	Obj. r↓	Env. r↓	Obj. r↓	Env. r↓	Obj. r↓	Env. r↓	Obj. r↓
CI-MSE	-0.99	-0.99	-0.96	-0.66	-0.05	-0.87	0.48	-0.53
Raw MSE	-0.47	-0.86	-0.80	-0.06	0.27	-0.09	0.62	-0.86

Pearson correlation between validation error and real-world rollout Elo score. Each task is evaluated across unseen environments (Env.) and unseen objects (Obj.).

Takeaway

CI-MSE is effective, improving the agreement between offline validation and rollout performance over raw MSE.
CI-MSE is validated in both simulation and real-world experiments, and is robust across a wide range of hyperparmeter selction.
The code repository provides an automated pipeline for building customized offline validation benchmark.

Critical Interval MSE Toward Reliable Offline Validation for Robot Policy