Physical Intelligence’s VLA lineage, 2024–2026. Each generation compared across architecture, training recipe, contributions, and limitations.
Backbone: Gemma 2B → 2.6B → 4B (stable at 4B from π0.6 onward). Action expert: 300M → 300M → 860M (the big jump comes at π0.6).
Training recipe: pure flow matching (π0) → hybrid FAST + flow, two-stage (π0.5) → Knowledge Insulation with stop-gradient (π0.6+). KI is the structural backbone for everything after.
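The Knowledge Insulation split described above can be sketched in a few lines. This is a minimal JAX sketch with toy linear layers standing in for the VLM and action expert (all names, shapes, and data are hypothetical, not the papers' implementation): the VLM branch gets a cross-entropy loss on discrete FAST tokens, while the action expert's flow-matching loss reads VLM features through `jax.lax.stop_gradient`, so expert gradients never update the backbone.

```python
import jax
import jax.numpy as jnp

def ki_losses(vlm_params, expert_params, batch):
    """Return (fast_ce, flow_mse); the training loss is their sum.

    Toy linear layers stand in for the real transformer branches.
    """
    # "VLM": features + FAST-token logits.
    feats = jnp.tanh(batch["obs"] @ vlm_params["w"])                  # (B, D)
    logits = feats @ vlm_params["head"]                               # (B, V)
    log_p = jax.nn.log_softmax(logits)
    fast_ce = -jnp.mean(jnp.take_along_axis(log_p, batch["fast"][:, None], axis=-1))

    # KI barrier: the expert reads features but cannot update the VLM.
    insulated = jax.lax.stop_gradient(feats)

    # "Action expert": flow matching on a linear noise-to-action path,
    # whose velocity target is simply (actions - noise).
    tau = batch["tau"][:, None]
    noised = tau * batch["actions"] + (1.0 - tau) * batch["noise"]
    pred = jnp.concatenate([insulated, noised], axis=-1) @ expert_params["w"]
    flow_mse = jnp.mean((pred - (batch["actions"] - batch["noise"])) ** 2)
    return fast_ce, flow_mse

# Toy data purely for illustration.
B, Din, D, V, A = 4, 8, 16, 32, 6
vlm_params = {"w": jnp.full((Din, D), 0.1), "head": jnp.full((D, V), 0.1)}
expert_params = {"w": jnp.full((D + A, A), 0.1)}
ks = jax.random.split(jax.random.PRNGKey(0), 4)
batch = {
    "obs": jax.random.normal(ks[0], (B, Din)),
    "fast": jnp.zeros((B,), dtype=jnp.int32),
    "tau": jax.random.uniform(ks[1], (B,)),
    "actions": jax.random.normal(ks[2], (B, A)),
    "noise": jax.random.normal(ks[3], (B, A)),
}
# The flow loss contributes exactly zero gradient to the VLM weights:
flow_grads = jax.grad(lambda vp: ki_losses(vp, expert_params, batch)[1])(vlm_params)
```

The point of the stop-gradient is visible directly: `flow_grads` is identically zero, so only the FAST cross-entropy shapes the backbone, which is why KI keeps the VLM's knowledge "insulated" from noisy continuous-action gradients.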
Supervision evolution: demos-only → + hierarchy/verbal/web → + metadata → + RL reward + advantage conditioning + interventions → + subgoal images + diverse “how” metadata + autonomous/failure data + egocentric human video.
Philosophy shift per generation: π0.5: “can we generalize to new homes?” → π0.6: “can we get specialist-level out-of-box performance?” → π*0.6: “can VLAs learn from experience via RL?” → π0.7: “can we get compositional generalization via rich prompting?”
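The advantage conditioning in the π*0.6 recipe decodes with classifier-free-guidance-style extrapolation. A one-line numpy sketch of the standard CFG mixing rule (inputs are hypothetical; the real models apply this to the action expert's velocity predictions, with the β ranges quoted later in the tables):

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, beta):
    """CFG-style extrapolation: beta = 1 recovers the conditioned
    prediction; beta > 1 (e.g. the 1.5-2.5 range quoted for pi*0.6)
    pushes further along the conditioned direction."""
    return v_uncond + beta * (v_cond - v_uncond)

v = guided_velocity(np.array([2.0]), np.array([1.0]), beta=2.0)  # -> [3.0]
```

This is also why the advantage indicator is dropped for a fraction of training examples: the model must learn both the conditioned and unconditioned predictions for the mixing rule to have two terms to extrapolate between.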
| | π0 (2024): Foundation paper | π0.5 (2025): Generalize to new homes | π0.6 / π*0.6 (Nov 2025): Two separate papers, same policy architecture. π0.6 (Nov 17) is the base model trained with supervised KI. π*0.6 (Nov 19) takes π0.6's weights and further trains them with RECAP — offline RL pre-train → SFT → iterative rollouts + interventions, plus a training-only 670M value function. Different checkpoints, but one release cycle. Specialist-level out of the box + RECAP RL recipe | π0.7 (2026): Compositional generalization |
|---|---|---|---|---|
| Size | 3.3B: PaliGemma 3B (SigLIP 400M + Gemma 2B) + 300M action expert | SigLIP 400M + Gemma 2.6B + 300M action expert. Two-stage: pre-train 280k steps (discrete only) → post-train 80k steps (adds flow matching). | SigLIP 400M + Gemma 3 4B + 860M action expert (same depth as backbone). π*0.6 adds a separate 670M distributional value function (Gemma 3 + SigLIP 400M, 201 value bins) for RL; identical policy architecture. | ~5B: Gemma 3 4B + 400M SigLIP + 860M action expert + MEM video history encoder (6 history frames @ 1s stride, compressed to single-frame token count). Up to 4 cameras + up to 3 subgoal images @ 448×448. Separate 14B BAGEL-initialized world model for subgoal generation. |
| Key ideas | VLM + separate action expert MoE branch; conditional flow matching for continuous action chunks (H=50, 10 integration steps, 50 Hz); cross-embodiment pre-training | Heterogeneous co-training: mobile (~400 h, ~2.4% of pre-training mixture), non-mobile diverse env (ME), lab cross-embod (CE), high-level subtask (HL), verbal instructions (VI), web data (WD). Hierarchical inference in one unified model. Hybrid training: FAST discrete tokens in pre-training, flow matching in post-training. | π0.6 (base): keeps the π0.5 hierarchy. Bigger backbone + bigger action expert. Metadata conditioning in the prompt. Full Knowledge Insulation (KI) training: VLM predicts FAST tokens, action expert predicts continuous actions, stop-gradient prevents action-expert gradients flowing into the VLM. Up to 4 images at 448×448. π*0.6 (RECAP RL on top): adds binarized advantage indicator I_t to the prompt (CFGRL-style). (1) Train a multi-task value function via cross-entropy on discretized MC returns. (2) Per-task threshold ε_ℓ at ~30–40th percentile. (3) Offline RL pre-train → SFT → iterate: autonomous rollouts + HG-DAgger interventions → retrain value + policy. (4) Sparse reward: −1 per step, −C_fail on failure, 0 on success; value = expected negative steps-to-completion. | Diverse prompt conditioning: detailed language + subtask ℓ̂ + multi-view subgoal images + metadata (speed/quality/mistake flag) + control mode. Each component randomly dropped during training → flexible inference. Language coaching: humans give step-by-step instructions to teach new long-horizon tasks without action data, then distill into a high-level policy. Trains on much more diverse data including suboptimal autonomous rollouts (incl. π*0.6 RL training data), failures, and egocentric human video. Subgoals refreshed on subtask change or every 4 s. |
| Core contributions | First flow-matching VLA with high-frequency chunks; pre-training/post-training recipe analogous to LLMs; ~10,000 hours — largest robot experiment at the time | First end-to-end system doing 10–15 min dexterous tasks in entirely new homes. Co-training recipe — each ablation hurts. 104-location model matches one trained directly on test homes. | π0.6: specialist-level out-of-box performance without task-specific fine-tuning. Folds laundry reliably + fully assembles boxes 20% out of box (previously needed fine-tuning for non-zero success). ~2× throughput vs π0.5. 63 ms/chunk on single H100. π*0.6: first general RL recipe that scales to large flow-matching VLAs via advantage conditioning. Handles good + bad data together. >2× throughput, ~2× failure reduction on hardest tasks. 13 h continuous espresso, factory-grade box assembly. Beats AWR and PPO baselines. | Strong signs of compositional generalization — the “grand challenge”. (1) Out-of-box specialist-level dexterity matching π*0.6 RL specialists, no RL or fine-tuning needed. (2) Zero-shot cross-embodiment transfer — folds t-shirts on bimanual UR5e (never trained for this), matching expert teleoperators (85.6%/80% vs 90.9%/80.6%). (3) Discovers new manipulation strategies suited to target embodiment (vertical grasps on UR5e). (4) New short-horizon tasks out of box. (5) New long-horizon tasks via coaching → autonomous distillation. (6) Breaks dataset biases (reverse bussing). |
| Results & shortcomings | Beat OpenVLA/Octo by large margins on shirt folding, bussing, bagging. Fine-tuning needed for laundry/boxes. Shortcomings: not all tasks work reliably, data composition unclear, positive transfer to navigation/locomotion unexplored, language following relies on sub-task decomposition. | Strong on dishes-in-sink, items-in-drawer, laundry, make-bed in 3 real unseen homes. Shortcomings: persistent challenges with unfamiliar affordances (weird handles, hard-to-open cabinets), partial observability (arm occluding spill), high-level inference loops (open/close drawer repeatedly), only simple prompts, modest context window. | π0.6: big gains on shirt/laundry folding, box assembly, table bussing, mobile tasks, generalization benchmarks. Shortcomings: still relies on fine-tuning for some specialists; card doesn't deeply analyze failure modes. π*0.6: espresso, box assembly, laundry all reach 90%+. Shortcomings: not fully autonomous — needs human reward labels, interventions, resets. Naive exploration. Iterated offline RL (not concurrent online). Corrections don't fix overall speed or subtle behaviors. MC on-policy value — could benefit from off-policy Q-learning. | Seen tasks often >90%; unseen tasks / novel embodiment combos 60–80%. Mixed-quality data scaling: without metadata, more/noisier data hurts; with metadata it keeps improving. Shortcomings: unseen success rate still well below in-dist. Very hard to define what's “truly unseen” in such a diverse dataset. World model inference is expensive (1.25 s/subgoal with 4×H100 + 8-bit + SageAttention). Only manipulation, no navigation/locomotion claims. |
| Diff vs prev | Foundation paper. | Hierarchical inference in the same model (π0 used external VLM). Introduced FAST + hybrid training (π0 was pure flow matching). Mobile-manipulation focus. Web co-training. Verbal instruction demos as novel supervision modality. | π0.6 vs π0.5: Gemma 2.6B → 4B. Action expert 300M → 860M (~3×). Metadata conditioning. Full KI recipe. π*0.6 vs π0.6: adds advantage indicator token in the prompt + 670M value function + full RL training loop (offline RL pre-train → SFT → K iterations of rollouts + retraining). Policy architecture identical; different checkpoint produces task-specific RL specialists. | Multi-modal prompt (language + subgoals + metadata + control mode). MEM memory encoder. Separate BAGEL-based world model. Egocentric human video + suboptimal autonomous data + π*0.6 RL rollouts in training. Training-time RTC for latency tolerance. Emphasis on steerability and emergent generalization rather than pure throughput/robustness. |
| Appendix | Blockwise causal mask with 3 blocks ([images+text], [state], [noisy actions]). Action expert: width=1024, mlp_dim=4096. Beta(1.5,1) timestep with s=0.999 (up to 1,000 integration steps). Temporal ensembling was tried and hurt performance — chunks executed open-loop. 73 ms onboard / 86 ms off-board on RTX 4090. | Action expert tokens do NOT attend to FAST tokens (avoids info leakage between two action representations). Adaptive RMSNorm for τ injection (vs. input fusion in π0). α=10.0 post-training. Image aug: crop 0.95×, rotate ±5°, color jitter. | π0.6 card: no formal appendix — 4-page card. Image tokens bidirectional, text tokens now causal. π*0.6: advantage dropout 30% for test-time CFG. β ∈ [1.5, 2.5] — high β pushes actions to support boundaries (aggressive). PPO baseline needed SPO-style constraint with η=0.01 for stability. Detailed flow-matching ELBO decomposition (AR + diffusion). Data per task: T-shirt laundry uses autonomous-only data (no corrections); diverse laundry uses 450 autonomous + 287 correction eps; box assembly uses 600 autonomous + 360 correction eps/iter on 3 robots; cafe uses 414 autonomous + 429 correction eps. | Subgoal sampling: 25% of examples have subgoal images; within those, 25% use end-of-segment, 75% sample 0–4s ahead uniformly. Subtask instruction dropped 30% when image is present. Metadata dropped 15% entirely; each component +5% individually. State uses linear projection, not text tokenization (change from π0.6). Minimal variant: 38 ms/chunk on single H100; 127 ms with MEM + subgoal. World model: 4×H100 tensor-parallel, 8-bit matmuls, SageAttention → 1.25 s per subgoal, 25 denoising steps. Ablation on prior models (π0.5, π0.6): EE control showed no substantial advantage over joint control on cross-embodiment tasks, so main cross-embodiment experiments use joint-space control. |
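The flow-matching decoding referenced throughout the table (10 integration steps for π0/π0.5, 5 for π0.6/π0.7) is plain ODE integration from Gaussian noise to an action chunk. A numpy sketch under stated assumptions: `velocity_fn` is a hypothetical stand-in for the action expert, and the oracle below is the exact velocity field of a straight-line path, which Euler integration follows to the target; it is not the learned model.

```python
import numpy as np

def sample_action_chunk(velocity_fn, horizon=50, action_dim=2, steps=10, seed=0):
    """Euler-integrate a velocity field from noise (tau=0) to an
    action chunk (tau=1), mirroring few-step flow-matching decoding."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((horizon, action_dim))   # start from noise
    dt = 1.0 / steps
    for k in range(steps):
        a = a + dt * velocity_fn(a, k * dt)          # one Euler step
    return a

# Oracle velocity for a straight-line path to `target`: integrating it
# recovers the target chunk, illustrating why so few steps suffice.
target = np.ones((50, 2))
oracle = lambda a, tau: (target - a) / (1.0 - tau)
chunk = sample_action_chunk(oracle)
```

With the linear interpolation path used in these models, the ideal velocity is constant along each trajectory, which is what makes 5 to 10 Euler steps enough in practice.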
Concrete numbers pulled from each paper: architecture configs, training recipe, data composition, inference latency, and RL/world-model specifics. Dashes mean the component doesn’t exist for that generation.
| | π0 | π0.5 | π0.6 / π*0.6 | π0.7 |
|---|---|---|---|---|
| Architecture | | | | |
| Total params | 3.3B | 400M + 2.6B + 300M (total not stated) | 400M + 4B + 860M (total not stated). π*0.6: + 670M value function | ~5B + separate 14B world model |
| VLM backbone | PaliGemma 3B (SigLIP 400M + Gemma 2B) | SigLIP 400M + Gemma 2.6B | Gemma 3 4B (policy). π*0.6: value function uses same Gemma 3 + SigLIP 400M architecture | Gemma 3 4B (incl. 400M SigLIP) |
| Action expert | 300M | 300M | 860M (same depth as backbone); identical in π*0.6 | 860M |
| Action expert config | width=1024, mlp_dim=4096 | width=1024, mlp_dim=4096 | not stated | not stated |
| Image resolution | not stated | not stated | 448×448 | 448×448 (VAE inputs 512×384 in WM) |
| Max cameras | 2–3 per robot | up to 4 (front/back/2× wrist) | up to 4 (base/2× wrist/optional back). π*0.6: experiments use 3 (base + 2× wrist) on static bimanual | up to 4 + up to 3 subgoal images |
| History frames | — | — | — | 6 @ 1s stride (MEM encoder) |
| State encoding | linear projection | discretized text tokens | discretized text tokens | linear projection |
| Attention pattern | blockwise causal, 3 blocks: [images+text] [state] [actions] | prefix mask on images/prompt/state; FAST causal on prefix + prior FAST; action expert bidir on prefix, no attend to FAST | bidir images, causal text, bidir action tokens. π*0.6: adds advantage indicator token in prompt | block-causal: obs + subgoal bidir within; goal images attend obs; text causal |
| Action chunk length | 50 (H=50 in paper) | 50 (H=49 in paper) | not stated | 50 (exec 15–25 steps per chunk) |
| Control frequency | up to 50 Hz (20 Hz on UR5e/Franka) | 50 Hz | not stated. π*0.6: experiments: 50 Hz (static bimanual, joint) | 50 Hz (20 Hz on UR5e) |
| Training recipe | | | | |
| Objective | conditional flow matching | hybrid: FAST tokens (pre) + flow matching (post) | Knowledge Insulation (FAST in VLM, flow in action expert, stop-grad). π*0.6: adds advantage-conditioned policy (CFGRL-style) | KI + diverse prompt dropout |
| Stages | pre-train + task-specific post-train | pre-train 280k (discrete only, α=0) + post-train 80k (adds flow, α=10) | single-stage KI. π*0.6: offline RL pre-train → SFT → K iterations of { collect data, retrain V, retrain π } | not stated |
| Timestep distribution | Beta((s−τ)/s; 1.5, 1), s=0.999 | same as π0 | not stated | not stated |
| Timestep injection | MLP fused into action token embedding | separate MLP + adaptive RMSNorm per layer | not stated | adaptive RMSNorm |
| Loss weight α (post) | — | 10.0 | — | — |
| Image augmentation | not stated | RandomCrop 0.95×, Rotate ±5°, ColorJitter(0.3/0.4/0.5) | not stated | not stated |
| Action normalization | not stated | [−1, 1] via 1/99% quantile per dim | not stated | not stated |
| Data | | | | |
| Own robot data | ~10,000 h / 903M timesteps / 7 robot configs / 68 tasks | ~400 h mobile manipulation (2.4% of pre-training mixture) + diverse non-mobile + lab cross-embodiment | “largely inherits” π0.5 composition. π*0.6: adds on-policy rollouts per task: diverse laundry (450 auto + 287 correction), box (600 auto + 360 correction / iter on 3 robots), cafe (414 auto + 429 correction); T-shirt laundry is auto-only | demonstrations + autonomous rollouts (incl. π*0.6 RL data) + failures + egocentric human video |
| External data | OXE (9.1%), Bridge v2, DROID | OXE + multimodal web | same as π0.5 (per card) | same + open-source image-editing + open video datasets (for world model) |
| Web co-training tasks | — | captioning, VQA, object localization (bounding box / keypoint) | bounding box + keypoint prediction + general multi-modal web | same + video captioning (robot + web) |
| Metadata in prompt | — | — | “conditioning metadata” in prompt (content not specified in card). π*0.6: adds binarized advantage indicator I_t | overall speed (500-step bins, e.g. “2000”) + quality (1–5) + mistake (bool) + control mode (joint/ee) |
| Language supervision | task names + ~2s segment annotations | + high-level subtask labels + verbal instructions | not stated | + more detailed language + step-by-step human coaching |
| Inference | | | | |
| Denoising steps | 10 | 10 | 5 | 5 |
| Chunk execution | open-loop; inference every 0.8 s at 20 Hz (after 16 actions) / every 0.5 s at 50 Hz (after 25 actions); temporal ensembling was tried and hurt performance | not stated | not stated | async with training-time RTC (0–12 step delay, up to 240 ms on 50 Hz robot) |
| Latency | 73 ms onboard / 86 ms off-board (RTX 4090, 3 cams) | not stated | 63 ms on single H100 (3 cams, 5 denoising steps) | 38 ms minimal variant / 127 ms w/ MEM + subgoals (single H100) |
| Classifier-free guidance | — | — | — for π0.6. π*0.6: β ∈ [1.5, 2.5] on episode metadata | β ∈ {1.3, 1.7, 2.2} (moderate values) on any part of prompt |
| RL (π*0.6) / world model (π0.7) | | | | |
| Value function | — | — | π*0.6: 670M VLM (Gemma 3 + SigLIP 400M) + value head, 201 discretized return bins, co-trained on web data | — |
| Reward | — | — | π*0.6: r_t = 0 (success step T) / −C_fail (failure step T) / −1 otherwise; value normalized to [−1, 0] per task | — |
| Advantage estimation | — | — | π*0.6: N=50 step lookahead (post-train); full trajectory T-step (pre-train) | — |
| Advantage threshold ε_ℓ | — | — | π*0.6: ~30th percentile (pre-train) / ~40th (fine-tune) / 10th (strict T-shirt task) | — |
| Advantage dropout | — | — | π*0.6: 30% (enables test-time CFG) | — |
| World model | — | — | — | 14B BAGEL-init (7B LLM backbone + 7B generation backbone, ViT 448×336, VAE 512×384), 25 denoising steps, 1.25 s/subgoal on 4× H100 w/ 8-bit matmuls + SageAttention |
| Subgoal sampling | — | — | — | 25% of training examples include subgoal images; within those, 25% use end-of-segment, 75% sample uniformly 0–4 s ahead. Subgoals refreshed every Δ=4 s or on subtask change. |
| Prompt dropout (π0.7) | — | — | — | subtask instruction: 30% (when image present) · entire metadata: 15% · each metadata component: +5% · history frames: 30% · rear view: 30% |
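The π*0.6 rows above compose into a short pipeline: Monte-Carlo returns under the sparse reward, an advantage against a value baseline, a per-task percentile threshold ε_ℓ, and a binarized indicator. A numpy sketch under stated assumptions: `c_fail`, the value baselines, and the toy episodes are hypothetical (the card does not give the failure constant), and the real system uses a learned 670M value function rather than fixed baselines.

```python
import numpy as np

def mc_returns(lengths, failed, c_fail=10.0):
    """Return of the sparse reward (-1 per step, -c_fail on failure,
    0 on success): negative steps-to-completion minus any penalty."""
    lengths = np.asarray(lengths, dtype=float)
    failed = np.asarray(failed, dtype=bool)
    return -lengths - c_fail * failed

def advantage_indicator(returns, values, percentile=30.0):
    """Binarized advantage token I = 1[A > eps], with eps set at a
    per-task percentile of the advantage distribution (~30th for
    pre-training per the table above)."""
    adv = np.asarray(returns, dtype=float) - np.asarray(values, dtype=float)
    eps = np.percentile(adv, percentile)
    return (adv > eps).astype(int)

# Three toy episodes: success in 10 and 20 steps, one 30-step failure.
rets = mc_returns([10, 20, 30], [False, False, True])          # [-10, -20, -40]
flags = advantage_indicator(rets, values=[-25.0, -25.0, -25.0])  # [1, 1, 0]
```

Because the threshold is a percentile rather than zero, a fixed fraction of data is always labeled "good", which is what lets the recipe train on good and bad rollouts together and then condition on the good label (with CFG-style guidance) at test time.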