
The π model family

Physical Intelligence’s VLA lineage, 2024–2026. Each generation compared across architecture, training recipe, contributions, and limitations.

Quick synthesis

Backbone: Gemma 2B → 2.6B → 4B (and stabilizes at 4B). Action expert: 300M → 300M → 860M (big jump at π0.6).

Training recipe: pure flow matching (π0) → hybrid FAST + flow, two-stage (π0.5) → Knowledge Insulation with stop-gradient (π0.6+). KI is the structural backbone for everything after.

Supervision evolution: demos-only → + hierarchy/verbal/web → + metadata → + RL reward + advantage conditioning + interventions → + subgoal images + diverse “how” metadata + autonomous/failure data + egocentric human video.

Philosophy shift per generation: “can we generalize to new homes?” → “can we get specialist-level out-of-box performance?” → “can VLAs learn from experience via RL?” → “can we get compositional generalization via rich prompting?”
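The recipe evolution above starts from π0's pure flow matching: decoding an action chunk means integrating a learned velocity field from noise (τ=0) to actions (τ=1). A minimal sketch, where a hypothetical closed-form optimal-transport field stands in for the learned 300M action expert (a perfectly trained model would approximate it), using π0's settings of H=50 and 10 integration steps:

```python
import random

def flow_match_decode(noise, target, n_steps=10):
    """Euler-integrate a velocity field from noise (tau=0) to actions (tau=1).

    In pi-0 the field v(a_tau, tau | obs) is predicted by the action expert;
    here we substitute the closed-form field v = (target - a) / (1 - tau),
    under which the integration recovers `target` (illustration only).
    """
    a = list(noise)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        tau = step * dt
        v = [(t - x) / (1.0 - tau) for t, x in zip(target, a)]  # toy velocity field
        a = [x + dt * vx for x, vx in zip(a, v)]                # Euler step
    return a

# A "chunk" of H=50 one-dof actions, decoded in 10 integration steps (pi-0's setting).
random.seed(0)
noise = [random.gauss(0, 1) for _ in range(50)]
target = [0.1 * i for i in range(50)]   # invented "ground-truth" chunk
decoded = flow_match_decode(noise, target)
print(max(abs(d - t) for d, t in zip(decoded, target)))
```

With the exact field, the final Euler step lands on the target up to float rounding; the real model replaces that field with a network conditioned on observations and language.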

π0 (2024)
Foundation paper

π0.5 (2025)
Generalize to new homes

π0.6 / π*0.6 (Nov 2025)
Specialist-level out of the box + RECAP RL recipe
Two separate papers, same policy architecture. π0.6 (Nov 17) is the base model trained with supervised KI. π*0.6 (Nov 19) takes π0.6's weights and further trains them with RECAP: offline RL pre-train → SFT → iterative rollouts + interventions, plus a training-only 670M value function. Different checkpoints, but one release cycle.

π0.7 (2026)
Compositional generalization
Size

π0 (3.3B): PaliGemma 3B (SigLIP 400M + Gemma 2B) + 300M action expert.

π0.5: SigLIP 400M + Gemma 2.6B + 300M action expert. Two-stage: pre-train 280k steps (discrete only) → post-train 80k steps (adds flow matching).

π0.6: SigLIP 400M + Gemma 3 4B + 860M action expert (same depth as the backbone). π*0.6 adds a separate 670M distributional value function (Gemma 3 + SigLIP 400M, 201 value bins) for RL; the policy architecture is identical.

π0.7 (~5B): Gemma 3 4B + 400M SigLIP + 860M action expert + MEM video-history encoder (6 history frames @ 1 s stride, compressed to a single frame's token count). Up to 4 cameras + up to 3 subgoal images @ 448×448. Separate 14B BAGEL-initialized world model for subgoal generation.
Key ideas

π0: VLM + separate action-expert MoE branch; conditional flow matching for continuous action chunks (H=50, 10 integration steps, 50 Hz); cross-embodiment pre-training.

π0.5: Heterogeneous co-training: mobile (~400 h, ~2.4% of the pre-training mixture), non-mobile diverse environments (ME), lab cross-embodiment (CE), high-level subtask (HL), verbal instructions (VI), web data (WD). Hierarchical inference in one unified model. Hybrid training: FAST discrete tokens in pre-training, flow matching in post-training.

π0.6 (base): Keeps the π0.5 hierarchy. Bigger backbone + bigger action expert. Metadata conditioning in the prompt. Full Knowledge Insulation (KI) training: the VLM predicts FAST tokens, the action expert predicts continuous actions, and a stop-gradient prevents action-expert gradients from flowing into the VLM. Up to 4 images at 448×448.

π*0.6 (RECAP RL on top): Adds a binarized advantage indicator I_t to the prompt (CFGRL-style). (1) Train a multi-task value function via cross-entropy on discretized MC returns. (2) Set a per-task threshold ε at the ~30–40th percentile. (3) Offline RL pre-train → SFT → iterate: autonomous rollouts + HG-DAgger interventions → retrain value + policy. (4) Sparse reward: −1 per step, −C_fail on failure, 0 on success; the value is the expected negative steps-to-completion.

π0.7: Diverse prompt conditioning: detailed language + subtask ℓ̂ + multi-view subgoal images + metadata (speed/quality/mistake flag) + control mode. Each component is randomly dropped during training → flexible inference. Language coaching: humans give step-by-step instructions to teach new long-horizon tasks without action data, then distill them into a high-level policy. Trained on much more diverse data, including suboptimal autonomous rollouts (incl. π*0.6 RL training data), failures, and egocentric human video. Subgoals refreshed on subtask change or every 4 s.
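The stop-gradient at the heart of Knowledge Insulation can be seen in miniature with a toy scalar autodiff (micrograd-style, not the real training stack): the FAST loss sees the VLM feature h directly, the action loss sees sg(h), so only the FAST term reaches the VLM weight. All numbers are invented; α = 10.0 borrows π0.5's post-training loss weight.

```python
class V:
    """Tiny reverse-mode scalar autodiff node (illustration only)."""
    def __init__(self, data, parents=()):
        self.data, self.grad, self.parents = data, 0.0, parents
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, V) else V(other)
        out = V(self.data + other.data, (self, other))
        def _b():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _b
        return out

    def __mul__(self, other):
        other = other if isinstance(other, V) else V(other)
        out = V(self.data * other.data, (self, other))
        def _b():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _b
        return out

    def backward(self):
        topo, seen = [], set()
        def build(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v.parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

def stop_gradient(x):
    # Same value, no parents: nothing flows back through this node.
    return V(x.data)

# One "VLM" weight produces a feature h. The FAST head consumes h directly;
# the action expert consumes sg(h), so its loss cannot update the VLM.
w_vlm = V(2.0)
h = w_vlm * 3.0                         # toy VLM feature
loss_fast = h * h                       # toy discrete-token (FAST) loss
loss_action = stop_gradient(h) * 5.0    # toy flow-matching loss, insulated
total = loss_fast + loss_action * 10.0  # alpha = 10.0, pi-0.5's post-train weight
total.backward()
print(w_vlm.grad)  # 36.0: only the FAST term contributes (2*h * dh/dw = 12*3)
```

In the real recipe the same principle holds per-token in a transformer: action-expert gradients are cut before they can perturb the VLM's web-pretrained representations.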
Core contributions

π0: First flow-matching VLA with high-frequency action chunks; a pre-training/post-training recipe analogous to LLMs; ~10,000 hours, the largest robot-learning experiment at the time.

π0.5: First end-to-end system doing 10–15 min dexterous tasks in entirely new homes. Co-training recipe in which every ablation hurts. A 104-location model matches one trained directly on the test homes.

π0.6: Specialist-level out-of-box performance without task-specific fine-tuning. Folds laundry reliably + fully assembles boxes 20% of the time out of the box (tasks that previously needed fine-tuning for non-zero success). ~2× throughput vs π0.5. 63 ms/chunk on a single H100.

π*0.6: First general RL recipe that scales to large flow-matching VLAs via advantage conditioning. Handles good + bad data together. >2× throughput, ~2× failure reduction on the hardest tasks. 13 h of continuous espresso making, factory-grade box assembly. Beats AWR and PPO baselines.

π0.7: Strong signs of compositional generalization, the “grand challenge”:
(1) Out-of-box specialist-level dexterity matching π*0.6 RL specialists, with no RL or fine-tuning needed.
(2) Zero-shot cross-embodiment transfer: folds t-shirts on a bimanual UR5e (never trained for this), matching expert teleoperators (85.6%/80% vs 90.9%/80.6%).
(3) Discovers new manipulation strategies suited to the target embodiment (vertical grasps on the UR5e).
(4) New short-horizon tasks out of the box.
(5) New long-horizon tasks via coaching → autonomous distillation.
(6) Breaks dataset biases (reverse bussing).
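π*0.6's sparse reward, Monte-Carlo return, and binarized advantage indicator can be sketched in a few lines. The reward shape (−1 per step, −C_fail on failure, 0 on success) and the ~30th-percentile threshold follow the card; the episode data, C_fail = 50, and the nearest-rank percentile are invented simplifications.

```python
def mc_return(num_steps, failed, c_fail=50.0):
    """Undiscounted return of one episode: -1 per step, -C_fail if it fails."""
    return -float(num_steps) - (c_fail if failed else 0.0)

def percentile_threshold(returns, pct=30):
    """Per-task threshold epsilon at the given percentile (nearest-rank, simplified)."""
    ordered = sorted(returns)
    idx = min(len(ordered) - 1, int(pct / 100 * len(ordered)))
    return ordered[idx]

def advantage_indicator(ret, eps):
    """I_t = 1 iff the episode's return beats the task threshold."""
    return 1 if ret > eps else 0

# Hypothetical task: episodes as (steps, failed?) pairs.
episodes = [(120, False), (90, False), (200, True), (150, False), (300, True)]
returns = [mc_return(s, f) for s, f in episodes]
eps = percentile_threshold(returns, pct=30)
labels = [advantage_indicator(r, eps) for r in returns]
print(returns, eps, labels)
```

The labels then condition the policy prompt, which is how RECAP trains on good and bad data together rather than filtering the bad data out.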
Results & shortcomings

π0: Beat OpenVLA/Octo by large margins on shirt folding, bussing, and bagging. Fine-tuning needed for laundry/boxes. Shortcomings: not all tasks work reliably, data composition unclear, positive transfer to navigation/locomotion unexplored, language following relies on sub-task decomposition.

π0.5: Strong on dishes-in-sink, items-in-drawer, laundry, and make-bed in 3 real unseen homes. Shortcomings: persistent challenges with unfamiliar affordances (weird handles, hard-to-open cabinets), partial observability (arm occluding a spill), high-level inference loops (opening/closing a drawer repeatedly), only simple prompts, modest context window.

π0.6: Big gains on shirt/laundry folding, box assembly, table bussing, mobile tasks, and generalization benchmarks. Shortcomings: still relies on fine-tuning for some specialists; the card doesn't deeply analyze failure modes.

π*0.6: Espresso, box assembly, and laundry all reach 90%+. Shortcomings: not fully autonomous (needs human reward labels, interventions, resets). Naive exploration. Iterated offline RL (not concurrent online). Corrections don't fix overall speed or subtle behaviors. MC on-policy value estimation could benefit from off-policy Q-learning.

π0.7: Seen tasks often >90%; unseen tasks / novel embodiment combos 60–80%. Mixed-quality data scaling: without metadata, more/noisier data hurts; with metadata it keeps improving. Shortcomings: unseen success rates still well below in-distribution. Very hard to define what's “truly unseen” in such a diverse dataset. World-model inference is expensive (1.25 s/subgoal with 4× H100 + 8-bit + SageAttention). Only manipulation; no navigation/locomotion claims.
Diff vs prev

π0: Foundation paper.

π0.5 vs π0: Hierarchical inference in the same model (π0 used an external VLM). Introduced FAST + hybrid training (π0 was pure flow matching). Mobile-manipulation focus. Web co-training. Verbal-instruction demos as a novel supervision modality.

π0.6 vs π0.5: Gemma 2.6B → 4B. Action expert 300M → 860M (~3×). Metadata conditioning. Full KI recipe.

π*0.6 vs π0.6: Adds the advantage indicator token in the prompt + a 670M value function + the full RL training loop (offline RL pre-train → SFT → K iterations of rollouts + retraining). Policy architecture identical; a different checkpoint produces task-specific RL specialists.

π0.7 vs π0.6 / π*0.6: Multi-modal prompt (language + subgoals + metadata + control mode). MEM memory encoder. Separate BAGEL-based world model. Egocentric human video + suboptimal autonomous data + π*0.6 RL rollouts in training. Training-time RTC for latency tolerance. Emphasis on steerability and emergent generalization rather than pure throughput/robustness.
Appendix

π0: Blockwise causal mask with 3 blocks ([images+text], [state], [noisy actions]). Action expert: width=1024, mlp_dim=4096. Beta(1.5, 1) timestep distribution with s=0.999 (supports up to 1,000 integration steps). Temporal ensembling was tried and hurt performance; chunks are executed open-loop. 73 ms onboard / 86 ms off-board on an RTX 4090.

π0.5: Action-expert tokens do NOT attend to FAST tokens (avoids information leakage between the two action representations). Adaptive RMSNorm for τ injection (vs. input fusion in π0). α=10.0 in post-training. Image aug: crop 0.95×, rotate ±5°, color jitter.

π0.6: No formal appendix (4-page card). Image tokens bidirectional, text tokens now causal.

π*0.6: Advantage dropout 30% for test-time CFG. β ∈ [1.5, 2.5]; high β pushes actions toward the support boundaries (aggressive). The PPO baseline needed an SPO-style constraint with η=0.01 for stability. Detailed flow-matching ELBO decomposition (AR + diffusion). Data per task: T-shirt laundry uses autonomous-only data (no corrections); diverse laundry uses 450 autonomous + 287 correction episodes; box assembly uses 600 autonomous + 360 correction episodes per iteration on 3 robots; the cafe uses 414 autonomous + 429 correction episodes.

π0.7: Subgoal sampling: 25% of examples have subgoal images; within those, 25% use the end-of-segment frame, 75% sample 0–4 s ahead uniformly. The subtask instruction is dropped 30% of the time when an image is present. Metadata is dropped entirely 15% of the time; each component an additional 5% individually. State uses a linear projection, not text tokenization (a change from π0.6). Minimal variant: 38 ms/chunk on a single H100; 127 ms with MEM + subgoals. World model: 4× H100 tensor-parallel, 8-bit matmuls, SageAttention → 1.25 s per subgoal, 25 denoising steps. Ablation on prior models (π0.5, π0.6): EE control showed no substantial advantage over joint control on cross-embodiment tasks, so the main cross-embodiment experiments use joint-space control.
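The 30% advantage dropout exists so the same network can be queried with and without the advantage token at test time; classifier-free guidance then extrapolates from the unconditioned prediction toward the conditioned one by factor β. A minimal sketch with invented per-dimension predictions (in the real system these are the model's action-velocity outputs):

```python
def cfg_combine(v_uncond, v_cond, beta):
    """Classifier-free guidance: push the prediction toward the conditioned
    direction by factor beta. beta = 1 recovers the conditioned prediction;
    beta > 1 extrapolates, which (per the pi*-0.6 notes) grows aggressive
    near the support boundaries."""
    return [u + beta * (c - u) for u, c in zip(v_uncond, v_cond)]

# Hypothetical predictions for a 3-dof action-velocity:
v_uncond = [0.0, 0.2, -0.1]  # prompt without the advantage token
v_cond = [0.1, 0.4, -0.3]    # prompt with I_t = 1 ("good outcome") token
print(cfg_combine(v_uncond, v_cond, beta=2.0))
```

The card's β ∈ [1.5, 2.5] for π*0.6 (and the moderate {1.3, 1.7, 2.2} for π0.7) sits in the extrapolation regime, trading conservatism for decisiveness.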

Model & training parameters

Concrete numbers pulled from each paper: architecture configs, training recipe, data composition, inference latency, and RL/world-model specifics. Dashes mean the component doesn’t exist for that generation.
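One recipe detail worth making concrete before the tables: π0.5 normalizes each action dimension to [−1, 1] via its 1st/99th-percentile statistics. A simplified nearest-rank sketch; the helper names and the clipping of out-of-range values are our assumptions:

```python
def quantile_stats(values, lo_pct=1.0, hi_pct=99.0):
    """Per-dimension 1st/99th-percentile bounds (nearest-rank, simplified)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[min(n - 1, int(lo_pct / 100 * n))]
    hi = ordered[min(n - 1, int(hi_pct / 100 * n))]
    return lo, hi

def normalize(x, lo, hi):
    """Affine map [lo, hi] -> [-1, 1]; values beyond the quantiles clip
    (whether the paper clips is our assumption)."""
    y = 2.0 * (x - lo) / (hi - lo) - 1.0
    return max(-1.0, min(1.0, y))

# One hypothetical joint-angle dimension with outliers at both ends:
dim = [-5.0] + [0.01 * i for i in range(100)] + [50.0]
lo, hi = quantile_stats(dim)
print(lo, hi, normalize(0.5, lo, hi))
```

Using quantiles rather than the min/max keeps a single teleoperation glitch from squashing the usable action range.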

Generations: π0 · π0.5 · π0.6 / π*0.6 · π0.7

Architecture

Total params
  π0: 3.3B
  π0.5: 400M + 2.6B + 300M (total not stated)
  π0.6: 400M + 4B + 860M (total not stated). π*0.6: + 670M value function
  π0.7: ~5B + separate 14B world model

VLM backbone
  π0: PaliGemma 3B (SigLIP 400M + Gemma 2B)
  π0.5: SigLIP 400M + Gemma 2.6B
  π0.6: Gemma 3 4B (policy). π*0.6: value function uses the same Gemma 3 + SigLIP 400M architecture
  π0.7: Gemma 3 4B (incl. 400M SigLIP)

Action expert
  π0: 300M
  π0.5: 300M
  π0.6: 860M (same depth as the backbone); identical in π*0.6
  π0.7: 860M

Action expert config
  π0: width=1024, mlp_dim=4096
  π0.5: width=1024, mlp_dim=4096
  π0.6: not stated
  π0.7: not stated

Image resolution
  π0: not stated
  π0.5: not stated
  π0.6: 448×448
  π0.7: 448×448 (VAE inputs 512×384 in the world model)

Max cameras
  π0: 2–3 per robot
  π0.5: up to 4 (front / back / 2× wrist)
  π0.6: up to 4 (base / 2× wrist / optional back). π*0.6: experiments use 3 (base + 2× wrist) on a static bimanual
  π0.7: up to 4 + up to 3 subgoal images

History frames
  π0: —
  π0.5: —
  π0.6: —
  π0.7: 6 @ 1 s stride (MEM encoder)

State encoding
  π0: linear projection
  π0.5: discretized text tokens
  π0.6: discretized text tokens
  π0.7: linear projection

Attention pattern
  π0: blockwise causal, 3 blocks: [images+text] [state] [actions]
  π0.5: prefix mask on images/prompt/state; FAST causal on prefix + prior FAST; action expert bidirectional on prefix, no attending to FAST
  π0.6: bidirectional images, causal text, bidirectional action tokens. π*0.6: adds the advantage indicator token in the prompt
  π0.7: block-causal: obs + subgoal bidirectional within blocks; goal images attend to obs; text causal

Action chunk length
  π0: 50 (H=50 in the paper)
  π0.5: 50 (H=49 in the paper)
  π0.6: not stated
  π0.7: 50 (15–25 steps executed per chunk)

Control frequency
  π0: up to 50 Hz (20 Hz on UR5e/Franka)
  π0.5: 50 Hz
  π0.6: not stated. π*0.6 experiments: 50 Hz (static bimanual, joint-space)
  π0.7: 50 Hz (20 Hz on UR5e)
Training recipe

Objective
  π0: conditional flow matching
  π0.5: hybrid: FAST tokens (pre-train) + flow matching (post-train)
  π0.6: Knowledge Insulation (FAST in the VLM, flow in the action expert, stop-gradient). π*0.6: adds an advantage-conditioned policy (CFGRL-style)
  π0.7: KI + diverse prompt dropout

Stages
  π0: pre-train + task-specific post-train
  π0.5: pre-train 280k steps (discrete only, α=0) + post-train 80k steps (adds flow, α=10)
  π0.6: single-stage KI. π*0.6: offline RL pre-train → SFT → K iterations of { collect data, retrain V, retrain π }
  π0.7: not stated

Timestep distribution
  π0: Beta((s−τ)/s; 1.5, 1), s=0.999
  π0.5: same as π0
  π0.6: not stated
  π0.7: not stated

Timestep injection
  π0: MLP fused into the action token embedding
  π0.5: separate MLP + adaptive RMSNorm per layer
  π0.6: not stated
  π0.7: adaptive RMSNorm

Loss weight α (post)
  π0: —
  π0.5: 10.0
  π0.6: —
  π0.7: —

Image augmentation
  π0: not stated
  π0.5: RandomCrop 0.95×, Rotate ±5°, ColorJitter(0.3/0.4/0.5)
  π0.6: not stated
  π0.7: not stated

Action normalization
  π0: not stated
  π0.5: [−1, 1] via 1st/99th-percentile quantiles per dimension
  π0.6: not stated
  π0.7: not stated
Data

Own robot data
  π0: ~10,000 h / 903M timesteps / 7 robot configs / 68 tasks
  π0.5: ~400 h mobile manipulation (2.4% of the pre-training mixture) + diverse non-mobile + lab cross-embodiment
  π0.6: “largely inherits” the π0.5 composition. π*0.6: adds on-policy rollouts per task: diverse laundry (450 auto + 287 correction), box (600 auto + 360 correction per iteration on 3 robots), cafe (414 auto + 429 correction); T-shirt laundry is auto-only
  π0.7: demonstrations + autonomous rollouts (incl. π*0.6 RL data) + failures + egocentric human video

External data
  π0: OXE (9.1%), Bridge v2, DROID
  π0.5: OXE + multimodal web
  π0.6: same as π0.5 (per the card)
  π0.7: same + open-source image-editing + open video datasets (for the world model)

Web co-training tasks
  π0: —
  π0.5: captioning, VQA, object localization (bounding box / keypoint)
  π0.6: bounding box + keypoint prediction + general multimodal web
  π0.7: same + video captioning (robot + web)

Metadata in prompt
  π0: —
  π0.5: —
  π0.6: “conditioning metadata” in the prompt (content not specified in the card). π*0.6: adds the binarized advantage indicator I_t
  π0.7: overall speed (500-step bins, e.g. “2000”) + quality (1–5) + mistake flag (bool) + control mode (joint/ee)

Language supervision
  π0: task names + ~2 s segment annotations
  π0.5: + high-level subtask labels + verbal instructions
  π0.6: not stated
  π0.7: + more detailed language + step-by-step human coaching
Inference

Denoising steps
  π0: 10
  π0.5: 10
  π0.6: 5
  π0.7: 5

Chunk execution
  π0: open-loop; inference every 0.8 s at 20 Hz (after 16 actions) / every 0.5 s at 50 Hz (after 25 actions); temporal ensembling was tried and hurt performance
  π0.5: not stated
  π0.6: not stated
  π0.7: async with training-time RTC (0–12 step delay, up to 240 ms on a 50 Hz robot)

Latency
  π0: 73 ms onboard / 86 ms off-board (RTX 4090, 3 cams)
  π0.5: not stated
  π0.6: 63 ms on a single H100 (3 cams, 5 denoising steps)
  π0.7: 38 ms minimal variant / 127 ms with MEM + subgoals (single H100)

Classifier-free guidance
  π0: —
  π0.5: —
  π0.6: — for π0.6. π*0.6: β ∈ [1.5, 2.5] on episode metadata
  π0.7: β ∈ {1.3, 1.7, 2.2} (moderate values) on any part of the prompt
RL (π*0.6) / world model (π0.7)

Value function
  π*0.6: 670M VLM (Gemma 3 + SigLIP 400M) + value head, 201 discretized return bins, co-trained on web data

Reward
  π*0.6: r_t = 0 (success at step T) / −C_fail (failure at step T) / −1 otherwise; value normalized to [−1, 0] per task

Advantage estimation
  π*0.6: N=50-step lookahead (post-train); full-trajectory T-step (pre-train)

Advantage threshold ε_ℓ
  π*0.6: ~30th percentile (pre-train) / ~40th (fine-tune) / 10th (strict T-shirt task)

Advantage dropout
  π*0.6: 30% (enables test-time CFG)

World model
  π0.7: 14B BAGEL-init (7B LLM backbone + 7B generation backbone, ViT 448×336, VAE 512×384), 25 denoising steps, 1.25 s/subgoal on 4× H100 with 8-bit matmuls + SageAttention

Subgoal sampling
  π0.7: 25% of training examples include subgoal images; within those, 25% use the end-of-segment frame, 75% sample uniformly 0–4 s ahead. Subgoals refreshed every Δ=4 s or on subtask change.

Prompt dropout
  π0.7: subtask instruction 30% (when image present) · entire metadata 15% · each metadata component +5% · history frames 30% · rear view 30%
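π0.7's subgoal-sampling and prompt-dropout schedule amounts to a per-example sampler over prompt configurations. The probabilities below come from the card; the dict structure, field names, and the coupling of component drops to the all-metadata drop are our illustrative assumptions:

```python
import random

def sample_prompt_config(rng):
    """Draw one training example's prompt configuration (pi-0.7 percentages)."""
    cfg = {}
    cfg["has_subgoal"] = rng.random() < 0.25
    if cfg["has_subgoal"]:
        # 25% end-of-segment frame (marked None), else uniformly 0-4 s ahead.
        cfg["subgoal_ahead_s"] = None if rng.random() < 0.25 else rng.uniform(0.0, 4.0)
        cfg["drop_subtask_instruction"] = rng.random() < 0.30
    else:
        cfg["drop_subtask_instruction"] = False
    cfg["drop_all_metadata"] = rng.random() < 0.15
    cfg["drop_component"] = {
        k: cfg["drop_all_metadata"] or rng.random() < 0.05
        for k in ("speed", "quality", "mistake_flag", "control_mode")
    }
    cfg["drop_history"] = rng.random() < 0.30
    cfg["drop_rear_view"] = rng.random() < 0.30
    return cfg

rng = random.Random(0)
draws = [sample_prompt_config(rng) for _ in range(20000)]
frac_subgoal = sum(d["has_subgoal"] for d in draws) / len(draws)
print(round(frac_subgoal, 3))  # should hover near 0.25
```

Randomly dropping each conditioning signal during training is what lets a single checkpoint accept any subset of language, subgoals, and metadata at inference time.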