Physical Intelligence’s VLA lineage, 2024–2026. Each generation compared across architecture, training recipe, contributions, and limitations.
Backbone: Gemma 2B → 2.6B → 4B (stable at 4B from π0.6 onward). Action expert: 300M → 300M → 860M (the big jump comes at π0.6).
Training recipe: pure flow matching (π0) → hybrid FAST + flow, two-stage (π0.5) → Knowledge Insulation with stop-gradient (π0.6+). KI is the structural backbone for everything after.
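The Knowledge Insulation split described above can be sketched in a few lines. This is a minimal JAX sketch with toy linear layers standing in for the VLM and action expert (all names, shapes, and data are hypothetical, not the papers' implementation): the VLM branch gets a cross-entropy loss on discrete FAST tokens, while the action expert's flow-matching loss reads VLM features through `jax.lax.stop_gradient`, so expert gradients never update the backbone.

```python
import jax
import jax.numpy as jnp

def ki_losses(vlm_params, expert_params, batch):
    """Return (fast_ce, flow_mse); the training loss is their sum.

    Toy linear layers stand in for the real transformer branches.
    """
    # "VLM": features + FAST-token logits.
    feats = jnp.tanh(batch["obs"] @ vlm_params["w"])                  # (B, D)
    logits = feats @ vlm_params["head"]                               # (B, V)
    log_p = jax.nn.log_softmax(logits)
    fast_ce = -jnp.mean(jnp.take_along_axis(log_p, batch["fast"][:, None], axis=-1))

    # KI barrier: the expert reads features but cannot update the VLM.
    insulated = jax.lax.stop_gradient(feats)

    # "Action expert": flow matching on a linear noise-to-action path,
    # whose velocity target is simply (actions - noise).
    tau = batch["tau"][:, None]
    noised = tau * batch["actions"] + (1.0 - tau) * batch["noise"]
    pred = jnp.concatenate([insulated, noised], axis=-1) @ expert_params["w"]
    flow_mse = jnp.mean((pred - (batch["actions"] - batch["noise"])) ** 2)
    return fast_ce, flow_mse

# Toy data purely for illustration.
B, Din, D, V, A = 4, 8, 16, 32, 6
vlm_params = {"w": jnp.full((Din, D), 0.1), "head": jnp.full((D, V), 0.1)}
expert_params = {"w": jnp.full((D + A, A), 0.1)}
ks = jax.random.split(jax.random.PRNGKey(0), 4)
batch = {
    "obs": jax.random.normal(ks[0], (B, Din)),
    "fast": jnp.zeros((B,), dtype=jnp.int32),
    "tau": jax.random.uniform(ks[1], (B,)),
    "actions": jax.random.normal(ks[2], (B, A)),
    "noise": jax.random.normal(ks[3], (B, A)),
}
# The flow loss contributes exactly zero gradient to the VLM weights:
flow_grads = jax.grad(lambda vp: ki_losses(vp, expert_params, batch)[1])(vlm_params)
```

The point of the stop-gradient is visible directly: `flow_grads` is identically zero, so only the FAST cross-entropy shapes the backbone, which is why KI keeps the VLM's knowledge "insulated" from noisy continuous-action gradients.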
Supervision evolution: demos-only → + hierarchy/verbal/web → + metadata → + RL reward + advantage conditioning + interventions → + subgoal images + diverse “how” metadata + autonomous/failure data + egocentric human video.
Philosophy shift per generation: π0.5: “can we generalize to new homes?” → π0.6: “can we get specialist-level out-of-box performance?” → π*0.6: “can VLAs learn from experience via RL?” → π0.7: “can we get compositional generalization via rich prompting?”
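The advantage conditioning in the π*0.6 recipe decodes with classifier-free-guidance-style extrapolation. A one-line numpy sketch of the standard CFG mixing rule (inputs are hypothetical; the real models apply this to the action expert's velocity predictions, with the β ranges quoted later in the tables):

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, beta):
    """CFG-style extrapolation: beta = 1 recovers the conditioned
    prediction; beta > 1 (e.g. the 1.5-2.5 range quoted for pi*0.6)
    pushes further along the conditioned direction."""
    return v_uncond + beta * (v_cond - v_uncond)

v = guided_velocity(np.array([2.0]), np.array([1.0]), beta=2.0)  # -> [3.0]
```

This is also why the advantage indicator is dropped for a fraction of training examples: the model must learn both the conditioned and unconditioned predictions for the mixing rule to have two terms to extrapolate between.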
| | π0 (2024): Foundation paper | π0.5 (2025): Generalize to new homes | π0.6 / π*0.6 (Nov 2025): Two separate papers, same policy architecture. π0.6 (Nov 17) is the base model trained with supervised KI. π*0.6 (Nov 19) takes π0.6's weights and further trains them with RECAP — offline RL pre-train → SFT → iterative rollouts + interventions, plus a training-only 670M value function. Different checkpoints, but one release cycle. Specialist-level out of the box + RECAP RL recipe | π0.7 (2026): Compositional generalization |
|---|---|---|---|---|
| Size | 3.3B: PaliGemma 3B (SigLIP 400M + Gemma 2B) + 300M action expert | SigLIP 400M + Gemma 2.6B + 300M action expert. Two-stage: pre-train 280k steps (discrete only) → post-train 80k steps (adds flow matching). | SigLIP 400M + Gemma 3 4B + 860M action expert (same depth as backbone). π*0.6 adds a separate 670M distributional value function (Gemma 3 + SigLIP 400M, 201 value bins) for RL; identical policy architecture. | ~5B: Gemma 3 4B + 400M SigLIP + 860M action expert + MEM video history encoder (6 history frames @ 1s stride, compressed to single-frame token count). Up to 4 cameras + up to 3 subgoal images @ 448×448. Separate 14B BAGEL-initialized world model for subgoal generation. |
| Key ideas | VLM + separate action expert MoE branch; conditional flow matching for continuous action chunks (H=50, 10 integration steps, 50 Hz); cross-embodiment pre-training | Heterogeneous co-training: mobile (~400 h, ~2.4% of pre-training mixture), non-mobile diverse env (ME), lab cross-embod (CE), high-level subtask (HL), verbal instructions (VI), web data (WD). Hierarchical inference in one unified model. Hybrid training: FAST discrete tokens in pre-training, flow matching in post-training. | π0.6 (base): keeps the π0.5 hierarchy. Bigger backbone + bigger action expert. Metadata conditioning in the prompt. Full Knowledge Insulation (KI) training: VLM predicts FAST tokens, action expert predicts continuous actions, stop-gradient prevents action-expert gradients flowing into the VLM. Up to 4 images at 448×448. π*0.6 (RECAP RL on top): adds binarized advantage indicator I_t to the prompt (CFGRL-style). (1) Train a multi-task value function via cross-entropy on discretized MC returns. (2) Per-task threshold ε_ℓ at ~30–40th percentile. (3) Offline RL pre-train → SFT → iterate: autonomous rollouts + HG-DAgger interventions → retrain value + policy. (4) Sparse reward: −1 per step, −C_fail on failure, 0 on success; value = expected negative steps-to-completion. | Diverse prompt conditioning: detailed language + subtask ℓ̂ + multi-view subgoal images + metadata (speed/quality/mistake flag) + control mode. Each component randomly dropped during training → flexible inference. Language coaching: humans give step-by-step instructions to teach new long-horizon tasks without action data, then distill into a high-level policy. Trains on much more diverse data including suboptimal autonomous rollouts (incl. π*0.6 RL training data), failures, and egocentric human video. Subgoals refreshed on subtask change or every 4 s. |
| Core contributions | First flow-matching VLA with high-frequency chunks; pre-training/post-training recipe analogous to LLMs; ~10,000 hours — largest robot experiment at the time | First end-to-end system doing 10–15 min dexterous tasks in entirely new homes. Co-training recipe — each ablation hurts. 104-location model matches one trained directly on test homes. | π0.6: specialist-level out-of-box performance without task-specific fine-tuning. Folds laundry reliably + fully assembles boxes 20% out of box (previously needed fine-tuning for non-zero success). ~2× throughput vs π0.5. 63 ms/chunk on single H100. π*0.6: first general RL recipe that scales to large flow-matching VLAs via advantage conditioning. Handles good + bad data together. >2× throughput, ~2× failure reduction on hardest tasks. 13 h continuous espresso, factory-grade box assembly. Beats AWR and PPO baselines. | Strong signs of compositional generalization — the “grand challenge”. (1) Out-of-box specialist-level dexterity matching π*0.6 RL specialists, no RL or fine-tuning needed. (2) Zero-shot cross-embodiment transfer — folds t-shirts on bimanual UR5e (never trained for this), matching expert teleoperators (85.6%/80% vs 90.9%/80.6%). (3) Discovers new manipulation strategies suited to target embodiment (vertical grasps on UR5e). (4) New short-horizon tasks out of box. (5) New long-horizon tasks via coaching → autonomous distillation. (6) Breaks dataset biases (reverse bussing). |
| Results & shortcomings | Beat OpenVLA/Octo by large margins on shirt folding, bussing, bagging. Fine-tuning needed for laundry/boxes. Shortcomings: not all tasks work reliably, data composition unclear, positive transfer to navigation/locomotion unexplored, language following relies on sub-task decomposition. | Strong on dishes-in-sink, items-in-drawer, laundry, make-bed in 3 real unseen homes. Shortcomings: persistent challenges with unfamiliar affordances (weird handles, hard-to-open cabinets), partial observability (arm occluding spill), high-level inference loops (open/close drawer repeatedly), only simple prompts, modest context window. | π0.6: big gains on shirt/laundry folding, box assembly, table bussing, mobile tasks, generalization benchmarks. Shortcomings: still relies on fine-tuning for some specialists; card doesn't deeply analyze failure modes. π*0.6: espresso, box assembly, laundry all reach 90%+. Shortcomings: not fully autonomous — needs human reward labels, interventions, resets. Naive exploration. Iterated offline RL (not concurrent online). Corrections don't fix overall speed or subtle behaviors. MC on-policy value — could benefit from off-policy Q-learning. | Seen tasks often >90%; unseen tasks / novel embodiment combos 60–80%. Mixed-quality data scaling: without metadata, more/noisier data hurts; with metadata it keeps improving. Shortcomings: unseen success rate still well below in-dist. Very hard to define what's “truly unseen” in such a diverse dataset. World model inference is expensive (1.25 s/subgoal with 4×H100 + 8-bit + SageAttention). Only manipulation, no navigation/locomotion claims. |
| Diff vs prev | Foundation paper. | Hierarchical inference in the same model (π0 used external VLM). Introduced FAST + hybrid training (π0 was pure flow matching). Mobile-manipulation focus. Web co-training. Verbal instruction demos as novel supervision modality. | π0.6 vs π0.5: Gemma 2.6B → 4B. Action expert 300M → 860M (~3×). Metadata conditioning. Full KI recipe. π*0.6 vs π0.6: adds advantage indicator token in the prompt + 670M value function + full RL training loop (offline RL pre-train → SFT → K iterations of rollouts + retraining). Policy architecture identical; different checkpoint produces task-specific RL specialists. | Multi-modal prompt (language + subgoals + metadata + control mode). MEM memory encoder. Separate BAGEL-based world model. Egocentric human video + suboptimal autonomous data + π*0.6 RL rollouts in training. Training-time RTC for latency tolerance. Emphasis on steerability and emergent generalization rather than pure throughput/robustness. |
| Appendix | Blockwise causal mask with 3 blocks ([images+text], [state], [noisy actions]). Action expert: width=1024, mlp_dim=4096. Beta(1.5,1) timestep with s=0.999 (up to 1,000 integration steps). Temporal ensembling was tried and hurt performance — chunks executed open-loop. 73 ms onboard / 86 ms off-board on RTX 4090. | Action expert tokens do NOT attend to FAST tokens (avoids info leakage between two action representations). Adaptive RMSNorm for τ injection (vs. input fusion in π0). α=10.0 post-training. Image aug: crop 0.95×, rotate ±5°, color jitter. | π0.6 card: no formal appendix — 4-page card. Image tokens bidirectional, text tokens now causal. π*0.6: advantage dropout 30% for test-time CFG. β ∈ [1.5, 2.5] — high β pushes actions to support boundaries (aggressive). PPO baseline needed SPO-style constraint with η=0.01 for stability. Detailed flow-matching ELBO decomposition (AR + diffusion). Data per task: T-shirt laundry uses autonomous-only data (no corrections); diverse laundry uses 450 autonomous + 287 correction eps; box assembly uses 600 autonomous + 360 correction eps/iter on 3 robots; cafe uses 414 autonomous + 429 correction eps. | Subgoal sampling: 25% of examples have subgoal images; within those, 25% use end-of-segment, 75% sample 0–4s ahead uniformly. Subtask instruction dropped 30% when image is present. Metadata dropped 15% entirely; each component +5% individually. State uses linear projection, not text tokenization (change from π0.6). Minimal variant: 38 ms/chunk on single H100; 127 ms with MEM + subgoal. World model: 4×H100 tensor-parallel, 8-bit matmuls, SageAttention → 1.25 s per subgoal, 25 denoising steps. Ablation on prior models (π0.5, π0.6): EE control showed no substantial advantage over joint control on cross-embodiment tasks, so main cross-embodiment experiments use joint-space control. |
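The flow-matching decoding referenced throughout the table (10 integration steps for π0/π0.5, 5 for π0.6/π0.7) is plain ODE integration from Gaussian noise to an action chunk. A numpy sketch under stated assumptions: `velocity_fn` is a hypothetical stand-in for the action expert, and the oracle below is the exact velocity field of a straight-line path, which Euler integration follows to the target; it is not the learned model.

```python
import numpy as np

def sample_action_chunk(velocity_fn, horizon=50, action_dim=2, steps=10, seed=0):
    """Euler-integrate a velocity field from noise (tau=0) to an
    action chunk (tau=1), mirroring few-step flow-matching decoding."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((horizon, action_dim))   # start from noise
    dt = 1.0 / steps
    for k in range(steps):
        a = a + dt * velocity_fn(a, k * dt)          # one Euler step
    return a

# Oracle velocity for a straight-line path to `target`: integrating it
# recovers the target chunk, illustrating why so few steps suffice.
target = np.ones((50, 2))
oracle = lambda a, tau: (target - a) / (1.0 - tau)
chunk = sample_action_chunk(oracle)
```

With the linear interpolation path used in these models, the ideal velocity is constant along each trajectory, which is what makes 5 to 10 Euler steps enough in practice.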
Concrete numbers pulled from each paper: architecture configs, training recipe, data composition, inference latency, and RL/world-model specifics. Dashes mean the component doesn’t exist for that generation.
| | π0 | π0.5 | π0.6 / π*0.6 | π0.7 |
|---|---|---|---|---|
| Architecture | | | | |
| Total params | 3.3B | 400M + 2.6B + 300M (total not stated) | 400M + 4B + 860M (total not stated). π*0.6: + 670M value function | ~5B + separate 14B world model |
| VLM backbone | PaliGemma 3B (SigLIP 400M + Gemma 2B) | SigLIP 400M + Gemma 2.6B | Gemma 3 4B (policy). π*0.6: value function uses same Gemma 3 + SigLIP 400M architecture | Gemma 3 4B (incl. 400M SigLIP) |
| Action expert | 300M | 300M | 860M (same depth as backbone); identical in π*0.6 | 860M |
| Action expert config | width=1024, mlp_dim=4096 | width=1024, mlp_dim=4096 | not stated | not stated |
| Image resolution | not stated | not stated | 448×448 | 448×448 (VAE inputs 512×384 in WM) |
| Max cameras | 2–3 per robot | up to 4 (front/back/2× wrist) | up to 4 (base/2× wrist/optional back). π*0.6: experiments use 3 (base + 2× wrist) on static bimanual | up to 4 + up to 3 subgoal images |
| History frames | — | — | — | 6 @ 1s stride (MEM encoder) |
| State encoding | linear projection | discretized text tokens | discretized text tokens | linear projection |
| Attention pattern | blockwise causal, 3 blocks: [images+text] [state] [actions] | prefix mask on images/prompt/state; FAST causal on prefix + prior FAST; action expert bidir on prefix, no attend to FAST | bidir images, causal text, bidir action tokens. π*0.6: adds advantage indicator token in prompt | block-causal: obs + subgoal bidir within; goal images attend obs; text causal |
| Action chunk length | 50 (H=50 in paper) | 50 (H=49 in paper) | not stated | 50 (exec 15–25 steps per chunk) |
| Control frequency | up to 50 Hz (20 Hz on UR5e/Franka) | 50 Hz | not stated. π*0.6: experiments: 50 Hz (static bimanual, joint) | 50 Hz (20 Hz on UR5e) |
| Training recipe | | | | |
| Objective | conditional flow matching | hybrid: FAST tokens (pre) + flow matching (post) | Knowledge Insulation (FAST in VLM, flow in action expert, stop-grad). π*0.6: adds advantage-conditioned policy (CFGRL-style) | KI + diverse prompt dropout |
| Stages | pre-train + task-specific post-train | pre-train 280k (discrete only, α=0) + post-train 80k (adds flow, α=10) | single-stage KI. π*0.6: offline RL pre-train → SFT → K iterations of { collect data, retrain V, retrain π } | not stated |
| Timestep distribution | Beta((s−τ)/s; 1.5, 1), s=0.999 | same as π0 | not stated | not stated |
| Timestep injection | MLP fused into action token embedding | separate MLP + adaptive RMSNorm per layer | not stated | adaptive RMSNorm |
| Loss weight α (post) | — | 10.0 | — | — |
| Image augmentation | not stated | RandomCrop 0.95×, Rotate ±5°, ColorJitter(0.3/0.4/0.5) | not stated | not stated |
| Action normalization | not stated | [−1, 1] via 1/99% quantile per dim | not stated | not stated |
| Data | | | | |
| Own robot data | ~10,000 h / 903M timesteps / 7 robot configs / 68 tasks | ~400 h mobile manipulation (2.4% of pre-training mixture) + diverse non-mobile + lab cross-embodiment | “largely inherits” π0.5 composition. π*0.6: adds on-policy rollouts per task: diverse laundry (450 auto + 287 correction), box (600 auto + 360 correction / iter on 3 robots), cafe (414 auto + 429 correction); T-shirt laundry is auto-only | demonstrations + autonomous rollouts (incl. π*0.6 RL data) + failures + egocentric human video |
| External data | OXE (9.1%), Bridge v2, DROID | OXE + multimodal web | same as π0.5 (per card) | same + open-source image-editing + open video datasets (for world model) |
| Web co-training tasks | — | captioning, VQA, object localization (bounding box / keypoint) | bounding box + keypoint prediction + general multi-modal web | same + video captioning (robot + web) |
| Metadata in prompt | — | — | “conditioning metadata” in prompt (content not specified in card). π*0.6: adds binarized advantage indicator I_t | overall speed (500-step bins, e.g. “2000”) + quality (1–5) + mistake (bool) + control mode (joint/ee) |
| Language supervision | task names + ~2s segment annotations | + high-level subtask labels + verbal instructions | not stated | + more detailed language + step-by-step human coaching |
| Inference | | | | |
| Denoising steps | 10 | 10 | 5 | 5 |
| Chunk execution | open-loop; inference every 0.8 s at 20 Hz (after 16 actions) / every 0.5 s at 50 Hz (after 25 actions); temporal ensembling was tried and hurt performance | not stated | not stated | async with training-time RTC (0–12 step delay, up to 240 ms on 50 Hz robot) |
| Latency | 73 ms onboard / 86 ms off-board (RTX 4090, 3 cams) | not stated | 63 ms on single H100 (3 cams, 5 denoising steps) | 38 ms minimal variant / 127 ms w/ MEM + subgoals (single H100) |
| Classifier-free guidance | — | — | — for π0.6. π*0.6: β ∈ [1.5, 2.5] on episode metadata | β ∈ {1.3, 1.7, 2.2} (moderate values) on any part of prompt |
| RL (π*0.6) / world model (π0.7) | | | | |
| Value function | — | — | π*0.6: 670M VLM (Gemma 3 + SigLIP 400M) + value head, 201 discretized return bins, co-trained on web data | — |
| Reward | — | — | π*0.6: r_t = 0 (success step T) / −C_fail (failure step T) / −1 otherwise; value normalized to [−1, 0] per task | — |
| Advantage estimation | — | — | π*0.6: N=50 step lookahead (post-train); full trajectory T-step (pre-train) | — |
| Advantage threshold ε_ℓ | — | — | π*0.6: ~30th percentile (pre-train) / ~40th (fine-tune) / 10th (strict T-shirt task) | — |
| Advantage dropout | — | — | π*0.6: 30% (enables test-time CFG) | — |
| World model | — | — | — | 14B BAGEL-init (7B LLM backbone + 7B generation backbone, ViT 448×336, VAE 512×384), 25 denoising steps, 1.25 s/subgoal on 4× H100 w/ 8-bit matmuls + SageAttention |
| Subgoal sampling | — | — | — | 25% of training examples include subgoal images; within those, 25% use end-of-segment, 75% sample uniformly 0–4 s ahead. Subgoals refreshed every Δ=4 s or on subtask change. |
| Prompt dropout (π0.7) | — | — | — | subtask instruction: 30% (when image present) · entire metadata: 15% · each metadata component: +5% · history frames: 30% · rear view: 30% |
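The π*0.6 rows above compose into a short pipeline: Monte-Carlo returns under the sparse reward, an advantage against a value baseline, a per-task percentile threshold ε_ℓ, and a binarized indicator. A numpy sketch under stated assumptions: `c_fail`, the value baselines, and the toy episodes are hypothetical (the card does not give the failure constant), and the real system uses a learned 670M value function rather than fixed baselines.

```python
import numpy as np

def mc_returns(lengths, failed, c_fail=10.0):
    """Return of the sparse reward (-1 per step, -c_fail on failure,
    0 on success): negative steps-to-completion minus any penalty."""
    lengths = np.asarray(lengths, dtype=float)
    failed = np.asarray(failed, dtype=bool)
    return -lengths - c_fail * failed

def advantage_indicator(returns, values, percentile=30.0):
    """Binarized advantage token I = 1[A > eps], with eps set at a
    per-task percentile of the advantage distribution (~30th for
    pre-training per the table above)."""
    adv = np.asarray(returns, dtype=float) - np.asarray(values, dtype=float)
    eps = np.percentile(adv, percentile)
    return (adv > eps).astype(int)

# Three toy episodes: success in 10 and 20 steps, one 30-step failure.
rets = mc_returns([10, 20, 30], [False, False, True])          # [-10, -20, -40]
flags = advantage_indicator(rets, values=[-25.0, -25.0, -25.0])  # [1, 1, 0]
```

Because the threshold is a percentile rather than zero, a fixed fraction of data is always labeled "good", which is what lets the recipe train on good and bad rollouts together and then condition on the good label (with CFG-style guidance) at test time.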