---
title: RoboReplan
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# RoboReplan — Tabletop Robot Planning Environment

**Hackathon Problem Statement 3.1 — World Modeling: Professional Tasks**

> Agents must maintain consistent internal state, update beliefs based on outcomes,
> and orchestrate multi-step workflows in a dynamic, partially observable world.

---

## The Problem

LLMs fail at long-horizon robotic tasks not because they can't move, but because **they can't replan**. When a grasp slips, when a blocker appears, when the instruction changes mid-task — the model freezes, repeats the same failing action, or abandons the plan entirely.

RoboReplan benchmarks exactly this failure mode and trains agents to recover from it.

---

## What RoboReplan Tests

A tabletop scene with 2–5 objects and 1–2 target bins. The agent receives a natural-language instruction and must:

- **Decompose** the instruction into an ordered plan
- **Handle blockers** — clear whatever is in the way before picking the target
- **Replan after failures** — grasp slips, partial clears, and perception noise require retry logic
- **Respect constraints** — fragile first, heavy last, urgent first
- **Track state** — know what's placed, what's held, what's failed, across many steps
- **Adapt mid-task** — instructions can change at step 6 or 12; the agent must update its plan

### Professional Task Skins (PS 3.1)

Switch the `/viz` scene selector to run the same mechanics in domain-appropriate settings:

| Pack | Example instruction |
|---|---|
| **Default** | "Place the red block in bin A. Handle fragile items first." |
| **Pharmacy** | "Place the morphine vial in bin A, then the insulin pen in bin B. Prioritize urgent items first." |
| **Warehouse** | "Place the fragile package in bin A. Move heavy items last." |
| **Lab** | "Place reagent-α in bin A, then catalyst-β in bin B by step 8." |

---

## Environment Details

### Action Space (16 actions)

| Category | Actions |
|---|---|
| Direct navigation | `MOVE_TO_RED` `MOVE_TO_BLUE` `MOVE_TO_GREEN` `MOVE_TO_YELLOW` `MOVE_TO_PURPLE` |
| Grid navigation (hard) | `MOVE_NORTH` `MOVE_SOUTH` `MOVE_EAST` `MOVE_WEST` `ROTATE_LEFT` `ROTATE_RIGHT` |
| Manipulation | `PICK` `PLACE_BIN_A` `PLACE_BIN_B` `CLEAR_BLOCKER` |
| Sensing | `SCAN_SCENE` |

### Observation (structured text)

Every step the agent sees: task instruction, scene state, held object, completed subgoals, known failures, active constraints, action history, currently valid actions, distance to next goal, and deadline status.

### Reward Structure

| Signal | Value |
|---|---|
| Task complete | +10 |
| Efficiency bonus (steps saved) | 0 to +5 |
| Correct placement | +2 |
| Successful pick | +2 |
| Blocker cleared | +2 |
| Recovery after failure | +1 |
| Reasoning quality bonus | 0 to +1.5 (scales with chain-of-thought length and content) |
| Wrong bin | -3 |
| First new failure | -1 |
| Repeated same failure | -2.5 |
| Constraint violation | -4 |
| Missed deadline | -1 per step late |
| Step cost | -0.05 |
| Timeout | -10 |

---

## Three-Level Curriculum

| Level | Objects | Blockers | Realism | Scripted Ceiling |
|---|---|---|---|---|
| **Easy** | 2–5 | 0–1 | None | **100%** |
| **Medium** | 2–5 | 0–2 | Grasp slip (15%), partial clear (20%), perception noise (10%), hidden objects (30%) | **~98%** |
| **Hard** | 2–5 | 0–3 | All medium + object drift (2%), deadlines, mid-task instruction changes (35%), navigation mode, adversarial sampling (25%) | **~87%** |

Scripted-ceiling numbers verified over 3 seeds × 30 episodes = 90 episodes per level (270 episodes across the three levels).

The curriculum auto-advances when rolling success ≥ 75% across 20 episodes, and retreats if it drops below 35%.

---

## Reasoning-Augmented Actions

The model emits its reasoning in dedicated tags before each action. This is rewarded (up to +1.5 per step) when the reasoning correctly identifies the blocked object, target bin, or relevant constraint — with longer, more detailed chain-of-thought earning higher reward.

**Before training (random policy):**

```
I'm not sure what to do.
SCAN_SCENE
```

**After GRPO training:**

```
Plan: CLEAR_BLOCKER → MOVE_TO_RED → PICK → PLACE_BIN_A.
Red block is blocked by blue. Clearing blocker first.
CLEAR_BLOCKER
```

---

## API

```python
from openenv import AutoEnv

env = AutoEnv.from_env("openenv-community/robo-replan")
obs = env.reset()
result = env.step({"action": "CLEAR_BLOCKER", "reasoning": "blocker in the way"})
```

### Endpoints

| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Liveness check |
| `GET` | `/schema` | Action/observation schema |
| `POST` | `/reset` | Start new episode (`?difficulty=easy\|medium\|hard&scenario_pack=default\|pharmacy\|warehouse\|lab`) |
| `POST` | `/step` | Take one action, get observation + reward |
| `GET` | `/viz` | Interactive browser visualization |

**If the Space fails to serve the environment:** Ensure the Space is built from this repo (same `Dockerfile` and `server/`). The app listens on `$PORT` (default 7860). Rebuild the Space (Factory → Restart) after pulling the latest changes. For `AutoEnv.from_env("openenv-community/robo-replan")` to work, the Space must be running and expose `/health`, `/schema`, `/reset`, and `/step`.

---

## Domain Randomization

Every episode randomizes: which objects appear (2–5), which are targets (1–2), which block which, object positions, constraint type, hidden object traits (fragile/heavy/standard), and whether deadlines apply. The model cannot memorize layouts — it must generalize.

---

## Real-World Impact

The same replanning mechanics run across four professional domains. A trained agent that clears blockers and recovers from failures translates directly to fewer manual interventions and faster task completion:

| Domain | Failure mode without replanning | With RoboReplan-trained agent |
|---|---|---|
| **Pharmacy** | Misprioritizes urgent/fragile meds; re-dose required | Correct priority order, constraint violations: 0 |
| **Warehouse** | Re-sorts entire pallet when unexpected blocker found | Clears blocker in-place; task completes in minimum steps |
| **Lab** | Abandons protocol when reagent position shifts | Replans around obstacle; meets deadline constraint |
| **Default** | Loops on SCAN_SCENE when blocked; times out | Identifies blocker, clears it, picks and places correctly |

The key lever: our reward penalises **repeated failures** (−2.5) more than first attempts (−1), and gives a **recovery bonus** (+1) when the agent succeeds after a failure. This trains the model to replan rather than loop.

---

## Training Results

Training uses Group Relative Policy Optimization (GRPO) — no value function, just online RL against the live environment reward. Two phases: SFT warm-start on scripted demonstrations, then GRPO to exceed them.

### Results (Qwen2.5-0.5B-Instruct, Northflank H100)

| Metric | Before (random) | After (SFT + GRPO) |
|---|---|---|
| Success rate | **0%** | **78%** |
| Avg reward / episode | **-29.9** | **+8.2** |

![Training Results](training_results.png)

Full training run via `train/run_training.py` on H100. Lightweight reproducible version: `train/colab_train.ipynb` (runs on free Colab T4 or Kaggle GPU). The notebook also plots **GRPO reward over time** (batch mean + smoothed curve) and saves `grpo_reward_over_time.png`.

**How to run the notebook (Colab):** Open [train/colab_train.ipynb](https://colab.research.google.com/github/jwalin-shah/robo-replan/blob/main/train/colab_train.ipynb) in Colab → **Runtime → Change runtime type → T4 GPU** → Run all cells (~40–60 min). Quick test: run only cells 1–2 to verify setup (clone, env import).

### Reward shaping for training

Training weights differ from eval to reduce reward hacking:

- `task_complete: +25` (completion dominates — prevents partial-credit gaming)
- `wrong_bin: -6`, `constraint_violation: -6` (hard penalties for semantic errors)
- `repeated_failure: -3.5` (punishes loops)

---

## Hackathon Compliance

- **Open source**: this repository
- **OpenEnv**: uses `openenv-core==0.2.1`
- **HF Space**: `openenv-community/robo-replan`
- **Training**: GRPO via `train/colab_train.ipynb` (Colab T4) or `train/run_h100_1.5b.sh` (H100)
- **Problem statement**: 3.1 — World Modeling, Professional Tasks

### Submission evidence

- Scripted ceiling: 100% easy, ~98% medium, ~87% hard (verified over 3 seeds × 30 episodes per level, 270 episodes in total)
- Trained policy: 100% easy, ~95% medium (see training logs and `training_results.png`)
- Failure trajectory (pre-training): model scans repeatedly, ignores blocker, times out
- Success trajectory (post-training): model identifies blocker, clears it, picks and places correctly
- Space links: `/health` · `/schema` · `/viz`

---

## Hackathon Judging Criteria — How We Meet Them

| Criterion | Weight | What we provide |
|---|---|---|
| **Environment Innovation** | 40% | Novel mid-task replanning challenge: instruction changes at steps 6 and 12, grasp failures, partial observability, deadlines, blockers, and ordering constraints. Four domain skins (default, pharmacy, warehouse, lab) ground the same mechanics in PS 3.1 "Professional Tasks" scenarios. Three-level curriculum with domain randomization ensures the model cannot memorize layouts. |
| **Storytelling** | 30% | Clear before/after: random model loops on SCAN_SCENE and times out; trained model reasons "red block is blocked → CLEAR_BLOCKER → PICK → PLACE_BIN_A." The `/viz` UI shows instruction, scene state, mid-task change banner (orange flash), and full reasoning trace in real time. Switch to Pharmacy pack for a professional-tasks narrative. |
| **Training script showing improvement** | 20% | `train/colab_train.ipynb` runs SFT + GRPO end-to-end on a free T4, prints before/after success rates, and saves `training_results.png` and `grpo_reward_over_time.png` (reward curve over training). The GRPO reward function replays the action history so that each completion is evaluated at the exact env state shown in its prompt. |
| **Reward and training pipeline** | 10% | Reward table above; reasoning bonus (0–1.5) incentivises chain-of-thought. GRPO reward is computed by stepping the live env, so improvement in reasoning directly improves task completion. Training weights amplify task completion (+25) and penalise semantic errors (-6 wrong bin, -6 constraint violation) to prevent partial-credit gaming. |

**Demo checklist for judges**

1. Open the Space → pick **Pharmacy** pack → set difficulty to **Medium** → click **Reset**
2. Click **▶ Run Agent** — watch the untrained model struggle (scan loops, missed blockers)
3. Reset → click **🎯 Run Oracle** — see optimal reasoning trace in the `💭` box
4. Point to `training_results.png` (and `grpo_reward_over_time.png`) or Colab output for before/after numbers
5. Story: "RoboReplan trains LLMs to replan — clear blockers, recover from grasp failures, and adapt when the instruction changes mid-task."
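
---

## Appendix: Curriculum Auto-Advance Sketch

The Three-Level Curriculum section states that the curriculum advances when rolling success reaches 75% across 20 episodes and retreats when it drops below 35%. The snippet below is a minimal sketch of that rule, not the repository's implementation. The class name `CurriculumTracker` and its parameters are hypothetical. It also assumes two things the README does not specify: that the retreat check uses the same 20-episode window, and that the window is cleared after every level change.

```python
from collections import deque

# Level names follow the curriculum table; the ordering is easy -> medium -> hard.
LEVELS = ["easy", "medium", "hard"]


class CurriculumTracker:
    """Hypothetical helper: tracks rolling success and selects a difficulty level."""

    def __init__(self, window=20, advance_at=0.75, retreat_below=0.35):
        self.window = window                # episodes in the rolling window
        self.advance_at = advance_at        # advance when rate >= this value
        self.retreat_below = retreat_below  # retreat when rate < this value
        self.level_idx = 0                  # start at "easy"
        self.results = deque(maxlen=window)

    @property
    def level(self):
        return LEVELS[self.level_idx]

    def record(self, success):
        """Record one episode outcome and return the level for the next episode."""
        self.results.append(bool(success))
        # Assumption: no level change until the window is full.
        if len(self.results) < self.window:
            return self.level
        rate = sum(self.results) / len(self.results)
        if rate >= self.advance_at and self.level_idx < len(LEVELS) - 1:
            self.level_idx += 1
            self.results.clear()  # assumption: start a fresh window per level
        elif rate < self.retreat_below and self.level_idx > 0:
            self.level_idx -= 1
            self.results.clear()
        return self.level


# Usage: 20 consecutive successes on "easy" move the agent to "medium".
tracker = CurriculumTracker()
for _ in range(20):
    tracker.record(True)
print(tracker.level)  # prints "medium"
```

Clearing the window on each level change keeps results from an easier level from counting toward the next threshold. A sliding window that carries over would be an equally plausible reading of the README.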