The Ralph Loop Implementation Guide — From a Bash One-Liner to Cross-Model Review


MJ · 7 min read

Starting from while true + cat task.md, building up through stop hooks, file-based state persistence, and cross-model worker-reviewer separation. Three practical examples — coding migration, prompt refinement, and test coverage expansion — plus analysis of the open-source ecosystem and Korea's Ralphathon.

The previous post covered the evolutionary arc from RLHF to context collapse — the structural limits that every long-running AI agent loop eventually hits. This post goes deeper into how the Ralph Loop solves those problems, at the implementation level.

Expect code examples, file structures, architecture diagrams, and practical applications that extend well beyond coding into prompt refinement and quality engineering.


The Minimal Implementation: while true + task.md

The core of the Ralph Loop is disarmingly simple.

```bash
#!/bin/bash
while true; do
  cat task.md | claude -p
done
```

Three lines. That is the entire thing. But behind this simplicity sits a deliberate design philosophy.

What Each Line Does

| Code | Role |
|---|---|
| `while true` | Infinite loop. Restarts the agent immediately upon termination |
| `cat task.md` | Pipes the original task specification into stdin on every iteration |
| `claude -p` | Non-interactive mode. Executes the prompt and exits automatically |

Every time a new iteration starts, the previous session’s conversation history is completely destroyed. Think of it like memory allocation (malloc) for LLMs — the model reconstructs only the information it needs by reading files from disk. Faulty reasoning from prior attempts, broken code patterns, irrelevant error logs — all of it evaporates when the session ends. Only real, tangible progress in the codebase survives.

Stop Hook: Preventing the Agent from Escaping

The basic while loop has one flaw. When the LLM declares “task complete” and exits, the loop restarts the same task from scratch. You end up re-running finished work.

The Stop Hook intercepts the agent’s termination attempt and re-injects the original task specification.

```bash
#!/bin/bash
# Claude Code stop hook example (.claude/hooks/stop.sh)
# Runs when the agent attempts to exit
if ! grep -q "ALL_TASKS_COMPLETE" .ralph/progress.md; then
  echo "Incomplete tasks remain. Re-read task.md." >&2
  exit 2  # Blocking exit code → termination is refused, agent keeps running
fi
```

With this in place, the agent cannot exit until every task is genuinely done. It is a structural safeguard against success bias — the tendency of LLMs to declare victory prematurely.
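For the hook to fire, it also has to be registered in Claude Code's settings. A minimal registration might look like the following — the `Stop` event and command-hook shape follow Claude Code's hooks configuration, but treat the exact file as a sketch for a script living at `.claude/hooks/stop.sh`:

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": ".claude/hooks/stop.sh" }
        ]
      }
    ]
  }
}
```

This goes in `.claude/settings.json` (project-level) so the safeguard travels with the repository.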


The File System as Long-Term Memory

If the conversation history is wiped on every iteration, the agent loses track of where it is. The Ralph Loop solves this by treating the local file system as the model’s persistent memory. Files replace the context window as the source of continuity.

State File Structure

```
project/
├── task.md              ← Original PRD. Immutable source of truth
├── .ralph/
│   ├── iteration.txt    ← Current loop count
│   ├── work-summary.txt ← What the Worker did this iteration
│   ├── feedback.txt     ← Reviewer's error notes and revision directives
│   └── progress.md      ← Checklist-format progress tracker
├── src/                 ← Actual code (managed by Git)
└── tests/               ← Test code (the success/failure oracle)
```

| File | Created by | Read by | Purpose |
|---|---|---|---|
| task.md | Human | Worker (every loop) | Immutable task spec. Starting point for every iteration |
| iteration.txt | System | Worker | Tracks which loop number the system is on |
| work-summary.txt | Worker | Reviewer | Summarizes what changed this iteration |
| feedback.txt | Reviewer | Next Worker | Previous failure analysis + revision direction |
| progress.md | Worker | Worker + Stop Hook | Checklist. Tracks incomplete items |

The governing principle: The LLM’s context window is volatile working memory. The file system is persistent long-term memory.
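Wiring those state files into the loop takes only a few more lines than the minimal harness. A sketch — the `ralph_loop` helper and `stub_agent` are illustrative names, and in real use the agent command would be `claude -p`:

```shell
#!/bin/bash
# Sketch: the minimal loop extended with the .ralph/ state files.
# ralph_loop and stub_agent are illustrative; in real use the agent
# command passed in would be `claude -p`.
set -u

ralph_loop() {
  local agent_cmd="$1" max="$2" n
  mkdir -p .ralph
  echo 0 > .ralph/iteration.txt
  while true; do
    n=$(cat .ralph/iteration.txt)
    if [ "$n" -ge "$max" ]; then
      echo "Iteration cap reached — stopping for human review." >&2
      return 1
    fi
    echo $((n + 1)) > .ralph/iteration.txt
    # Fresh session each pass: the immutable spec plus last feedback on stdin.
    cat task.md .ralph/feedback.txt 2>/dev/null | $agent_cmd
    # The checklist file, not the model's say-so, decides completion.
    grep -q "ALL_TASKS_COMPLETE" .ralph/progress.md 2>/dev/null && return 0
  done
}

# Exercise the harness with a stub agent that finishes on its first pass.
cd "$(mktemp -d)"
echo "spec" > task.md
stub_agent() { cat > /dev/null; echo "ALL_TASKS_COMPLETE" > .ralph/progress.md; }
ralph_loop stub_agent 10 && echo "loop exited cleanly"
```

Note that `iteration.txt` is bumped before the agent runs, so a crash mid-iteration still counts toward the cap — the conservative choice for an unattended loop.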


Cross-Model Review Architecture

Generating code in an infinite loop without oversight can destroy a system. Modern Ralph Loop implementations introduce a cross-model review architecture that physically separates the Worker and the Reviewer.

```mermaid
sequenceDiagram
    participant Loop as Bash Loop
    participant W as Worker (Claude Sonnet)
    participant R as Reviewer (GPT-4o)
    participant FS as File System

    Loop->>FS: cat task.md
    FS->>W: task.md + feedback.txt
    W->>FS: Code changes + work-summary.txt
    W->>Loop: Exit
    Loop->>FS: cat work-summary.txt
    FS->>R: work-summary.txt + changed code
    R->>R: Run tests + code review

    alt Pass
        R->>FS: feedback.txt = "SHIP"
        R->>Loop: Exit
        Loop->>Loop: Break loop
    else Fail
        R->>FS: feedback.txt = "REVISE: specific critique"
        R->>Loop: Exit
        Loop->>Loop: Next iteration
    end
```

Worker Recipe (ralph-work.yaml)

The Worker model (e.g., Claude Sonnet) executes the following sequence:

  1. Read task.md for the full objective
  2. Read feedback.txt for the Reviewer’s prior critique
  3. Read progress.md to identify incomplete checklist items
  4. Execute code changes
  5. Write a change summary to work-summary.txt
  6. Terminate the session
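That sequence can be written down almost verbatim as the Worker's standing instructions. A hedged sketch of what the recipe's prompt body might contain — the wording is illustrative, not a fixed recipe schema:

```markdown
You are the Worker in a Ralph Loop. Each session starts with no memory.

1. Read task.md — it is the immutable spec. Never edit it.
2. Read .ralph/feedback.txt — the Reviewer's critique of the last iteration.
3. Read .ralph/progress.md — work only on unchecked items.
4. Make the smallest change that addresses the feedback.
5. Summarize what you changed in .ralph/work-summary.txt.
6. Exit. Do not review or grade your own work.
```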

Reviewer Recipe (ralph-review.yaml)

An entirely different model (e.g., GPT-4o) boots with a clean context.

  1. Inspect the Worker’s changes via git diff
  2. Run the test suite
  3. Review code quality

If all requirements are met → write SHIP to feedback.txt → loop exits. If defects are found → write REVISE + specific feedback to feedback.txt → next Worker iteration fires.
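The SHIP/REVISE handshake above reduces to a two-phase loop. A sketch — `cross_model_loop` and the stub functions are illustrative, and a real setup would pass the two models' CLI invocations as the command parameters:

```shell
#!/bin/bash
# Sketch of the Worker → Reviewer handshake from the sequence diagram.
# cross_model_loop and the stubs are illustrative; real use would pass
# each model's CLI invocation as worker_cmd / reviewer_cmd.
set -u

cross_model_loop() {
  local worker_cmd="$1" reviewer_cmd="$2" max="$3" i=0
  mkdir -p .ralph
  while [ "$i" -lt "$max" ]; do
    i=$((i + 1))
    # Phase 1: Worker sees the spec plus the Reviewer's last critique.
    cat task.md .ralph/feedback.txt 2>/dev/null | $worker_cmd
    # Phase 2: a different model judges the summary with a clean context.
    cat .ralph/work-summary.txt 2>/dev/null | $reviewer_cmd
    # Only the Reviewer's verdict ends the loop.
    grep -q "^SHIP" .ralph/feedback.txt 2>/dev/null && return 0
  done
  return 1   # iteration cap hit without a SHIP verdict
}

# Exercise with stubs: the Worker reports a change, the Reviewer approves.
cd "$(mktemp -d)"
echo "spec" > task.md
worker()   { cat > /dev/null; echo "migrated src/cli.py" > .ralph/work-summary.txt; }
reviewer() { cat > /dev/null; echo "SHIP" > .ralph/feedback.txt; }
cross_model_loop worker reviewer 5 && echo "Reviewer verdict: SHIP"
```

Because the Reviewer only ever reads `work-summary.txt` and the diff, it never inherits the Worker's reasoning — the clean context is enforced by the plumbing, not by instruction.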

Why Use a Different Model?

When the same model generates and reviews, self-confirmation bias takes over. Claude reviewing Claude’s code is disproportionately likely to say “looks good.” Using a different model for the review step structurally breaks this feedback loop. The Reviewer has no memory of the generation process, no sunk-cost attachment, and a different set of internal biases — which is precisely what makes the cross-check effective.


Practical Example 1: CLI Tool Migration (Coding)

The most canonical Ralph Loop use case.

Scenario

A TypeScript-based CLI tool needs a full migration to Python. 42 files, 180 functions.

task.md

```markdown
# Task: TypeScript → Python Migration

## Goal
Migrate the CLI tool from TypeScript to Python 3.12.
All existing tests must pass in the Python version.

## Checklist
- [ ] src/cli.ts → src/cli.py
- [ ] src/parser.ts → src/parser.py
- [ ] src/formatter.ts → src/formatter.py
- [ ] ... (42 files total)
- [ ] All unit tests pass
- [ ] All integration tests pass
- [ ] Type hints complete (mypy strict)
```

Execution Log

```
Iteration 1: cli.py, parser.py migrated. Tests 12/42 passing.
Iteration 2: formatter.py migrated. Feedback: "datetime handling mismatch." Fixed. Tests 20/42.
Iteration 3: Remaining modules migrated. Tests 35/42.
Iteration 4: Debug 7 failing tests. Tests 40/42.
Iteration 5: Fix 2 edge cases. Tests 42/42. SHIP.
```

Five iterations, each with a fresh context. Had this been attempted in a single session, context collapse would have started around iteration 3. By iteration 5, the model would be hallucinating fixes for bugs it introduced three turns ago.


Practical Example 2: Prompt Quality Refinement (Non-Coding)

The Ralph Loop is not coding-specific. It applies identically to iterative refinement of prompts, configurations, and workflows.

Scenario

A Slack bot’s daily news curation output is unsatisfactory. Insights are shallow, sources are biased toward a narrow set, and irrelevant articles keep leaking through.

File Structure

```
project/ralph/
├── PROMPT.md            ← Worker instructions
├── quality-spec.md      ← Quality rubric (= task.md equivalent)
├── sample-input.md      ← Test news data
├── iteration.txt        ← Loop count
├── feedback.md          ← Reviewer feedback
├── evaluation.md        ← (Worker-generated) scoring results
└── simulated-output.md  ← (Worker-generated) simulated curation
```

Per-Iteration Worker Sequence

  1. Read feedback.md for the Reviewer’s prior notes
  2. Read the current curation prompt (src/crons/daily-signal.ts)
  3. Simulate output against sample-input.md news data
  4. Self-score against quality-spec.md criteria (0-10 scale)
  5. If score is below 8, revise only the 1-2 weakest areas of the prompt
  6. Log changes to feedback.md

quality-spec.md Example (Excerpt)

This is the key differentiator. The quality specification file acts as both the success criterion and the scoring rubric, replacing traditional test suites for non-code refinement tasks.

## Scoring Criteria (0-10)

| Dimension | Weight | Criteria |
|---|---|---|
| Relevance | 30% | Ratio of AI strategy/methodology items out of 15 total |
| Insight depth | 25% | Quality of "why this matters" analysis |
| Source diversity | 15% | Coverage of Korean dev communities + international AI blogs |
| Noise removal | 15% | Count of irrelevant articles that leaked through |
| Headline specificity | 15% | Presence of numbers/names + analytical framing |

SHIP threshold: Total score 8.0 or above

The quality-spec.md pattern is worth internalizing. It does for prompt engineering what unit tests do for code — it gives the loop a machine-evaluable success criterion. Without it, the Reviewer has no objective basis for SHIP/REVISE decisions, and the loop degenerates into subjective cycling.
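The rubric above can be collapsed into a machine-checkable verdict. A sketch — the weights mirror the table, while the `score_verdict` helper and its 0-10 inputs are illustrative assumptions:

```shell
#!/bin/bash
# Sketch: turning the quality-spec.md rubric into a SHIP/REVISE verdict.
# Weights (30/25/15/15/15) come from the table; score_verdict and its
# per-dimension 0-10 inputs are illustrative.

score_verdict() {
  # Args: relevance insight source noise headline (each scored 0-10)
  local total
  # Integer weights summed, then divided by 100 in awk, since bash
  # has no floating-point arithmetic of its own.
  total=$(awk -v r="$1" -v i="$2" -v s="$3" -v n="$4" -v h="$5" \
    'BEGIN { printf "%.2f", (r*30 + i*25 + (s + n + h)*15) / 100 }')
  # SHIP threshold from quality-spec.md: total score 8.0 or above.
  if awk -v t="$total" 'BEGIN { exit !(t >= 8.0) }'; then
    echo "SHIP $total"
  else
    echo "REVISE $total"
  fi
}

score_verdict 9 8 7 8 8    # strong run: the weighted total clears 8.0
score_verdict 5 6 7 8 8    # weak relevance/insight drags the total under
```

The Reviewer then writes this verdict string straight into feedback.md, which is exactly the SHIP/REVISE token the loop harness greps for.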

Execution Log

```
Iteration 1: Added "explain why this matters" instruction to insight prompt. Score 5.2 → 6.1
Iteration 2: Redesigned categories (added "method" category). Score 6.1 → 7.0
Iteration 3: Separated "Must-Read 3" with 2-sentence insight requirement. Score 7.0 → 7.8
Iteration 4: Strengthened EXCLUDE rules for noise filtering. Score 7.8 → 8.3. SHIP.
```

Zero lines of application code changed. Four iterations of prompt-only modification brought the quality score above threshold. This is the “Prompt Ralph Loop” — the same architecture, the same file-based state management, the same Worker-Reviewer separation, applied to a fundamentally different artifact type.

The implications are worth stating explicitly: any artifact that can be evaluated against a rubric can be Ralph Looped. Documentation quality, configuration tuning, system prompt optimization, translation accuracy — anywhere you can define a quality-spec.md, you can run the loop.


Practical Example 3: Automated Test Coverage Expansion

Scenario

An existing codebase sits at 40% test coverage. The target is 80% or above.

task.md

```markdown
# Task: Test Coverage 40% → 80%

## Goal
Add unit tests to reach 80% line coverage.
Do not modify existing source code — tests only.

## Rules
- One test file per source file
- Use existing test patterns in tests/ directory
- Run: npm test -- --coverage after each change

## Progress tracking
Update this checklist after each file:
- [ ] src/auth.ts (0% → ?)
- [ ] src/api.ts (30% → ?)
- [ ] src/utils.ts (60% → ?)
- ... (20 files)
```

This task is ideal for the Ralph Loop. Each iteration is independent (writing tests for individual files), the success criterion is unambiguous (coverage percentage), and failures carry over gracefully — a test file that fails can be debugged in the next iteration without any context from the previous attempt.

The coverage number itself functions as a natural progress tracker. The Worker can read npm test -- --coverage output at the start of each loop and immediately identify which files still need attention. No separate progress parsing required.
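Reading that number programmatically is straightforward. A sketch, assuming Jest's json-summary reporter writes `coverage/coverage-summary.json` — the sample file below is fabricated to make the sketch self-contained:

```shell
#!/bin/bash
# Sketch: using the coverage number as the loop's progress tracker.
# Assumes Jest's json-summary reporter; the sample JSON is fabricated
# so the sketch runs without an actual test suite.

workdir=$(mktemp -d)
mkdir -p "$workdir/coverage"
cat > "$workdir/coverage/coverage-summary.json" <<'EOF'
{"total":{"lines":{"total":200,"covered":164,"pct":82}}}
EOF

coverage_pct() {
  # Pull the first "pct" value — total line coverage in json-summary output.
  grep -o '"pct":[0-9.]*' "$1" | head -1 | cut -d':' -f2
}

pct=$(coverage_pct "$workdir/coverage/coverage-summary.json")
if awk -v p="$pct" 'BEGIN { exit !(p >= 80) }'; then
  verdict="ALL_TASKS_COMPLETE"   # what the Worker appends so the stop hook releases
else
  verdict="CONTINUE"
fi
echo "$verdict at ${pct}% line coverage"
```

The same extraction can run at the top of each iteration, giving the Worker its "where am I" answer without any separate progress parsing.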


The Open-Source Ecosystem

The Ralph Loop concept has rapidly spawned an ecosystem of implementations and variations.

| Project | Description | Key Feature |
|---|---|---|
| snarktank/ralph | Community main implementation (10K+ stars) | Bash-based, supports multiple AI CLIs |
| vercel-labs/ralph-loop-agent | Vercel Labs AI SDK integration | Distributed as an npm package |
| PageAI-Pro/ralph-loop | Docker-sandboxed production implementation | Safe execution in isolated environments |
| ClaytonFarr/ralph-playbook | Methodology guide | Standardized Worker-Reviewer recipes |
| mikeyobrien/ralph-orchestrator | Rust-based orchestrator | Supports 7 AI backends |
| Anthropic ralph-wiggum | Official Claude Code plugin | Endorsed by Boris Cherny (lead engineer) |

Variant patterns are also proliferating:

  • RALPHA: Recursive Author Loop for Cursor
  • Ralph Mode: Built-in mode within Deep Agents
  • LangChain adapters and multi-CLI integration layers

Explosive Adoption in the Korean Ecosystem

The Korean developer community’s reception of the Ralph Loop deserves its own section — not as a regional footnote, but because Korea produced several firsts that shaped how the pattern is practiced globally.

Developer Slang

geu jagop geunyang ralph dollyeonwa (그 작업 그냥 랄프 돌려놔) = “Just ralph-loop it overnight”

The phrase has become shorthand in Korean dev circles for delegating any repeatable task to an autonomous AI loop. It carries the same casual authority as “just ship it” — the assumption being that if the task spec is clear, the Ralph Loop will converge.

Major Korean Coverage

The pattern was covered extensively across Korean technical media: WikiDocs (Jaehong Park’s Silicon Valley Blog), AI Times, PyTorch Korea, Inflearn (video tutorials), Dale Seo’s engineering blog, GeekNews (news.hada.io), and TILNOTE. Coverage ranged from beginner walkthroughs to production architecture deep-dives.

Alibaba Cloud also published “From ReAct to Ralph Loop: A Continuous Iteration Paradigm for AI Agents” — a technical deep-dive signaling enterprise-level interest in the pattern.

The Ralphathon (March 2026)

Korea hosted the world’s first hackathon built entirely around the Ralph Loop.

  • Organizers: Team Attention + Kakao Ventures, sponsored by OpenAI
  • Format: Humans design specs only. AI codes overnight.
  • Winning team’s output: AI wrote 100,000 LOC — 70% of which was test code. Human keyboard input: zero.

The 70% test ratio is the most telling detail. In the Ralph Loop’s Worker-Reviewer architecture, tests function as the success criterion. The agent naturally writes extensive tests because that is what the Reviewer uses to decide SHIP vs. REVISE. More tests mean a faster path to SHIP. The architecture incentivizes test coverage by design, not by mandate.


Implementation Checklist

A preflight checklist for actually setting up a Ralph Loop.

Required

  • Clear task.md: The agent must be able to determine “what to do” instantly at the start of each loop
  • Machine-evaluable success criteria: Test pass/fail, coverage numbers, build success — anything an automated check can judge
  • Git initialized: Every iteration’s changes must be tracked as commits
  • Stop Hook configured: Prevents premature agent termination
  • Cross-Model Review: Separate Worker and Reviewer onto different models
  • Iteration cap: Prevent true infinite loops (5-10 iterations is a reasonable ceiling)
  • CLAUDE.md / .cursorrules: Project conventions specified in files the agent reads on startup

Anti-Patterns

  • Running without tests → infinite flailing (no way to judge success)
  • Vague task.md → each loop diverges in a different direction
  • Modifying source code without running tests → the Worker declares victory based on vibes

```mermaid
graph TD
    A{"Tests exist?"} -->|Yes| B{"task.md is clear?"}
    A -->|No| X["Stop. Write tests first."]
    B -->|Yes| C{"Success criteria automatable?"}
    B -->|No| Y["Stop. Make task.md concrete."]
    C -->|Yes| D["Ready to Ralph Loop."]
    C -->|No| Z["Partial automation. Manual review required."]
```

Key Takeaways

The technical essence of the Ralph Loop, in one sentence:

An infinite iteration architecture that treats the context window as volatile working memory, the file system as persistent long-term memory, and Git as an audit trail.

This simple structure addresses context collapse at a fundamental level. It applies not only to coding but equally to prompt refinement, configuration tuning, documentation quality improvement, and any iterative refinement workflow where progress can be evaluated against a specification.

The next post looks beyond the Ralph Loop at self-evolving agent systems — where agents update their own weights — and the shifting role of the AI-era developer.


Previous: The Evolution of AI Agent Loops — From RLHF to the Ralph Loop

Next: Beyond the Ralph Loop — Self-Evolving Agents and the Changing Role of the AI Developer
