Agents & Architecture

22 posts

Inside the design of frontier agent tools — frameworks, orchestration, memory, Claude Code — distilled into patterns you can transfer to your own system.

Three questions on one screen: where a single agent suffices and where a multi-agent setup is necessary; what the actual limits of context, memory, and evaluation cycles are; and how the economics of "AI tools building AI tools" work.

Written for solo builders shipping their own agents, PMs and architects evaluating internal agent rollouts, and analysts treating AI tool architecture as study material. The focus is decision frameworks, not code snippets.

Agents & Architecture

Where Did the Agent First Go Wrong: Evaluation Drops to the Claim Level

A read of DRIFT and TELBench, which drop to the claim level to flag the error spans that affect the answer. Detection of harmful spans climbs past the halfway mark, but locating the first error stalls around 20%. Series Part 4, the finale.

Jun 22, 2026

Agents & Architecture

An Agent You Wired Yourself Does Not Come with Its Own Evaluation

A method-first read of the three papers for evaluating custom agents (MAST, AgentRx, DRBench): a vocabulary for failure, a way to pinpoint where it broke, a testbed shaped like your data. How a builder borrows research-scale work. Series Part 3.

Jun 12, 2026

Agents & Architecture

Three Altitudes for Evaluating a Well-Built Agent: Score, Claim, Substrate

A method-first read of the three papers that evaluate well-packaged agents (TRACE, DeepHalluBench, TRAIL): the altitude that rewrites the score, the one that verifies claims, the one that changes the substrate. Series Part 2.

Jun 9, 2026

All Posts

Jun 22, 2026 · Agents & Architecture

Where Did the Agent Go Wrong: From Answer Accuracy to Process Evaluation

Starting from the high-score illusion where the top-accuracy model ranks last on utility, this piece lays out why evaluation is moving to process and maps the methods of seven papers. Series Part 1.

May 9, 2026 · Agents & Architecture

Multi-Agent Workflow - 6 Patterns from Supervisor to Swarm

Six core patterns of multi-agent workflow (Supervisor / Sequential / Hierarchical / Network / Swarm / Map-Reduce), grounded in primary sources from LangGraph, CrewAI, OpenAI, and Anthropic. Each pattern's topology and fit, plus a decision framework for production.

May 8, 2026 · Agents & Architecture

Multi-Agent for Korean Financial Marketing - A Phased Sketch

A phased view of how a Korean financial marketing pipeline (briefing → research → planning → compliance → launch) could be moved onto a multi-agent architecture. Phase 1: RAG agent. Phase 2: collaborative multi-agent (CrewAI as one example) with human-in-the-loop (HITL) review. Phase 3: macro orchestrator + memory layer + observability. Each phase is paired with a component sketch, the patterns worth watching, and where it tends to break - held loosely as one possible path among many.

May 7, 2026 · Agents & Architecture

Single vs Multi-Agent - Same Sources, Opposite Conclusions

Anthropic reports 90.2% gains from multi-agent. Cognition publishes 'Don't Build Multi-Agents'. A side-by-side reading of the two camps' primary sources, plus a decision framework for single vs multi.

Apr 29, 2026 · Agents & Architecture

The Advisor Pattern Is a Price Tag, Not Architecture

What surfaces on the second read of Anthropic's Advisor Tool. This isn't new architecture - it's a temporary fix shaped by 2026 pricing. A pattern that disappears once Opus prices drop, and eleven other papers from the same period are quietly moving the same way. The anchor of the series.

Apr 28, 2026 · Agents & Architecture

How to Organize Agents - Hierarchy, Graph, Swarm, Routing, and Skepticism

How do you organize the sliced agents? Four structures, paired with one skeptical paper that argues 'LLM swarms aren't really swarms.' The conclusion lands where it usually does - structure choice gets dragged along by pricing. Part 2 of the series.

Apr 27, 2026 · Agents & Architecture

How to Slice an Agent - Five Axes from 2026 Research

The five ways 2026 papers slice an agent, side by side. Two things stand out by the end: Role, Skill, and Judge are different names for the same concept, and the time-axis literature is nearly empty. Part 1 of the series.

Apr 24, 2026 · Agents & Architecture

Build Your Own Self-Tuning Loop - Reference Implementation Guide

Self-Tuning Loop 4 steps (Generate → Capture → Analyze → Evolve) extracted as a universal module. Supabase DDL, diff capture utilities, analysis/evolution prompts, email/blog examples, GitHub reference implementation.

Apr 21, 2026 · Agents & Architecture

Cron + Telegram + Claude: Anatomy of a Self-Improving System at $0

The actual production system implementing Self-Tuning Loop from Part 1. Data collection (35 sources), AI curation, Telegram input pipeline, weekly auto-review, and Syncthing-based zero-deploy prompt evolution.

Apr 18, 2026 · Agents & Architecture

The Wasted Learning Signal - The Gap Between AI Drafts and What You Actually Publish

Introducing Self-Tuning Loop: capture implicit feedback from human edit diffs, analyze patterns periodically, and auto-evolve prompt guidelines. Includes academic gap analysis (DSPy, TextGrad, POHF).

Apr 3, 2026 · Agents & Architecture

Agent System Design Canvas - 12 Production Patterns Proven by the Claude Code Leak

6-Layer agent system design canvas and 8+4 core patterns from the Claude Code source leak (512K lines TS). From the Circuit Breaker that stopped 250K/day wasted API calls with 3 lines, to 23-step bash AST security.

Apr 3, 2026 · Agents & Architecture

What the Claude Code Leak Revealed: Anatomy of an AI Agent

The March 31, 2026 npm sourcemap incident revealed Claude Code internals. 4-phase execution, 7 modes, and the 11-step Agent Loop analyzed.

Apr 3, 2026 · Agents & Architecture

KAIROS, Auto-Dream, Coordinator: What Unreleased Features Reveal About AI's Future

44 feature flags, 20 externally inactive. KAIROS, Auto-Dream, UltraPlan, Coordinator, Bridge, Daemon, UDS Inbox, Buddy, plus anti-distillation and undercover mode.

Apr 3, 2026 · Agents & Architecture

The Memory System I Built Looked Like Claude Code's Internal Design

Claude Code's internal memory system (4-type persistent memory, Auto-Dream, Auto-Compact) vs. an independently built 3-layer system (documents → index → semantic search). validate_placement() is the differentiator Claude Code doesn't have.

Apr 3, 2026 · Agents & Architecture

52 Tools, 23-Step Security: Inside an Agent's Tool System

52 built-in tools' common interface, 10-step execution pipeline, safe=parallel/unsafe=sequential concurrency model, 5-stage permission pipeline, and 888KB Tree-sitter 23-step bash security.

Mar 31, 2026 · Agents & Architecture

Beyond Ralph Loop - Self-Evolving Agents and the Shifting Role of AI Developers

Ralph Loop solved context rot but remains prompt-bound. This post maps the trajectory from ALAS autonomous parameter updates to Self-Evolving Agent loops, Multi-Agent Swarms with World Models, Korea's Ralphathon results (100K LOC, 70% tests, zero human keystrokes), and the concrete shift in developer roles from implementation to specification and verification.

Mar 31, 2026 · Agents & Architecture

The Ralph Loop Implementation Guide - From a Bash One-Liner to Cross-Model Review

Starting from while true + cat task.md, building up through stop hooks, file-based state persistence, and cross-model worker-reviewer separation. Three practical examples - coding migration, prompt refinement, and test coverage expansion - plus analysis of the open-source ecosystem and Korea's Ralphathon.

Mar 31, 2026 · Agents & Architecture

The Evolution of AI Agent Loops - From RLHF to Ralph Loop

RLHF, ReAct, Reflexion, LangGraph/AutoGen, Context Rot, Ralph Loop. Six generations of agent loop architecture - what each solved, what each broke, and why a Bash while-loop turned out to be the answer.

Mar 13, 2026 · Agents & Architecture · WICHI

LLM-as-Judge - Evaluating AI Responses with AI

Analysis of the LLM-as-Judge pattern for evaluating AI response quality, featuring multidimensional metric design, reliability verification, and strategies for position and verbosity bias.