The five ways 2026 papers slice an agent, side by side. Two things stand out by the end: Role, Skill, and Judge are different names for the same concept, and the time-axis literature is nearly empty. Part 1 of the series.
The question before orchestration
Almost every paper on agent orchestration opens with the same question: how do you coordinate multiple agents? The answers tend to look similar — hierarchies, graphs, routing, swarms.
Hidden inside that question is one that comes a step earlier. Before coordination, what are you slicing? The object of coordination has to be defined before the coordination method can be argued.
Read through the papers from Q1 and Q2 2026 and most of them are not about coordination. They are about slicing. The core question is which axis to cut along. The axis decides what the rest of the orchestration design has to do.
This piece is part 1 of the series. It walks through the five axes the 2026 papers actually use, and adds three observations layered on top. If the part 3 anchor made the case that all of this comes down to reversal, part 1 examines the raw material one axis at a time.
The five axes
The six papers cluster onto five axes.
| Axis | Question | Representative paper | Unit |
|---|---|---|---|
| Role | Who plays which role | Multi-role GUI Agents (arxiv 2604.13488) | Roles |
| Skill (external) | Which capabilities are stored as external modules | WebXSkill (arxiv 2604.13318) | Reusable skills |
| Skill (internal) | Which capabilities are absorbed into context | SKILL0 (huggingface 2604.02268) | Internalized skills |
| Time | Does learning continue after deployment | ALTK-Evolve (IBM Research) | Time-axis learning |
| Judge | Who evaluates the agent’s output | AJ-Bench (huggingface 2604.18240) | Judges |
| Planner-Executor | Whether to separate thinking from doing | Plan-and-Act (arxiv 2503.09572) | Planning vs action |
Six papers map onto five axes because Skill splits into two directions (external and internal). Whichever axis you cut along, the resulting system looks substantially different.
Role — slice by role
Multi-role GUI Agents Orchestration (arxiv 2604.13488) takes a direct approach. Break a GUI task into smaller roles and assign each role to a lightweight specialized agent. One reads the screen, one interprets intent, one executes mouse and keyboard actions.
The premise is simple: GUI manipulation tasks incur too much context-switching overhead inside a single model. Splitting by role lets each agent hold only its own narrow context, reducing total token use.
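A minimal sketch of what role-based decomposition looks like in code, assuming a three-role hand-off. The role names, the pipeline order, and the toy matching logic are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass

# Hypothetical sketch: three lightweight agents, each holding only the
# context its role needs, chained in a fixed hand-off order.

@dataclass
class Observation:
    screen_text: str
    user_goal: str

def screen_reader(obs: Observation) -> dict:
    """Role 1: reads the screen, emits a structured view only."""
    return {"visible_elements": obs.screen_text.split(", ")}

def intent_interpreter(obs: Observation, view: dict) -> str:
    """Role 2: maps the user goal onto a visible element."""
    for element in view["visible_elements"]:
        if element.lower() in obs.user_goal.lower():
            return f"click:{element}"
    return "noop"

def executor(action: str) -> str:
    """Role 3: turns the abstract action into a device command."""
    kind, _, target = action.partition(":")
    return f"mouse.click({target!r})" if kind == "click" else "idle"

obs = Observation(screen_text="Search, Cart, Checkout", user_goal="open the cart")
command = executor(intent_interpreter(obs, screen_reader(obs)))
```

Each function sees only the slice of state its role requires, which is the whole point of the axis: the boundaries are visible in the call graph.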
The strength of role-based design is that it reads easily. Draw the system diagram and the assignment of work is visible. Useful when explaining structure to non-technical stakeholders. The weakness is that role boundaries depend on task domain. Roles that decompose well for GUI tasks do not necessarily decompose the same way for document drafting. Domain change forces redesign.
Skill — same word, opposite directions
What makes this axis interesting is that two papers use the word “skill” to mean exactly opposite things.
WebXSkill — store capability outside
WebXSkill (arxiv 2604.13318) takes the action patterns a web agent repeats and stores them as reusable skill modules. Product search, add-to-cart, checkout. When a new task arrives, the agent pulls relevant skills from the library and composes them.
The essence is putting capability outside the agent. The agent stays light. Skills are managed as independent, versioned modules. Closer to the engineering analogy of function extraction.
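A sketch of the external direction, assuming a keyword-overlap retriever over a versioned skill registry. Skill names, version strings, and the ranking heuristic are illustrative assumptions, not WebXSkill's mechanism:

```python
# Hypothetical skill library: capability lives outside the agent as
# named, versioned modules, retrieved per task and composed.

SKILLS = {
    "product_search": {"version": "1.2", "keywords": {"search", "find"}},
    "add_to_cart":    {"version": "2.0", "keywords": {"cart", "add"}},
    "checkout":       {"version": "1.0", "keywords": {"checkout", "pay", "buy"}},
}

def retrieve(task: str, top_k: int = 3) -> list[str]:
    """Rank skills by keyword overlap with the task description."""
    words = set(task.lower().split())
    scored = [(len(meta["keywords"] & words), name)
              for name, meta in SKILLS.items()]
    return [name for score, name in sorted(scored, reverse=True)
            if score > 0][:top_k]

plan = retrieve("find a keyboard and add it to the cart")
```

The agent itself stays light; everything it "knows how to do" is an entry in the registry, which is what makes independent versioning possible.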
SKILL0 — absorb capability into context
SKILL0 (huggingface 2604.02268) goes the other way. Skills are not stored as separate modules. They are absorbed into the agent’s context through in-context agentic RL. A skill is not an independent object. It is part of the agent’s state.
The philosophy: storing skills externally creates call overhead and switching cost. If the context window is large enough, push them inside.
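The internal direction can be sketched the opposite way: no registry, no calls, just traces folded into the working context until a budget forces forgetting. The trace format and the character-count budget are assumptions standing in for SKILL0's in-context mechanism:

```python
# Hypothetical skill internalization: solved-task traces become part
# of the agent's context rather than external modules.

CONTEXT_BUDGET = 200  # characters; stands in for a token budget

def internalize(context: str, trace: str) -> str:
    """Append a solved-task trace to the working context, dropping the
    oldest trace lines when the budget is exceeded."""
    lines = (context + "\n" + trace).strip().splitlines()
    while sum(len(line) for line in lines) > CONTEXT_BUDGET:
        lines.pop(0)  # forget the oldest internalized skill first
    return "\n".join(lines)

ctx = ""
ctx = internalize(ctx, "skill: search -> type(query); press(enter)")
ctx = internalize(ctx, "skill: add_to_cart -> click(item); click(add)")
```

Note the contrast with the library sketch: a skill here is not an object with a version, it is state, and the only "management" operation is eviction when the context fills.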
Same time, opposite proposals
Two papers using the same word to argue exactly opposite directions. This is not just rivalry. With context windows expanding to one million tokens in 2026, what previously had to be split into modules can now live inside the context. The external argument sits on a 2024 assumption (small context). The internal argument sits on a 2026 assumption (large context). The fact that both papers appeared at the same time signals a paradigm transition in motion.
Time — slice by time
ALTK-Evolve (IBM Research, huggingface blog) covers this axis almost alone. The argument is straightforward. Drop the assumption that an agent’s learning ends at training time. Task distributions keep shifting after deployment.
ALTK-Evolve proposes a structure for agents to keep learning from signals accumulated in real user environments. It blurs the line between training and deployment along the time axis.
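The shape of that loop can be sketched as a signal buffer plus a periodic update. The signal format, the preference table, and the update rule are illustrative assumptions, not IBM's implementation:

```python
from collections import defaultdict

# Hypothetical time-axis learning: the agent collects signals from the
# live environment and periodically re-weights its strategies, so
# "training" continues after deployment.

class EvolvingAgent:
    def __init__(self):
        self.preference = defaultdict(float)  # strategy -> running score
        self.buffer = []                      # post-deployment signals

    def record(self, strategy: str, success: bool):
        """Collect one signal from the real user environment."""
        self.buffer.append((strategy, 1.0 if success else -1.0))

    def evolve(self, rate: float = 0.1):
        """Periodic update step: fold accumulated signals into state."""
        for strategy, reward in self.buffer:
            self.preference[strategy] += rate * reward
        self.buffer.clear()

agent = EvolvingAgent()
for ok in (True, True, False):
    agent.record("retry_with_backoff", ok)
agent.evolve()
```

The blurred line the paper argues for is visible in the structure itself: `record` runs in production, `evolve` is a training step, and both operate on the same object.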
Here the first observation arrives. Of the papers covered in this piece, only ALTK-Evolve takes time as its primary axis. The rest assume static systems. Fixed roles, fixed skills, fixed judges. But in production, the largest problem agents face is task distribution drift. User patterns shift, external tool APIs shift, business rules shift.
The asymmetry is clear. Research has not caught up with production. Studies still treat the moment-of-design as a snapshot. Operations already require continuous redesign. The empty space in agent orchestration research from late 2026 onward is almost certainly along this time axis.
Judge — slice the evaluation
AJ-Bench (huggingface 2604.18240) turned the agent-as-judge pattern into a benchmark. An agent’s output is evaluated not by a human but by another agent. Environment information is incorporated into the evaluation.
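A sketch of the pattern, assuming a single-rubric judge that checks the agent's claim against observed environment state. The rubric, the field names, and the grounding check are illustrative assumptions, not AJ-Bench's protocol:

```python
# Hypothetical agent-as-judge: a second agent scores the first agent's
# output, with environment state folded into the verdict rather than
# judging the text alone.

def judge(output: str, env: dict) -> dict:
    """Score an agent's answer against the observed environment."""
    claims_done = "done" in output.lower()
    actually_done = env.get("order_status") == "placed"
    return {
        "grounded": claims_done == actually_done,  # claim matches state
        "verdict": "pass" if claims_done and actually_done else "fail",
    }

report = judge("Task done: order placed.", {"order_status": "placed"})
```

The environment parameter is the point: a judge that only reads the output text would pass an agent that claims success without it, which is exactly the failure mode environment-aware evaluation targets.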
Why this axis appeared is straightforward. Once multi-agent systems became common, humans could no longer judge every turn. Evaluation has to be automated.
The deeper question AJ-Bench raises sits underneath. Does the evaluator have to be smarter than the evaluated? The implicit assumption of traditional benchmark design has been yes — humans evaluating AI is the prototype.
Agent-as-Judge drops that assumption. Same-tier agents evaluate each other, and with this comes gaming risk: the evaluated agent may learn the evaluator's preferences and shape its outputs to match.
This question does not get fully resolved here. What matters is that “who evaluates whom” is now itself an axis along which agents get sliced. That is the 2026 shift.
Planner-Executor — slice thinking from doing
Plan-and-Act (arxiv 2503.09572) is the most classical of the five axes. Separate thinking (planning) from doing (action). The planner builds a higher-level plan; the executor turns it into concrete actions. SOTA on long-horizon tasks like web navigation.
The logic is that thinking and doing are different capabilities, so they are more efficient handled with different models or prompts. A plan can be reused across many steps; an action is needed at every step.
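The asymmetry is easy to see in code: the planner runs once, the executor runs per step. The step names and the toy navigation state are illustrative assumptions, not Plan-and-Act's action space:

```python
# Hypothetical planner-executor split: thinking happens once up front,
# doing happens at every step against mutable state.

def planner(goal: str) -> list[str]:
    """Thinking: build the whole high-level plan in one call."""
    return ["open_site", f"search:{goal}", "open_first_result"]

def executor(step: str, state: dict) -> dict:
    """Doing: turn one abstract step into a concrete state change."""
    kind, _, arg = step.partition(":")
    state["history"] = state.get("history", []) + [kind]
    if kind == "search":
        state["query"] = arg
    return state

state: dict = {}
for step in planner("mechanical keyboard"):  # plan reused across steps
    state = executor(step, state)
```

Because the plan is computed once and iterated over, the expensive "thinking" call amortizes across the whole trajectory, which is the efficiency argument in one line.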
This axis collides head-on with the Advisor pattern from the part 3 anchor. Plan-and-Act has the planner go first, building everything up front. Anthropic Advisor has the executor go first and call an advisor only when needed. The two share the separation of thinking from doing, but the direction of authority is reversed. Through 2025, Plan-and-Act’s direction was the standard. From 2026, the Advisor reversal began. Future papers will compete on which of the two directions is more general.
Five axes at a glance
```mermaid
flowchart TB
    AGENT["A single agent"]
    AGENT --> R["Role: by role"]
    AGENT --> SE["Skill external: as modules"]
    AGENT --> SI["Skill internal: into context"]
    AGENT --> T["Time: along time"]
    AGENT --> J["Judge: by evaluation"]
    AGENT --> P["Planner-Executor: thinking and doing"]
    R --> EX1["Multi-role GUI Agents"]
    SE --> EX2["WebXSkill"]
    SI --> EX3["SKILL0"]
    T --> EX4["ALTK-Evolve"]
    J --> EX5["AJ-Bench"]
    P --> EX6["Plan-and-Act"]
```
The axes are not independent. A role-based system can have skill modules layered on. A planner-executor structure can have a judge stacked on top. Real production systems are hybrids of multiple axes.
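A hybrid can be sketched in a few lines: a planner-executor core with a judge stacked on top. All function bodies are toy stand-ins, showing only that the axes compose rather than compete:

```python
# Hypothetical axis composition: plan (thinking), execute (doing),
# then judge each result before accepting it.

def plan(goal: str) -> list[str]:
    return [f"do:{goal}"]

def execute(step: str) -> str:
    return step.replace("do:", "done:")

def judge(result: str) -> bool:
    return result.startswith("done:")

results = [execute(s) for s in plan("export report")]
approved = all(judge(r) for r in results)
```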
Three observations follow.
The same thing under different names
Multi-role GUI splits by roles. WebXSkill and SKILL0 split by skills. AJ-Bench splits by judges. Plan-and-Act splits by planner and executor.
Abstract these four words and they all point to the same thing. A unit of capability that has boundaries, can be invoked, and can be composed. Bounded, invokable, composable capability.
A role is a named capability unit. A skill is an invokable capability unit. A judge is a capability unit aimed at evaluation. Planner and executor are a specialized pair of roles. The vocabulary differs; the substance is the same.
The 2026 papers do not foreground this. The Multi-role GUI paper does not cite WebXSkill. AJ-Bench does not reference Plan-and-Act. Each subfield is reinventing the same abstraction independently.
For consulting work, this implies one move. Do not pass the paper vocabulary (role, skill, judge) directly to clients. These terms will converge in two to three years onto a single abstraction, and what survives is “capability unit.” In system design documents, write capability rather than role. The terminology will change. The design survives.
The time axis is mostly empty
As noted earlier, of the eleven papers in this series, ALTK-Evolve is the only one to take time as its subject.
Eleven to one. In production, orchestration drift, agent decay, and versioning are already top-priority problems. The gap suggests two possibilities.
One: the research community has not yet picked up production signals. There is a delay between industry and academia.
Two: time-axis problems do not translate well into papers. Static systems are easy to write up. Systems that drift over time are hard to evaluate.
Either way, time-axis papers are likely to surge from late 2026 through early 2027. The empty space is the time axis. From an AI strategy perspective, papers and products that fill this gap are the next investment area.
The decomposition unit count keeps rising
The last observation is a time series.
| Period | Decomposition unit |
|---|---|
| 2022–2023 | Single model (no decomposition) |
| 2023–2024 | Role-based multi-model (AutoGen, MetaGPT) |
| 2024–2025 | Role × Skill (skill libraries appear) |
| 2025–2026 | Role × Skill × Judge (AJ-Bench) |
| 2026~ | Role × Skill × Judge × Time (ALTK-Evolve) |
The unit count multiplies over time. What was a single agent in 2022 becomes, by 2026, something like 3 roles × 20 skills × 2 judges × continuous time. The combinatorial space has grown by hundreds of times.
This accumulation is not free. As axes pile up, the decisions required at initial design grow multiplicatively. Tracing which axis an error originated in becomes harder. Each axis demands its own evaluation framework.
If this trend continues through late 2026, a paradoxical situation can emerge: the cost of designing an agent system exceeds the cost of having a person do the task directly. Around this limit, a counter-trend will likely begin. Roles and skills get merged again. Judges get removed and the planner does its own evaluation.
If the reversal from the part 3 anchor is rearrangement across capability, evaluation, time, and location, this accumulation is where the fatigue from that rearrangement piles up. A 2027 watch point.
Closing
This piece is part 1 of the series. Six 2026 papers placed along five axes, with three observations layered on.
Six papers fill five axes. Role · Skill (external/internal) · Time · Judge · Planner-Executor. Role, Skill, and Judge are different names for the same capability-unit abstraction. The time axis covers only one of eleven papers, and that gap will define the direction of research over the next twelve months. The decomposition unit count is rising fast enough that complexity is accumulating, and at some point the counter-trend will start.
Once you decide what to slice, how to coordinate is largely determined.
Part 2 covers how the sliced units get organized. Hierarchical MAS, DAG, swarm (and its skepticism), MoE-routing. As decomposition has diversified, so has organization. Part of it is already drawing skeptical eyes.
For the part 3 anchor’s claim that all of this is reversal, part 1 provides the first piece of evidence. The decomposition axes themselves are being rearranged in 2026. Skill is moving from external to internal; evaluation from human to agent. Read with part 2, the full landscape of reversal becomes visible.
Series
- Part 1 (this piece): How to slice an agent — five axes from 2026 research
- Part 2 (forthcoming): How to organize agents — Hierarchy, Graph, Swarm, Routing, and skepticism
- Part 3 (anchor): The Advisor Pattern Is a Price Tag, Not Architecture
References
- Multi-role GUI Agents Orchestration. arxiv 2604.13488 (2026-04)
- WebXSkill: Skill Learning for Autonomous Web Agents. arxiv 2604.13318 (2026-04)
- SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization. huggingface 2604.02268 (2026-04)
- ALTK-Evolve: On-the-Job Learning for AI Agents. huggingface blog (2026-04)
- AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation. huggingface 2604.18240 (2026-04)
- Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks. arxiv 2503.09572 (2025-03)