Serhii Zabolotnii
Tags: AI, LLM, architecture, agents, delegation, schema, RAG, composition, orchestration

When a Single Agent Hits Its Limits: Ayona/OpenClaw's Shift from Orchestration to Composition

Schema governance, API-first vs CLI-first subagent types, AutoResearch protocol, design-before-code gates, and composable agent franchising. Part 3.

There comes a moment in the life of every AI system when a single agent stops being enough. Not because it’s bad — but because real work demands specialization. You need a research agent capable of planning searches and synthesizing results. You need an artifact generator that works with the file system autonomously. You need specialized agents for different knowledge domains with different access levels. And right here you discover: scaling agency is not “just spin up another process.” It’s an entirely different problem.

Ayona/OpenClaw is an AI-native system for research and operational tasks that I’ve been developing for several months. In previous articles, I described why architecture matters more than model strength and how trust emerges from a pipeline, not from a single component. But both articles described a single execution loop — one agent. Now it was time to take the next step: from orchestration to composition.


From One Agent to Many: Why It’s a Different Problem

When a single agent is running, the main question is orchestration: in what order to execute steps, what to feed as input, how to verify output. It’s complex, but it’s linear complexity.

When there are multiple agents, governance emerges — a set of questions that simply don’t exist for a single agent:

Context isolation. A research agent shouldn’t see restricted data from the legal cluster. A teaching bot shouldn’t have access to internal operational notes. If isolation isn’t defined architecturally, it won’t emerge on its own. We already learned this lesson: in the second article, scoped access appeared precisely because a teaching bot accidentally received research data — nobody had configured filtering.

Model selection. For a single agent, model routing is a table: this task type → this model. For multiple agents, an additional dimension appears: the subagent type determines not just the model, but the execution pattern. An autonomous artifact generator and a reasoning agent with checkpoints require fundamentally different orchestration approaches.

Completion trust. Delegation v1.1 taught us that empty output is not success. But when there are multiple subagents working asynchronously, completion verification becomes a systemic task, not a one-off check.

Scope enforcement. Who decides what a new agent can access? Who determines which actions are allowed? If the answer is “whoever wrote the prompt” — that’s not governance, that’s hope.

The naive solution — “just launch a subprocess with a prompt” — ignores all these questions. It works for demos but breaks in production. So we needed a framework, not a script.

And the first step toward that framework wasn’t a new feature — it was discipline.


Schema as Governance: Ontology SSoT

Previous articles described the knowledge graph as a structure of decisions. But one question remained open: who determines what’s valid in that structure? Before building a subagent framework or AutoResearch, we had to answer exactly this — otherwise every new component would add entropy instead of order.

The answer: config/ontology.yaml as Single Source of Truth.

Before formalizing the ontology, the graph lived in “everyone writes however they see fit” mode. Node types, clusters, relations, tags — all of this existed in Python code, in document frontmatter, in generation scripts. Nothing was validated systematically. The result? When we ran a graph audit on 176 nodes, it found 75 schema errors and 367 warnings. 20 cluster variations instead of 10 canonical ones. 21 relation types instead of 10. 349 unique tags without any taxonomy. 50% of nodes missing summaries.

Initial health score: 0%.

This illustrates an important law: without enforcement, schema is decoration. You can have the most beautiful ontology in the world, but if it’s not validated at creation and not checked at commit, it degrades invisibly.

Now ontology.yaml defines:

  • 9 node types with aliases for legacy names (knowledge, insight, task, project, direction, process, decision, person, longterm)
  • 10 clusters with hierarchy up to 3 levels (ayona_ops → system_integration, clients → agentic_streamline, healthprecision, research, teaching, supreme_court, archive, ndi_zsu)
  • 10 relation types (IMPLEMENTS, EXTENDS, BUILDS_ON, PART_OF, DEPENDS_ON, COMPLEMENTS, VALIDATES, SUPERSEDES, CONTRADICTS, RELATED_TO)
  • Sensitivity levels (public, internal, restricted)
  • Agent scopes — access mapping per agent
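To make the shape of this concrete, here is a hypothetical fragment of what config/ontology.yaml could look like. The key names and values below are illustrative assumptions, not the actual schema of the Ayona/OpenClaw repository:

```yaml
# Illustrative sketch, not the real config/ontology.yaml.
node_types:
  decision:
    aliases: [choice]          # legacy name, normalized on load
    required_fields: [summary, cluster, sensitivity]
relations:
  - IMPLEMENTS
  - SUPERSEDES
  - RELATED_TO
sensitivity_levels: [public, internal, restricted]
agent_scopes:
  teaching_bot:
    clusters: [supreme_court, teaching]
    max_sensitivity: internal
```

The point of a single file like this is that every downstream tool reads the same vocabulary, so there is exactly one place where "valid" is defined.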

All scripts — graph generation, frontmatter validation, graph-writer skill, audit — automatically load the ontology through ontology_loader.py. This means changes to a single YAML file cascade across the entire pipeline. Adding a new node type means changing one line in ontology.yaml, not hunting through code.

A pre-commit hook in the pipeline automatically checks: does every node conform to the closed type vocabulary? Is every cluster valid? Are there any “creative” relations that don’t exist in the ontology?
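The core of such a check fits in a few lines. The sketch below inlines the ontology as a dict for illustration; in the real pipeline it would come from loading config/ontology.yaml, and the function names are assumptions rather than the actual ontology_loader.py API:

```python
# Sketch of the pre-commit validation pass. ONTOLOGY is inlined here for
# illustration; in production it would be loaded from config/ontology.yaml.
ONTOLOGY = {
    "node_types": {"knowledge", "insight", "task", "project", "direction",
                   "process", "decision", "person", "longterm"},
    "clusters": {"ayona_ops", "clients", "healthprecision", "research",
                 "teaching", "supreme_court", "archive", "ndi_zsu"},
    "relations": {"IMPLEMENTS", "EXTENDS", "BUILDS_ON", "PART_OF",
                  "DEPENDS_ON", "COMPLEMENTS", "VALIDATES", "SUPERSEDES",
                  "CONTRADICTS", "RELATED_TO"},
}

def validate_node(frontmatter: dict) -> list[str]:
    """Return schema errors for one node; an empty list means it passes."""
    errors = []
    if frontmatter.get("type") not in ONTOLOGY["node_types"]:
        errors.append(f"unknown node type: {frontmatter.get('type')}")
    if frontmatter.get("cluster") not in ONTOLOGY["clusters"]:
        errors.append(f"unknown cluster: {frontmatter.get('cluster')}")
    for rel, _target in frontmatter.get("relations", []):
        if rel not in ONTOLOGY["relations"]:
            errors.append(f"'creative' relation not in ontology: {rel}")
    return errors

# A node with a made-up relation type fails the closed-vocabulary check:
bad = {"type": "insight", "cluster": "research",
       "relations": [("INSPIRED_BY", "some-node")]}
assert validate_node(bad) == ["'creative' relation not in ontology: INSPIRED_BY"]
```

The hook simply runs this over every staged node and blocks the commit on any non-empty error list.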

Fig. 1. Ontology SSoT: one YAML file cascades across the entire pipeline. Without enforcement (red path) — 0% health on 176 nodes.

This is schema as governance: the knowledge structure is controlled not by conventions or developer memory, but by automated validation. And it’s precisely this discipline that made it possible to build everything that followed — the subagent framework, AutoResearch, agent franchising — on a stable foundation rather than on conventions that collapse under load.


Two Types of Subagents: API-first and CLI-first

With a stable knowledge schema in place, we could move to the next question: how exactly to organize task execution across multiple agents?

One of the key decisions looks simple but has far-reaching consequences: determine the subagent type BEFORE implementation. Never mix patterns.

This distinction grew from practice, not theory. We noticed that subagent tasks naturally fall into two classes requiring fundamentally different control mechanisms.

An API-first subagent is needed when the main agent must be present between steps: making decisions, validating intermediate results, directing the next step. These are reasoning-intensive tasks with checkpoints and quality gates. The subagent generates not a final artifact, but a set of prompts and data that the main agent executes through LLM calls. Control stays with the orchestrator.

A CLI-first subagent is needed when the task is bounded and autonomous: generate a presentation, run a code review, perform a deployment check. The subagent launches as an independent process with its own skill directory, works to completion, returns a result. The main agent doesn’t intervene in intermediate steps — it only receives the outcome.

The quick routing test is simple: does the main agent need to be between the subagent’s steps to make a decision or validate? If yes — API-first. If no — CLI-first.
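The routing test is so mechanical that it can be encoded as a one-line check. This is a hypothetical helper for illustration, not project code:

```python
# Hypothetical router encoding the type question from the text, so it must
# be answered before implementation rather than discovered during debugging.
def subagent_type(needs_orchestrator_between_steps: bool) -> str:
    """API-first keeps the main agent in the loop; CLI-first runs to completion."""
    return "api_first" if needs_orchestrator_between_steps else "cli_first"

assert subagent_type(True) == "api_first"    # research with quality gates
assert subagent_type(False) == "cli_first"   # generate a deck from a ready brief
```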

The main anti-pattern is hybrid. We learned this in practice when we tried giving a CLI agent for presentation generation an intermediate checkpoint: let the main agent verify the outline before the agent generates slides. Sounded reasonable. In practice — the CLI process had no mechanism to “stop and wait for a decision from above,” so we started layering ad-hoc signaling through the file system, then a polling loop, then timeout handling. Within two days, the prototype became a brittle Rube Goldberg machine that broke on every edge case. The correct answer turned out to be simple: these are two different subagents. A CLI agent generates a presentation from a ready brief. If the brief needs to be verified first — that’s a separate API-first step before it.

In practice, this decision saves enormous amounts of debugging time. When the type is determined upfront, the execution architecture is unambiguous. No “maybe we should add another step from outside” or “let’s give the CLI agent state between steps.” Pattern purity isn’t pedantry — it’s protection against growing complexity.

Fig. 2. API-first vs CLI-first: the key difference is where the main agent sits — between steps (left) or only at boundaries (right).


AutoResearch as an API-first Subagent Model

AutoResearch is the first complete API-first subagent in the system, and its design became the template for all subsequent ones.

Two intellectual sources shaped the architecture. Andrej Karpathy proposed the idea of autoresearch as a universal skill — an automated research loop. One implementation of this concept (the balukosuri/Andrej-Karpathy-s-Autoresearch-As-a-Universal-Skill repository) gave us not code to copy, but a research-grade specification: 5-phase structure, binary evaluation, sacred validation set, mutation operators. Rinat Abdullin developed the Schema-Guided Reasoning (SGR) approach — reasoning through a predefined structural schema instead of a free prompt — which we had already used in LLM StructCore (CL4Health) and knew from production (sgr-agent-core by vamplabAI).

What changed during adaptation to our stack? Karpathy’s loop is prompt optimization for a single agent. We needed a multi-step research pipeline for a multi-agent environment where the main agent controls each step. SGR in its pure form is a reasoning framework. We needed a structured output pipeline with quality gates before writing to the knowledge graph. So we took loop discipline from Karpathy and schema-guided structure from Abdullin, but rebuilt them around a different principle: the subagent doesn’t make final decisions — it generates data and prompts for the orchestrator.

The protocol consists of three steps.

Step 1 — Plan. The main agent passes a research query. The subagent returns not a result, but a PlanRequest: system prompt and user prompt for an LLM call that should create a structured research plan. The main agent executes this call and receives JSON with a brief and search plan.

Step 2 — Gather. The main agent passes the JSON from step one. The subagent parses the plan, launches data collection from five providers — keyword search on the knowledge graph, BM25 full-text, web search via Tavily API, vector search via E5 embeddings, Notion databases — and returns a GatherResult: prompts for a synthesis LLM call, source statistics, and a checkpoint.

Step 3 — Finalize. The main agent passes the synthesis JSON. The subagent validates, assesses quality, generates knowledge cards, and if quality passes the gate, writes them to the graph.

The key detail: the main agent makes decisions between steps. It can stop the research after step one if the plan is unsatisfactory. It can edit the brief before collection. It sees source statistics and can decide that more web results are needed. It sees the quality score and can reject writing to the graph.

This is exactly why it’s API-first: the subagent doesn’t make final decisions itself. It generates data and prompts, and decisions stay with the orchestrator.
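The three steps above can be sketched as an orchestrator loop. Function and field names (plan, gather, finalize, the "system"/"user" keys) are illustrative assumptions, not the real AutoResearch API; what matters is the shape: the subagent returns prompts, the orchestrator calls the LLM and decides at each gate.

```python
# Minimal sketch of the 3-step AutoResearch protocol, with a stub subagent
# standing in for the real one. Names are illustrative, not the actual API.
def run_autoresearch(query, subagent, call_llm, gate):
    plan_req = subagent.plan(query)                 # Step 1: PlanRequest
    plan = call_llm(plan_req["system"], plan_req["user"])
    if not gate("plan", plan):                      # orchestrator may stop here
        return None
    gathered = subagent.gather(plan)                # Step 2: GatherResult
    if not gate("sources", gathered["stats"]):      # e.g. demand more web hits
        return None
    synthesis = call_llm(gathered["system"], gathered["user"])
    return subagent.finalize(synthesis)             # Step 3: write iff gate passes

class StubSubagent:
    def plan(self, q): return {"system": "planner", "user": q}
    def gather(self, p): return {"system": "synth", "user": p, "stats": {"web": 3}}
    def finalize(self, s): return {"written": True, "synthesis": s}

result = run_autoresearch("autosearch: RAG evals", StubSubagent(),
                          call_llm=lambda sys, usr: f"{sys}:{usr}",
                          gate=lambda stage, payload: True)
assert result["written"] is True
```

Note that a `gate` returning False at any point aborts the run with nothing written: that is the orchestrator keeping final decisions.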

Fig. 3. AutoResearch 3-step API: synthesis of Karpathy’s loop discipline + Abdullin’s SGR. Gates (G) = orchestrator decision points.

A separate discipline is the explicit trigger. AutoResearch doesn’t activate on any research query. Soft triggers (“research,” “analyze,” “compare”) require user confirmation. Only the keyword autosearch launches the full flow without an additional prompt. This isn’t a UX decision — it’s a safety decision: an autonomous research agent with five providers shouldn’t start accidentally.
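The trigger rule reduces to a small classifier. This sketch is a hypothetical encoding of the rule as stated, not the system's actual trigger code:

```python
# Hypothetical trigger routing per the safety rule above: soft verbs need
# user confirmation; only the explicit keyword launches the full flow.
SOFT_TRIGGERS = {"research", "analyze", "compare"}

def route_trigger(message: str) -> str:
    words = set(message.lower().split())
    if "autosearch" in words:
        return "launch"                  # full flow, no extra prompt
    if SOFT_TRIGGERS & words:
        return "confirm_with_user"       # ask before spinning up five providers
    return "normal_chat"

assert route_trigger("autosearch latest SGR papers") == "launch"
assert route_trigger("please research vector DBs") == "confirm_with_user"
assert route_trigger("hello") == "normal_chat"
```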


Design-before-code: A Quality Barrier Before Implementation

As the number of components and subagents grew, another problem appeared: too many new features started with code instead of design. The result — rework, inconsistency with existing patterns, missed edge cases that could have been caught at the design stage.

The answer is the design-architect skill, which formalizes design-before-code as a mandatory discipline.

The protocol starts with classification. Every non-trivial task is routed to one of four tracks:

| Track | When | Artifact |
| --- | --- | --- |
| ExecPlan | >4 hours, multi-module | Living PLANS.md with progress tracking |
| RFC | Cross-project, infra, new API | Structured RFC in 99_process/ |
| Runbook | Bounded task for subagent | Runbook with delegation envelope |
| Brief | <2 hours, moderate complexity | Inline brief (not persisted) |

After classification — scaffolding of mandatory sections: problem statement, scope (in/out), definition of done, safety/rollback.

Then — a hard gate. The design artifact is presented to the user, and without explicit confirmation (“proceed,” “greenlight,” “approved”), implementation doesn’t begin. The agent cannot approve itself.
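Classification plus the hard gate can be sketched in a few lines. Field names and thresholds below follow the table above, but the function itself is an illustrative assumption, not the design-architect skill's real code:

```python
# Illustrative track router and approval gate for design-before-code.
def classify_track(est_hours: float, cross_project: bool, for_subagent: bool) -> str:
    if cross_project:
        return "RFC"          # structured RFC in 99_process/
    if for_subagent:
        return "Runbook"      # runbook with delegation envelope
    if est_hours > 4:
        return "ExecPlan"     # living PLANS.md with progress tracking
    return "Brief"            # inline, not persisted

APPROVALS = {"proceed", "greenlight", "approved"}

def may_implement(user_reply: str) -> bool:
    """Hard gate: only an explicit human keyword unlocks implementation."""
    return user_reply.strip().lower() in APPROVALS

assert classify_track(6, False, False) == "ExecPlan"
assert classify_track(1, False, True) == "Runbook"
assert may_implement("greenlight") and not may_implement("looks fine I guess")
```

The important property is that `may_implement` takes the user's reply, not the agent's self-assessment: the agent cannot approve itself.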

For new code components of existing skills, another layer is added: a contract-first protocol. First design draft, then schema (dataclasses, types, API contract), then eval criteria, and only then — code. This order isn’t accidental: it guarantees that code is written to specification, not the other way around.

Practical effect: the number of “serious reworks due to missed requirements” dropped dramatically. Not because design magically solves all problems, but because it forces thinking before doing.


Composability: Template, Franchising, Remote Execution

The most important architectural shift of this stage was the transition from a working system to a composable system — one whose components can be recombined and deployed independently. Three decisions define this transition.

Public template. The Ayona/OpenClaw workspace was sanitized and published as a replicable template. This isn’t a fork or documentation — it’s a full workspace with its own knowledge graph pipeline (markdown cards → JSON → D3.js), 10 cross-linked knowledge cards, an interactive setup script, a pre-commit hook with 9-phase validation, parameterized agent identity, and GitHub Actions CI.

The key design principle: the template has a sync mechanism with the main repository. Changes in architectural patterns cascade, but knowledge content stays isolated. This isn’t just “copy and live your own life.” It’s a composable building block with guaranteed architectural compliance.

Agent franchising. Tag-based graph projection allows projecting a subset of the knowledge graph onto an isolated subagent workspace. Bidirectional sync means that updates in the franchised space return to the main graph, passing through ontology validation. Agent scopes from ontology.yaml determine what exactly each franchised agent can see and modify.

This solves a problem that inevitably arises during scaling: how to give a specialized agent access only to its domain without duplicating knowledge or breaking graph integrity. A concrete example: a teaching bot for the supreme_court cluster sees only nodes with the corresponding scope — dozens of cards out of hundreds in the total graph. When it creates a new card, it passes through ontology validation and appears in the main graph already consistent. Without franchising, this agent would either see everything (isolation violation) or work on a copy that diverges within a week.
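The projection itself is a filter over the graph. The node shape below is a simplification I am assuming for illustration; the real system projects full knowledge cards:

```python
# Sketch of tag-based graph projection for agent franchising: the franchised
# workspace receives only nodes whose scope the agent holds and whose
# sensitivity it is cleared for.
SENSITIVITY_RANK = {"public": 0, "internal": 1, "restricted": 2}

def project_graph(nodes, agent_scopes, max_sensitivity):
    ceiling = SENSITIVITY_RANK[max_sensitivity]
    return [n for n in nodes
            if n["scope"] in agent_scopes
            and SENSITIVITY_RANK[n["sensitivity"]] <= ceiling]

graph = [
    {"id": "n1", "scope": "supreme_court", "sensitivity": "internal"},
    {"id": "n2", "scope": "research",      "sensitivity": "public"},
    {"id": "n3", "scope": "supreme_court", "sensitivity": "restricted"},
]
# The teaching bot sees its cluster only, and nothing restricted:
visible = project_graph(graph, {"supreme_court"}, "internal")
assert [n["id"] for n in visible] == ["n1"]
```

The return path (bidirectional sync) runs new cards through the same ontology validation before they land in the main graph.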

SubagentExecutor bridge to VPS. CLI-first subagents need infrastructure for execution. SubagentExecutor built a bridge between local orchestration and remote execution: a subagent can launch not only as a local subprocess but also via SSH on a VPS running a claude-agent wrapper with OAuth authorization. The main agent forms a delegation envelope, the executor sends it to the VPS, Claude Code on the remote machine executes the task with access to skill directories, and the result returns through system event notification.

This solves a practical problem: not all tasks can be executed locally. Artifact generation, working with large files, accessing production infrastructure — all of this requires remote execution. SubagentExecutor turns a single machine into a distributed agent runtime without losing the audit trail or completion verification.
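At its core the bridge serializes an envelope and ships it over SSH. The command shape and the claude-agent wrapper invocation below are assumptions for illustration, not the real SubagentExecutor interface:

```python
import json
import shlex

# Hypothetical sketch of the SubagentExecutor bridge: serialize the delegation
# envelope and build the remote invocation. Flag names are illustrative.
def build_remote_command(host: str, envelope: dict) -> list[str]:
    payload = shlex.quote(json.dumps(envelope))
    return ["ssh", host, f"claude-agent --envelope {payload}"]

cmd = build_remote_command("vps-01", {"objective": "generate deck"})
assert cmd[0] == "ssh" and cmd[1] == "vps-01"
assert "claude-agent" in cmd[2]
# In the real executor this would be handed to subprocess.run(...), with the
# result coming back through system event notification.
```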

Fig. 4. Composition layer: three mechanisms unified by shared ontology governance. Each can be used independently.


Lessons and Failure Modes

In the first two articles, I described failure modes of a single agent: brittle memory, bloated context, deceptive completion semantics. When scaling to multiple agents, these problems don’t disappear — they mutate.

Delegation without an envelope: from “try something” to “N agents trying something simultaneously.” In delegation v1.1, we taught a single agent not to lie about completion. But when there are five delegations in parallel, the problem multiplies nonlinearly: each needs its own scope, its own model routing, its own verification plan. Now every delegation is a mandatory envelope: objective, scope (in/out), runbook path, model selection with rationale, context route, expected artifacts, verification, stop points. The subagent returns a completion contract: what was done, where artifacts are, proof of execution, what wasn’t done, escalation risks.
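The envelope and completion contract described above can be sketched as data structures. Field names here are assumptions mirroring the list in the text, not the system's exact schema:

```python
from dataclasses import dataclass, field

# Illustrative shape of the delegation envelope and completion contract.
@dataclass
class DelegationEnvelope:
    objective: str
    scope_in: list
    scope_out: list
    runbook_path: str
    model: str
    model_rationale: str
    expected_artifacts: list
    verification_plan: str
    stop_points: list = field(default_factory=list)

@dataclass
class CompletionContract:
    done: list               # what was actually done
    artifact_paths: list     # where the artifacts live
    proof: str               # evidence of execution, not just "ok"
    not_done: list           # explicitly declared gaps
    escalation_risks: list

def verify(contract: CompletionContract) -> bool:
    """Empty output is not success: require artifacts and proof."""
    return bool(contract.artifact_paths) and bool(contract.proof)

assert not verify(CompletionContract([], [], "", [], []))
```

Making the contract a typed structure is what turns completion verification into a systemic check rather than a one-off prompt.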

Schema without enforcement: degradation nobody noticed. We had an informal ontology from day one — and believed it was working. The graph audit said otherwise (numbers above). Only when the schema became machine-enforced (Pydantic validation, pre-commit hooks, ontology_loader in all scripts) did it stop being decoration and become governance.

Composability without governance: replicating problems, not solutions. The first attempt to “just copy the workspace” for another project led to pattern divergence within a week. Template sync and ontology validation solve this — but only together.

More agents without routing discipline: chaos grows exponentially. Without the API-first/CLI-first distinction, we ended up with subagents that “could do everything but nothing well.” Strict routing isn’t a constraint — it’s a condition for quality.

Design-before-code doesn’t work as a “recommendation.” Only as a hard gate with explicit human approval. Without a forced stop before implementation, the agent (like a human) always finds a reason to “just quickly write it.” It always costs more.

Remote execution ≠ “cron + Telegram bot.” There’s a temptation to solve the remote execution problem the simple way: a cron job that runs a script on a VPS and sends the result to Telegram. This works for routine tasks. But when a subagent needs access to skill directories, a delegation envelope, OAuth authorization, and completion verification — a simple cron becomes an unreliable chain of ad-hoc solutions. SubagentExecutor solves this as an architectural component, not a DevOps hack.


From “AI That Works” to “AI That Replicates”

Looking at the trajectory across three articles, a clear evolution emerges.

The first article answered the question: why is architecture needed? Because without it, a model — no matter how powerful — cannot work repeatably.

The second article answered the question: how does trust emerge? Not from a single component, but from a pipeline where each layer verifies the previous one.

This article answers the next question: how to scale without scaling chaos?

The answer, in my view, comes down to five principles:

  1. Type discipline. Determine the execution pattern (API-first or CLI-first) before implementation. Never mix.
  2. Schema governance. If the knowledge structure isn’t validated automatically, it degrades. Ontology as SSoT with machine enforcement.
  3. Design-before-code. A hard gate before implementation with explicit human approval. Not a recommendation — enforcement.
  4. Bounded delegation. Every delegation is an envelope with objective, scope, routing, artifacts, verification, stop points.
  5. Composition over accumulation. Not “add another feature,” but “create a building block that can be reused with guarantees.”

The maturity of an AI system is measured not by the number of features or connected models. It’s measured by how reliably you can create a new specialized agent, give it scope, connect it to the knowledge graph, ensure delegation with verification — and be confident it won’t break what already works.

This is the transition from orchestration to composition. Not “one agent that does everything,” but an architecture that allows multiplying agents with discipline.

And perhaps this is the next maturity frontier for AI systems in general: not how powerful a single agent is, but how predictably the system creates and controls new ones.


Serhii Zabolotnii — DSc, NLP/LLM Researcher, Professor, AI Systems Architect. Building Ayona — an AI-native research and operations system.