“Any sufficiently advanced access control system is indistinguishable from a trust system.” — after Arthur C. Clarke
In Part 1, I described seven architectural layers that turn a chatbot into a working system: memory, knowledge graph, retrieval, delegation, model routing, prompt cache, verification. At the time, it was architecture on paper — a sequential exposition of principles with code examples.
Over the following two weeks, those layers became working code. Not everything went as planned.
What Was Built
- NotebookLM integration — automated ingest of notebooks into a unified knowledge index via extraction → chunking → embedding pipeline
- FAISS + hybrid retrieval — full E5 embedding pipeline, BM25 lexical search, combined via Reciprocal Rank Fusion
- Scoped agent access — two subagents for Telegram group chats (research and teaching) on a shared knowledge base with isolated scopes and taint policy; launched as a pilot with colleagues
- Delegation v1.1 — hardened specification with truthful completion semantics and explicit success contracts
- repo-task-proof-loop — skill for automated verification of subagent work in a sandbox (Docker, no network)
- Retrieval metadata layer — retrieval_hints, source_kind, sensitivity on every knowledge graph node
- Cross-cluster bridges — automatic discovery of semantic connections between isolated clusters
- Agent runbook suite — six operational runbooks for task classification and model routing
What Broke and Surprised Us
Four things stood out:
“The smartest” model isn’t always the best for the task. GPT-5.4 and Gemini 3.1 Pro led us down long dead ends on certain tasks: the models generated verbose explanations, rephrased the problem statement, proposed partial solutions — looping without reaching a result. This was especially visible when the task involved modifying OpenClaw JSON configs autonomously: the model would make incomplete edits without validating the result. Switching to Anthropic Claude Opus or even Sonnet 4.6 — combined with a custom openclaw-docs-expert skill that loads accurate documentation as context — almost immediately produced a concrete solution. Not a verdict on the other models, but a practical confirmation that model routing by task type, combined with the right context, is a real architectural decision, not a theory.
Semantic search alone returned “plausible but wrong” results. A query about the Pan-Tompkins algorithm for ECG R-peak detection returned documents about “signal processing” in general — semantically close, practically useless.
Multi-agent access without scope control leaked context. The teaching subagent was receiving fragments from research papers unrelated to the course. Not because someone misconfigured it — but because nobody configured isolation at all.
Delegation that reports success on empty output is worse than open failure. A subagent received a model provider error, returned (no output), and the parent orchestrator reported: “task completed successfully.”
All four observations are different manifestations of the same deeper theme: trust is not a property of a component. It is a property of the pipeline. Search can be accurate, but if it shows the wrong data to the wrong agent — you don’t trust the system. Delegation can be formally correct, but if empty output means “success” — you don’t trust the results.
This part is about how we built trust — layer by layer, from retrieval to verification.
Hybrid Retrieval: Why Semantic Search Is Only Half the Answer
The first version of search in Ayona was straightforward: text → E5 embedding → FAISS index → cosine similarity → top-K results. The classic scheme from any RAG tutorial.
The problem became visible on real queries. Pure vector search handles domain-specific terminology poorly. “Pan-Tompkins algorithm” names a specific algorithm for detecting R-peaks in ECG signals, but semantic search saw only “signal processing” in that query and returned general documents about the topic. Semantically close, practically useless.
Another class of problems: bilingual queries. The corpus contains documents in both Ukrainian and English, and a Ukrainian query would sometimes miss a relevant English document even when the key term was identical in both languages.
Solution: BM25 + E5 + Reciprocal Rank Fusion
Instead of improving a single search model, we added a second one — and combined their results.
BM25 is a classical lexical search based on TF-IDF. It doesn’t “understand” semantics, but it excels at finding exact term matches. If the query contains “Pan-Tompkins” and the document contains “Pan-Tompkins” — BM25 finds it, even if vector search missed it.
E5 is a semantic embedding that finds paraphrases, synonyms, and topical similarity well, but can miss on exact terms.
Reciprocal Rank Fusion (RRF) is an algorithm that merges two ranked lists into one without requiring score calibration. The formula is simple: for each document in both lists, compute 1 / (K + rank), where K is a constant (we use 60). The higher a document ranks in both lists, the higher its combined RRF score.
The core implementation — twelve lines:
from collections import defaultdict

def rrf_fuse(bm25_results, vector_results, k=60):
    """Fuse two ranked lists using Reciprocal Rank Fusion."""
    fused = defaultdict(float)
    for rank, (score, idx) in enumerate(bm25_results):
        fused[idx] += 1.0 / (k + rank + 1)
    for rank, (score, idx) in enumerate(vector_results):
        fused[idx] += 1.0 / (k + rank + 1)
    ranked = sorted(fused.items(), key=lambda x: -x[1])
    return ranked
BM25 places Pan-Tompkins at #1 via exact term match. E5 buries it at #4 among semantically similar documents. RRF fusion brings it back to #1 with a combined score.
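That interplay is easy to reproduce in a standalone demo. The document ids below are hypothetical placeholders; the fusion logic mirrors rrf_fuse above, simplified to plain id lists:

```python
from collections import defaultdict

def rrf(bm25_ranked, vector_ranked, k=60):
    """Reciprocal Rank Fusion over two ranked lists of document ids."""
    fused = defaultdict(float)
    for ranked in (bm25_ranked, vector_ranked):
        for rank, doc in enumerate(ranked):
            fused[doc] += 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical ids: BM25 puts the exact-term match first,
# while the vector index buries it at #4 among topical neighbours.
bm25 = ["pan_tompkins", "qrs_review", "ecg_basics"]
vector = ["dsp_overview", "filtering", "wavelets", "pan_tompkins"]
fused = rrf(bm25, vector)  # pan_tompkins wins: it appears in both lists
```

A document present in both lists accumulates two contributions, which is exactly what lifts the exact-term match back over documents that only one index liked.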
Local embeddings-e5 container
Instead of paying for every embedding request to an external API, we deployed our own embeddings-e5 Docker container on the VPS — a FastAPI service built on sentence-transformers, compatible with the OpenAI /v1/embeddings API. This reduced the cost of indexing to essentially zero.
But a problem emerged immediately: the container pinned the CPU at 100% even on small batches. The fix was choosing intfloat/multilingual-e5-small: a compact model variant (384-dimensional embeddings vs. 768 in the base variant) with a much smaller footprint under CPU-only inference — and sufficient retrieval quality on our corpus. After the switch, VPS CPU load dropped to acceptable levels.
# 02_distill/deploy/embeddings-e5/docker-compose.yml
services:
  embeddings:
    image: python:3.11-slim
    container_name: embeddings-e5
    environment:
      MODEL_NAME: intfloat/multilingual-e5-small  # small, not base
    ports:
      - "127.0.0.1:8000:8000"  # localhost only
    networks:
      - openclaw-net
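A client talks to this container the same way it would talk to the OpenAI embeddings endpoint. One detail worth showing: E5 models expect a role prefix on every input — "query: " for the search string, "passage: " for chunks being indexed. A sketch of building the request body (e5_payload is a hypothetical helper, not our production code):

```python
import json

def e5_payload(texts, kind="passage", model="intfloat/multilingual-e5-small"):
    """Build an OpenAI-compatible /v1/embeddings request body.

    E5 requires a role prefix on each input: "query: " or "passage: ".
    """
    return json.dumps({"model": model,
                       "input": [f"{kind}: {t}" for t in texts]})

body = e5_payload(["Pan-Tompkins detects QRS complexes"])
# POST to http://127.0.0.1:8000/v1/embeddings — the localhost-only port above
```

Forgetting the prefix silently degrades E5 retrieval quality, which makes it a good thing to centralize in one helper.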
Indexing pipeline:
NotebookLM extraction → chunking (3000 chars, overlap 500)
→ E5 batch embeddings (8-item batches)
→ FAISS IndexFlatIP + BM25 index
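The chunking stage can be sketched as a sliding window with the parameters from the pipeline above (3000-character chunks, 500-character overlap); chunk is an illustrative helper, not the production code:

```python
def chunk(text, size=3000, overlap=500):
    """Sliding-window chunking: fixed windows, fixed overlap between neighbours."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

parts = chunk("a" * 5000)
# two chunks: 3000 and 2500 chars; the last 500 chars of the first
# chunk repeat as the first 500 chars of the second
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from at least one chunk in full.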
Query pipeline:
query → BM25.search(top_k=20) + vector_search(top_k=20) → RRF fusion → top-K results
Each result retains both scores for transparency:
#1 [RRF:0.0323] Pan_Tompkins_ECG_detection.txt
BM25:8.42 | Vec:0.7891
"The Pan-Tompkins algorithm detects QRS complexes in ECG signals..."
Why IndexFlatIP and not IndexIVFFlat?
The corpus is under 10,000 fragments. A flat index (exact search) at this scale runs in milliseconds. IVF would add configuration complexity (nlist, nprobe) and a training step — with no practical benefit. Migrating to IVF is worth revisiting once the corpus exceeds 100k fragments. Not before.
Failure mode: the K parameter
K in RRF is a “position dampener.” A low K (e.g. 1) overly amplifies the difference between first and second place in a ranked list. A high K (e.g. 1000) flattens all positions and makes fusion meaningless. K=60 is the empirical value from the original Cormack et al. paper, which works well for most corpora. We tested on 10 test queries (in Ukrainian and English) and found no reason to deviate.
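The dampening effect is easy to see numerically: the score gap between adjacent ranks shrinks as K grows, which is why extreme values either over-reward #1 or flatten the ranking entirely:

```python
def rrf_score(rank, k):
    """Contribution of a single list position to the fused RRF score."""
    return 1.0 / (k + rank)

for k in (1, 60, 1000):
    gap = rrf_score(1, k) - rrf_score(2, k)
    print(f"K={k:4d}  gap between #1 and #2: {gap:.6f}")
```

At K=1 the top position dominates; at K=1000 adjacent positions are nearly indistinguishable; K=60 sits in between.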
Practical effect: on a test set of 10 queries (5 Ukrainian, 5 English), hybrid search consistently found relevant results where pure vector search missed on exact terms. Implementation — ~190 lines of code, zero additional external dependencies (BM25 written from scratch).
Scoped Access: An Agent Sees Only What It’s Allowed To
When there’s a single agent in the system, access control isn’t needed. But at this stage we launched two specialized subagents — exclusively as assistants in Telegram group chats for testing with colleagues:
- research-assistant — a bot in the research group, scoped to the research cluster (papers, data, methodology)
- teaching-bot — a bot in the teaching group, scoped to teaching (course materials, lectures, student assignments)
- main — the core Ayona (full access, outside groups)
Both subagents were an early pilot: real colleagues, real queries in chats, no staging environment. This delivered fast feedback, but immediately raised the practical question of isolation: both subagents ran on the same knowledge base, and without scoping, teaching-bot literally saw raw research data not intended for students — and vice versa.
Solution: agent_scopes.json + ScopedRAG + taint policy
Isolation is built on three layers.
Layer 1 — Scope mapping. A single JSON file defines which agent sees which clusters:
{
  "research-assistant": {
    "scopes": ["research", "ayona_ops"],
    "access_level": "internal",
    "deny_patterns": ["memory/*"]
  },
  "teaching-bot": {
    "scopes": ["teaching"],
    "access_level": "group",
    "deny_patterns": ["memory/*", "02_distill/research/*"]
  },
  "main": {
    "scopes": [],
    "access_level": "internal",
    "deny_patterns": []
  }
}
An empty scopes list means full access. Deny patterns are additional glob masks filtering paths the agent must not see even within its allowed scope.
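The Layer 1 semantics reduce to two checks. A minimal sketch (visible is an illustrative function, not the real middleware, which applies the same logic inside its search path):

```python
from fnmatch import fnmatch

def visible(path, cluster, scopes, deny_patterns):
    """Empty scopes = full access; deny patterns then veto
    individual paths even inside an allowed scope."""
    if scopes and cluster not in scopes:
        return False
    return not any(fnmatch(path, pat) for pat in deny_patterns)

# teaching-bot: scoped to "teaching", memory/* denied
ok = visible("02_distill/teaching/lab1.md", "teaching", ["teaching"], ["memory/*"])
blocked = visible("memory/notes.md", "teaching", ["teaching"], ["memory/*"])
```

Note that fnmatch's * also matches path separators, so a single memory/* pattern covers the whole subtree.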
Layer 2 — ScopedRAG middleware. A class that sits between the agent and the search index:
rag = ScopedRAG.from_config("teaching-bot")
results = rag.search("student assessment methods", top_k=5)
from_config() loads the scope, then search() sequentially applies: scope filter → deny pattern matching → taint policy → audit log.
Key decision: deny-by-default for unknown agents. If agent_id is not in the config, the agent gets an empty scope ["__NONE__"] and access level "denied". This is the only safe default in a multi-agent system.
if agent_id not in configs:
    return cls(scopes=["__NONE__"], access_level="denied")
Layer 3 — Taint policy. Even within an allowed scope, not every source is equally trustworthy. Taint policy defines what can be done with a result based on its trust level:
TRUST_LEVELS = {
    "internal": 3,   # workspace files — full trust
    "external": 2,   # Drive, NotebookLM — can inform, not initiate
    "untrusted": 1,  # web, forwarded — display only
}

ACTION_RISKS = {
    "read": 1,           # display information
    "analyze": 1,        # analysis, summarization
    "suggest": 2,        # proposing actions
    "execute": 3,        # executing commands
    "external_send": 3,  # email, telegram, webhook
}
The rule is simple: trust_level must be at least as high as action_risk. An untrusted source cannot initiate execute or external_send. An external source requires explicit approval for high-risk actions.
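The rule reduces to a single comparison over the two tables above (taint_allows is an illustrative name for the check):

```python
TRUST_LEVELS = {"internal": 3, "external": 2, "untrusted": 1}
ACTION_RISKS = {"read": 1, "analyze": 1, "suggest": 2,
                "execute": 3, "external_send": 3}

def taint_allows(source_trust, action):
    """Allow an action only if the source's trust level covers the action's risk."""
    return TRUST_LEVELS[source_trust] >= ACTION_RISKS[action]
```

An untrusted web snippet can be read and analyzed, but never triggers execute or external_send, no matter how persuasive its content is.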
Audit. Every scoped query is logged to rag_audit.jsonl:
{"timestamp": "2026-03-25T14:22:01", "query": "student assessment methods",
"scope": "teaching", "action": "read", "allowed": 4, "blocked": 1}
Not just for debugging. This answers the question: “Why didn’t the bot find this document?” — which comes up in production sooner than you’d expect.
Practical effect: agents share one workspace and one index, but see completely different slices of knowledge. Adding a new agent is one JSON entry. Removing access is deleting that entry. No changes needed in the search code.
research-assistant and teaching-bot pass through the ScopedRAG middleware, which applies scope filter → deny patterns → taint policy sequentially. An unknown agent automatically receives scope ["__NONE__"] and access level "denied".
Delegation v1.1: Empty Output Is Not Success
In Part 1, I described the bounded delegation envelope — a contract that formalizes delegation instead of “just send a prompt to a subagent.” Between theory and production lay one mistake that rewrote the entire specification.
What Happened
A child agent received a task: write an analytical document and save it as an artifact. During execution, the model provider returned an error. The child agent completed with (no output). The parent orchestrator received the completion event and reported: “task completed successfully.”
Nobody checked whether the artifact appeared. Nobody looked at the child history. The completion event arrived — so: success.
This is the classic deceptive completion — a failure that masquerades as success. And it destroys trust in the entire delegation system.
Three Rules of v1.1
We didn’t rewrite the architecture. We added three rules — a hardening pass, not a redesign.
Rule 1: Truthful completion semantics. If the child session ends with error semantics (stopReason == error, non-empty errorMessage, or abort with failure payload), the parent completion must be failed. Period. (no output) does not mask failure.
Rule 2: Empty output is a suspect state. Empty completion is not a valid success. It is a suspicion requiring mandatory verification:
- Inspect child session history
- Inspect terminal assistant message
- Inspect expected artifact path
- Only then declare a result
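Rules 1 and 2 compress into a small decision function. This is an illustrative sketch, not the orchestrator's real code: field names (stop_reason, error_message) follow the spec's vocabulary, and the three verification steps are collapsed into the one that is mechanically checkable here, the expected artifact:

```python
def classify_completion(stop_reason, error_message, output, artifact_exists):
    """Rule 1: error semantics force FAILED regardless of output.
    Rule 2: empty output is a suspect state, never a silent success."""
    if stop_reason == "error" or error_message:
        return "FAILED"
    if not output:
        return "SUCCESS" if artifact_exists else "FAILED"
    return "SUCCESS"
```

The deceptive-completion case from above — provider error, empty output — now lands in FAILED on the very first branch.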
Rule 3: Explicit success contract. The launch packet must now define success criteria before execution:
delegation_flow: v1.1
task_class: writing
primary_model: anthropic/claude-sonnet-4-6
expected_output_type: artifact
expected_artifact_path: 03_insights/analysis.md
success_condition: "artifact exists and is non-empty"
failure_condition: "terminal error OR artifact missing"
verification_steps: [child_history, artifact_exists, terminal_state]
Before v1.1: empty output → completion event → “Task completed successfully” → trust destroyed. After v1.1: three checks (error semantics, empty output, artifact existence) → Status: FAILED → retry or escalate.
Artifact gate
If expected_output_type is artifact or both, completion cannot be success until the artifact exists. This is a mechanical rule that doesn’t depend on the model’s “opinion” about result quality. The file either exists — or it doesn’t.
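The gate itself is a few lines. A sketch, assuming artifacts are checked on the orchestrator's own filesystem (artifact_gate is a hypothetical name):

```python
from pathlib import Path

def artifact_gate(expected_output_type, artifact_path):
    """Mechanical gate: for artifact outputs, success requires a
    non-empty file at the expected path. No model opinion involved."""
    if expected_output_type not in ("artifact", "both"):
        return True  # gate does not apply to text-only outputs
    p = Path(artifact_path)
    return p.exists() and p.stat().st_size > 0
```

Checking st_size > 0 matters: a zero-byte file created before a crash would otherwise pass as "artifact exists".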
Practical effect: zero false-success completions after adopting v1.1. Verification adds a step to every delegation — but it’s a price worth paying.
Proof Loop: Code That Proves It Works
Delegation v1.1 ensures we don’t miss a failure. But how do we get from “we noticed a failure” to “the system retries on its own and proves success”?
For that we built repo-task-proof-loop — a skill that closes the loop: from task description to automatically verified result.
Pipeline
Task description
→ acceptance criteria generation (heuristic or LLM)
→ subagent run in sandbox (Docker, no network)
→ automated verification
→ if must-criteria failed → retry
→ if passed → apply patches → create branch → push → PR URL
Orchestrator: while-loop with verification
The heart of the skill is the Orchestrator class, which coordinates runner, verifier, and acceptance criteria:
def run(self):
    report = {'run_id': self.runner.run_id, 'iterations': []}
    iteration = 0
    success = False
    while iteration < self.max_iterations and not success:
        iteration += 1
        # run subagent work
        rc, out, err = self.runner.run_subtask('...')
        # verification
        results = self.verifier.run_criteria(self.ac.get('criteria', []))
        # decision: any must-failures?
        must_fail = [r for r, c in zip(results, self.ac['criteria'])
                     if c.get('severity') == 'must' and r['rc'] != 0]
        success = len(must_fail) == 0
    if success:
        # apply patches, push branch, create PR
        branch = f'feature/run-{self.runner.run_id}'
        self.runner.apply_patches(branch_name=branch, push=True)
Verifier: deliberate simplicity
The verifier is a consciously minimal component. The entire code fits in a couple dozen lines:
import subprocess

class Verifier:
    def __init__(self, workdir='.'):
        self.workdir = workdir

    def run_test(self, command: str):
        proc = subprocess.Popen(command, shell=True, cwd=self.workdir,
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = proc.communicate()
        return {
            'command': command,
            'rc': proc.returncode,
            'stdout': out.decode('utf8', errors='replace'),
            'stderr': err.decode('utf8', errors='replace')
        }

    def run_criteria(self, criteria):
        return [self.run_test(c.get('test_command')) for c in criteria]
This is not a bug — it’s a decision. The verifier must be auditable in seconds. If the verifier itself needs verification, that’s a recursion going nowhere.
Must vs Should
Acceptance criteria have two severity levels:
- must — block completion. If pytest fails — retry or fail.
- should — informational. If a linter finds a style issue — that’s not a reason to block working code.
This distinction is critical. Without it, a flaky linter or pedantic formatter can endlessly block a subagent that has already produced working code.
Practical effect: subagent work that previously required manual review of every result now self-verifies. A human reviews the PR, not the process. The proof loop is delegation envelope v1.1 made executable.
Knowledge Graph as Context Routing
In Part 1, the knowledge graph was described as a “topology of decisions” — a structure that reduces the entropy of knowledge access. Over two weeks it also became a context router for retrieval.
Retrieval metadata layer
Previously, graph routing depended on the first paragraph of a markdown card — human text written for humans. This worked while the number of cards was small. At 100+ nodes, routing quality started depending on how well the author had written the opening sentence.
Solution: add a machine-oriented metadata layer to every routing-critical node:
retrieval_hints:
  - "model routing policy for subagent delegation"
  - "ops-lite vs design-tier task classification"
source_kind: policy
sensitivity: internal
canonical_artifacts:
  - 99_process/agent_runbooks/task_classification_and_model_routing.md
Now the graph-context skill reads retrieval_hints directly, instead of parsing markdown and hoping the first paragraph is informative enough.
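A sketch of what hint-based routing can look like. Both route_by_hints and the node shapes are illustrative, not the graph-context skill's real API; the point is that matching happens against machine-oriented hints, not prose:

```python
def route_by_hints(query_terms, nodes):
    """Pick the node whose retrieval_hints overlap the query the most."""
    def score(node):
        hint_text = " ".join(node.get("retrieval_hints", [])).lower()
        return sum(term.lower() in hint_text for term in query_terms)
    return max(nodes, key=score)

nodes = [
    {"id": "routing_policy",
     "retrieval_hints": ["model routing policy for subagent delegation"]},
    {"id": "ecg_notes",
     "retrieval_hints": ["pan-tompkins r-peak detection in ECG"]},
]
best = route_by_hints(["model", "routing"], nodes)  # → routing_policy
```

Because hints are short, controlled strings, this kind of matching is deterministic in a way that scoring free-form opening paragraphs never was.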
Bucket A/B/C classification
Not all nodes are equally important for routing. We classified them into three categories:
- Bucket A (graph-facing hubs) — policy cards, protocol references, routing entry points. These nodes received full retrieval metadata first.
- Bucket B (supporting) — useful but not routing hubs.
- Bucket C (passive) — rarely queried, low-touch.
This let us focus metadata enrichment effort on ~20% of nodes that handle ~80% of routing traffic.
Cross-cluster bridges
Scoped access isolates clusters — and that’s right for security. But sometimes knowledge from different clusters is semantically related, and those connections are worth knowing.
cross_cluster_bridges.py compares embeddings across scoped indices and finds fragment pairs with cosine similarity above a threshold (default 0.78):
# Compute cross-similarity matrix
sim = emb1 @ emb2.T
# Find pairs above threshold
indices = np.argwhere(sim > threshold)
The result is a list of “bridges” between clusters:
[0.8234] research ↔ teaching
02_distill/research/nlp_evaluation_metrics.md
02_distill/teaching/text_generation_lab.md
This doesn’t break isolation — bridges are visible only to the administrator. But they suggest: “perhaps these materials should be linked in the graph” or “this research result overlaps with a lecture topic.”
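A self-contained sketch of the bridge computation, under the assumption that embeddings are L2-normalised so the dot product equals cosine similarity (find_bridges is illustrative; the real script works across the scoped indices):

```python
import numpy as np

def find_bridges(emb1, emb2, threshold=0.78):
    """Cosine similarity between two clusters' embedding matrices;
    pairs above the threshold become candidate bridges."""
    emb1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    emb2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    sim = emb1 @ emb2.T
    return [(int(i), int(j), float(sim[i, j]))
            for i, j in np.argwhere(sim > threshold)]
```

Normalising inside the function makes the threshold meaningful regardless of how the embeddings were stored.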
Practical effect: graph routing became deterministic instead of probabilistic. Context selection stopped depending on how well a human wrote the opening paragraph of a knowledge card.
Lessons and Failure Modes
Two weeks of implementation left lessons worth recording.
One search method is one point of failure. Pure vector search looks elegant, but on domain-specific terminology and bilingual queries it systematically misses. Hybrid (BM25 + semantic + RRF) is cheap to implement and substantially more reliable.
Deny-by-default is the only safe default. If a new agent automatically sees everything — that’s not a feature, it’s an incident waiting to happen. Unknown agent = zero access.
Empty output is not a neutral result. It’s a suspect state. Systems that treat empty output as “nothing happened” will eventually miss a real failure. Better to stop and verify than to move forward with an optimistic “probably fine.”
The verifier must be simpler than the code it verifies. If the verifier is a complex system with its own dependencies and configuration — who verifies the verifier? A couple dozen lines is not a limitation. It’s a decision.
Metadata for machines is not the same as documentation for humans. The first paragraph of a markdown card is written for a reader. retrieval_hints are written for the retrieval pipeline. When these two goals mix in one text, both suffer.
From Theory to Pipeline
Each layer answers a distinct question: what to find → who can see it → which context → how to execute → did it actually work.
Architecture without implementation is a whitepaper. Implementation without trust verification is a demo.
Over two weeks we moved from “here’s how this should work” to “here’s where it broke and how we fixed it.” Hybrid retrieval, scoped access, delegation hardening, proof loop, metadata-enriched graph — each of these components solves its own concrete problem. Together they form a single trust pipeline: from what the system finds, through who it shows it to, to how it proves the work was actually done.
Trust is not a feature. It’s a pipeline.
To be continued.
Serhii Zabolotnii — DSc, NLP/LLM Researcher, Professor, AI Systems Architect. Building Ayona — an AI-native research and operations system.