ADR 0005: Advanced Agentic Cognitive Architecture

Date: 2026-04-06
Status: Proposed

Context: The RUNE platform currently supports linear, single-step, or hard-coded tool-based agents. To support Tier 3 autonomous agents (e.g., advanced DevSecOps or SRE automation), the architecture must evolve to support continuous internal reasoning, dynamic tool discovery, persistent memory, and proactive safety boundaries.

1. Agents with Automatic Self-Reflection

Problem: Current AgentRunner implementations execute a prompt and return the first result. They cannot verify their own logic or catch hallucinations before responding to the orchestrator.

Decision: Introduce a ReflectionDriver interface or a configurable reflection loop within AgentRunner.

Details:
- The agent generates an initial response (Draft).
- The agent is prompted again with its Draft and the original context: "Critique this response for accuracy, completeness, and safety."
- The agent generates a final response based on the critique.
- The DriverTransport layer must log the delta between the Draft and Final response for benchmarking the efficacy of the reflection step.
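The Draft, critique, and Final sequence can be sketched as follows. This is a minimal illustration, not the ReflectionDriver design: `run_with_reflection`, `ReflectionResult`, and `fake_llm` are hypothetical names, the model is assumed to be any callable from prompt string to completion string, and the delta is computed as a unified diff, which is only one possible way to measure the change.

```python
import difflib
from dataclasses import dataclass
from typing import Callable

# Assumption: the model is any callable mapping a prompt to a completion.
LLM = Callable[[str], str]

CRITIQUE_INSTRUCTION = (
    "Critique this response for accuracy, completeness, and safety."
)


@dataclass
class ReflectionResult:
    draft: str
    critique: str
    final: str
    delta: list[str]  # unified-diff lines between draft and final


def run_with_reflection(llm: LLM, context: str) -> ReflectionResult:
    """Run Draft -> critique -> Final and record the Draft/Final delta."""
    draft = llm(context)
    critique = llm(f"{context}\n\nDraft:\n{draft}\n\n{CRITIQUE_INSTRUCTION}")
    final = llm(
        f"{context}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\n"
        "Write the final response, addressing the critique."
    )
    delta = list(
        difflib.unified_diff(
            draft.splitlines(),
            final.splitlines(),
            fromfile="draft",
            tofile="final",
            lineterm="",
        )
    )
    return ReflectionResult(draft, critique, final, delta)


def fake_llm(prompt: str) -> str:
    """Stand-in model for the demo; returns canned text per prompt stage."""
    if "Write the final response" in prompt:
        return "Restart the pod after checking logs."
    if CRITIQUE_INSTRUCTION in prompt:
        return "The draft skips log inspection."
    return "Restart the pod."


if __name__ == "__main__":
    result = run_with_reflection(fake_llm, "Pod is in CrashLoopBackOff.")
    print("\n".join(result.delta))
```

In this sketch the delta is returned to the caller; in the decision above, logging it would be the responsibility of the DriverTransport layer.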

2. Multi-Step Planning with Dynamic Tool Routing

Problem: Workflows are currently statically defined in rune_bench/workflows.py. Agents cannot dynamically string together tools (like kubectl, aws-cli, and git) to solve novel problems.

Decision: Adopt the Model Context Protocol (MCP) for dynamic tool provision.

Details:
- The RUNE orchestration layer will act as an MCP Server, exposing available tools and resources.
- The AgentRunner will act as an MCP Client.
- The agent will generate a multi-step plan, executing tools via MCP iteratively until the goal is met.
- State transitions (Plan -> Act -> Observe) will be explicitly tracked in the benchmark output.
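The Plan -> Act -> Observe loop and its tracked transitions might look roughly like the sketch below. It deliberately does not use an MCP SDK: a plain dictionary of callables stands in for the tools an MCP Server would expose, and `run_agent`, `Step`, `Trace`, and `scripted_planner` are hypothetical names. The tool outputs are canned strings, not real command output.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Assumption: a tool is a callable taking an argument string. In the real
# design these would be discovered and invoked over MCP.
Tool = Callable[[str], str]


@dataclass
class Step:
    tool: str
    argument: str


@dataclass
class Trace:
    """Ordered (state, detail) pairs destined for the benchmark output."""
    transitions: list[tuple[str, str]] = field(default_factory=list)

    def record(self, state: str, detail: str) -> None:
        self.transitions.append((state, detail))


# A planner returns the next step, or None once the goal is considered met.
Planner = Callable[[str, list[str]], Optional[Step]]


def run_agent(
    goal: str, planner: Planner, tools: dict[str, Tool], max_steps: int = 5
) -> Trace:
    trace = Trace()
    observations: list[str] = []
    for _ in range(max_steps):  # bound the loop so a bad plan cannot spin
        step = planner(goal, observations)
        if step is None:
            trace.record("Done", goal)
            break
        trace.record("Plan", f"{step.tool} {step.argument}")
        trace.record("Act", step.tool)
        tool = tools.get(step.tool)
        if tool is None:
            observation = f"error: unknown tool {step.tool!r}"
        else:
            observation = tool(step.argument)
        trace.record("Observe", observation)
        observations.append(observation)
    return trace


def scripted_planner(goal: str, observations: list[str]) -> Optional[Step]:
    """Demo planner: replays a fixed script instead of calling a model."""
    script = [Step("kubectl", "get pods"), Step("git", "log -1")]
    done = len(observations)
    return script[done] if done < len(script) else None


if __name__ == "__main__":
    demo_tools = {
        "kubectl": lambda arg: "checkout CrashLoopBackOff",  # canned output
        "git": lambda arg: "abc123 bump payment client",  # canned output
    }
    for state, detail in run_agent(
        "diagnose checkout", scripted_planner, demo_tools
    ).transitions:
        print(state, "|", detail)
```

Feeding each observation back into the planner is what lets the agent revise its plan mid-run, and the `max_steps` bound is an assumed safeguard rather than part of the decision.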

3. Advanced Memory Persistence

Problem: RUNE's SQLite/S3 sinks are for post-job auditing, not active agent memory. Agents suffer from "amnesia" between distinct task executions.

Decision: Extend the AgentRunner base class with a MemoryProvider interface.

Details:
- Episodic Memory: Short-term vector storage representing the current benchmark session (e.g., "I just ran kubectl get pods and saw a CrashLoopBackOff").
- Semantic Memory: Long-term knowledge graphs or vector databases containing domain knowledge (e.g., "In this environment, a 502 on the checkout service usually means the payment gateway is down").
- Procedural Memory: Cached sequences of successful tool executions (e.g., "The verified sequence to rollback a deployment is X -> Y -> Z").
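One possible shape for the MemoryProvider interface is sketched below. The method names (`remember`, `recall`, `end_session`) and the `InMemoryProvider` class are assumptions made for illustration. Retrieval here uses simple keyword overlap in place of the vector or graph stores described above, purely to keep the example self-contained.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class MemoryProvider(ABC):
    """Hypothetical interface; the ADR does not specify these methods."""

    @abstractmethod
    def remember(self, kind: str, text: str) -> None:
        """Store one memory under 'episodic', 'semantic', or 'procedural'."""

    @abstractmethod
    def recall(self, kind: str, query: str, limit: int = 3) -> list[str]:
        """Return up to `limit` memories of `kind` relevant to `query`."""


class InMemoryProvider(MemoryProvider):
    KINDS = ("episodic", "semantic", "procedural")

    def __init__(self) -> None:
        self._store: dict[str, list[str]] = defaultdict(list)

    def remember(self, kind: str, text: str) -> None:
        if kind not in self.KINDS:
            raise ValueError(f"unknown memory kind: {kind!r}")
        self._store[kind].append(text)

    def recall(self, kind: str, query: str, limit: int = 3) -> list[str]:
        # Keyword overlap stands in for vector similarity in this sketch.
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(text.lower().split())), text)
            for text in self._store.get(kind, [])
        ]
        matches = [pair for pair in scored if pair[0] > 0]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in matches[:limit]]

    def end_session(self) -> None:
        """Drop short-term episodic memory; long-term kinds persist."""
        self._store.pop("episodic", None)


if __name__ == "__main__":
    memory = InMemoryProvider()
    memory.remember("episodic", "kubectl get pods showed CrashLoopBackOff")
    memory.remember(
        "semantic",
        "a 502 on the checkout service usually means the payment gateway is down",
    )
    print(memory.recall("semantic", "checkout 502"))
    memory.end_session()
    print(memory.recall("episodic", "pods"))
```

Keeping all three memory kinds behind one interface lets a benchmark swap in different backends without changing AgentRunner code, which is the main design point this sketch is meant to show.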

4. Advanced Agentic Safety Models

Problem: Security currently relies on static CI/CD gates (SAST, SCA) and fail-closed cost estimation gates. Autonomous agents with write access require runtime behavioral guardrails.

Decision: Implement a Safety Interceptor (or "Observer Agent") within the DriverTransport layer.

Details:
- Before any tool execution request (via MCP or direct driver call) reaches the host system, the Interceptor evaluates the action.
- The Interceptor operates on a default-deny policy for destructive actions (e.g., rm -rf, IAM role modification) unless explicitly whitelisted by the benchmark definition.
- If an action is flagged, the Interceptor blocks the execution and returns a "Safety Violation" context back to the agent, forcing it to replan.
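The interception flow can be illustrated with the sketch below. `SafetyInterceptor`, `Verdict`, and `guarded_execute` are hypothetical names, the two regular expressions are illustrative only and nowhere near a complete destructive-action policy, and the benchmark's whitelist is modeled as a set of exact command strings, which is an assumption about its format.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable

# Illustrative patterns only; a real policy would need far broader coverage.
DESTRUCTIVE_PATTERNS = [
    # rm with both the r and f flags in one option group, e.g. -rf or -fr
    re.compile(r"\brm\s+-(?=\w*r)(?=\w*f)\w+"),
    # IAM role changes, e.g. attach-role-policy, put-role-policy, delete-role
    re.compile(r"\baws\s+iam\s+(put|attach|delete|update)-role"),
]


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str


class SafetyInterceptor:
    def __init__(self, allowlist: Iterable[str] = ()) -> None:
        # Assumption: the benchmark definition lists exact permitted commands.
        self._allowlist = frozenset(allowlist)

    def evaluate(self, command: str) -> Verdict:
        for pattern in DESTRUCTIVE_PATTERNS:
            if pattern.search(command):
                if command in self._allowlist:
                    return Verdict(True, "destructive but explicitly permitted")
                return Verdict(False, f"Safety Violation: blocked {command!r}")
        return Verdict(True, "not classified as destructive")


def guarded_execute(
    interceptor: SafetyInterceptor, command: str, execute: Callable[[str], str]
) -> str:
    """Run `execute` only if permitted; otherwise return violation context."""
    verdict = interceptor.evaluate(command)
    if not verdict.allowed:
        return verdict.reason  # handed back to the agent so it can replan
    return execute(command)


if __name__ == "__main__":
    interceptor = SafetyInterceptor(allowlist={"rm -rf /tmp/rune-scratch"})
    for cmd in ("kubectl get pods", "rm -rf /", "rm -rf /tmp/rune-scratch"):
        print(cmd, "->", interceptor.evaluate(cmd))
```

Returning the violation as ordinary tool output, rather than raising an exception, matches the intent that a blocked action becomes context for replanning instead of ending the run.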

Consequences

Positive: Enables RUNE to benchmark cutting-edge autonomous agents accurately; positions RUNE as a Tier 3 testing platform.
Negative: Significantly increases the complexity of the AgentRunner and DriverTransport layers; raises the minimum LLM capabilities required (e.g., deep reasoning models like deepseek-r1 or qwen3:32b).