AI & Agent Security

Threats and defense patterns specific to AI-assisted development and agentic workflows. This complements the infrastructure-focused Security Overview with practitioner guidance for building and operating agent systems safely.

1. Prompt Injection Defense

Threat: Malicious content manipulates agent behavior by injecting instructions into data the agent processes. See Security Overview, Section 1.2 for the threat model.

Attack Vectors in the Toolkit

| Vector | Entry Point | Example |
| --- | --- | --- |
| Crafted article | intake/inbox/ | Markdown with hidden instructions in YAML frontmatter |
| Malicious SKILL.md | Plugin installation | SKILL.md with injected prompts in the description field |
| MCP tool results | Any MCP tool call | External service returns data containing instructions |
| Proposal injection | intake/pending/ | Proposal with target_skill pointing to a sensitive skill |

Defense Patterns

Structured parsing over free-text interpretation: Process metadata (YAML frontmatter, structured fields) separately from prose content. The intake pipeline's Step 2a-bis validates YAML structure before the agent interprets the article content.

Input validation as first gate: /inbox-qualify validates YAML frontmatter, sanitizes slugs (strips .., /, \, shell metacharacters), and validates target_skill against the skill manifest. Malformed YAML is auto-classified as archive.
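The sanitization step can be sketched as follows. This is illustrative, not the toolkit's actual code, and the exact metacharacter set is an assumption:

```python
import re

# Sketch of the slug sanitization /inbox-qualify performs: strip
# path-traversal sequences, path separators, and shell metacharacters
# before the slug is ever used to build a filesystem path.
def sanitize_slug(raw: str) -> str:
    slug = raw.replace("..", "")                        # path traversal
    slug = slug.replace("/", "").replace("\\", "")      # path separators
    slug = re.sub(r"[;&|$`<>(){}!*?#~'\"\s]", "", slug) # shell metacharacters
    return slug
```

Stripping rather than rejecting is a design choice: a mangled slug still routes the item, while a rejection would require another round trip with the submitter.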

SKILL.md frontmatter audit: setup.sh reads allowed-tools declarations during installation and warns about skills declaring write-capable tools (Bash, Write, Edit, NotebookEdit). This creates an audit trail.
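The audit amounts to intersecting the declared tools with a write-capable set. A minimal Python sketch (the real check lives in setup.sh; this version only illustrates the logic):

```python
# Tools the install-time audit treats as write-capable.
WRITE_CAPABLE = {"Bash", "Write", "Edit", "NotebookEdit"}

def audit_allowed_tools(skill_md: str) -> list[str]:
    """Return the write-capable tools a SKILL.md declares, for the audit log."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter to audit
    declared: list[str] = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        if line.startswith("allowed-tools:"):
            declared = [t.strip() for t in line.split(":", 1)[1].split(",")]
    return sorted(set(declared) & WRITE_CAPABLE)
```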

Tool result flagging: The system prompt instructs agents to flag suspected prompt injection in MCP tool results. This is a convention-based defense (agents are told to be suspicious of tool output), not a structural one.

2. Agent Tool Misuse Prevention

Threat: Agents execute tools beyond their intended scope, either through misconfiguration or through sub-agent privilege escalation. See Security Overview, Section 1.4 and Agent Architecture Patterns.

Defense Layers

allowed-tools as declaration of intent: Skills declare their tool requirements in SKILL.md frontmatter. This is a convention — the Claude Code runtime does not enforce it — but it enables audit and review.

block-writes.sh for hard enforcement: Read-only agents (review agents, research agents) can use the block-writes.sh hook script to structurally prevent write operations, regardless of what the agent attempts.
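A Python analogue of such a hook, for illustration only (the toolkit ships block-writes.sh): Claude Code PreToolUse hooks receive the pending tool call as JSON on stdin, and an exit code of 2 blocks the call, feeding stderr back to the agent.

```python
import sys

# Tool names this sketch treats as write-capable.
WRITE_TOOLS = {"Bash", "Write", "Edit", "NotebookEdit"}

def decide(payload: dict) -> int:
    """Return the hook exit code: 2 blocks the tool call, 0 allows it."""
    tool = payload.get("tool_name", "")
    if tool in WRITE_TOOLS:
        print(f"blocked write-capable tool: {tool}", file=sys.stderr)
        return 2
    return 0

# In the hook script itself: sys.exit(decide(json.load(sys.stdin)))
```

Because the block happens in the hook, it holds regardless of what the agent's prompt or context says, which is what makes this a structural rather than convention-based defense.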

bypassPermissions cascade risk: When an orchestrator runs with bypassPermissions, that privilege flows to all sub-agents it spawns. This is a known Claude Code behavior. Orchestrators should spawn sub-agents with the minimum required permission mode.

Sub-agent privilege boundaries: A read-only review agent can dispatch a general-purpose sub-agent with full tool access. The runtime does not enforce the parent's restrictions on children. Defense: orchestrators should declare max_subagent_tools in their agent definitions, and code review should catch violations.

MCP server whitelisting: enableAllProjectMcpServers: false in settings prevents project-level .mcp.json files from automatically activating MCP servers. Only explicitly approved servers are available.
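In settings this looks roughly like the following; the server name and the companion enabledMcpjsonServers allow-list are illustrative:

```json
{
  "enableAllProjectMcpServers": false,
  "enabledMcpjsonServers": ["github"]
}
```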

3. Hallucination Risks in Agentic Workflows

Threat: Agents fabricate information about tools, files, or status, leading to incorrect actions or false confidence.

Hallucination Types

| Type | Example | Impact |
| --- | --- | --- |
| Tool call hallucination | Agent invokes a tool that doesn't exist or fabricates parameters | Error at best, data corruption at worst |
| Path hallucination | Agent references files that don't exist | Wasted effort, incorrect recommendations |
| Status hallucination | Agent claims "tests pass" without running them | False confidence, bugs reach production |
| Content hallucination | Agent generates plausible but incorrect code or facts | Subtle bugs, incorrect documentation |

Defense Patterns

Verification before completion: The /vt-c-verification-before-completion skill requires agents to run verification commands and confirm output before claiming success. "Evidence before assertions" — no success claim without proof.
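The pattern reduces to a simple rule: the success claim is derived from captured command output, never asserted independently of it. A hedged sketch (the function name is hypothetical):

```python
import subprocess

def verify_then_claim(cmd: list[str], expected: str) -> str:
    """Run the verification command; only claim success on real evidence."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    evidence = result.stdout + result.stderr
    if result.returncode == 0 and expected in evidence:
        return "verified: " + expected
    return f"NOT verified (exit {result.returncode})"
```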

Convergence checks: The /vt-c-ralph-wiggum-loop skill prevents completion claims without evidence of convergence. It wraps verification in a loop that checks whether output actually matches expectations.

Multi-agent cross-checking: /vt-c-4-review dispatches 6 parallel review agents. If one agent hallucinates a finding, the others are unlikely to produce the same hallucination. Consensus across independent agents reduces hallucination risk.
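The consensus step can be sketched as a quorum filter over the reviewers' findings (illustrative only, not the toolkit's implementation):

```python
from collections import Counter

def consensus_findings(agent_reports: list[list[str]], quorum: int = 2) -> list[str]:
    """Keep only findings reported by at least `quorum` independent agents,
    so a hallucinated finding from a single reviewer is discarded."""
    counts = Counter(f for report in agent_reports for f in set(report))
    return sorted(f for f, n in counts.items() if n >= quorum)
```

The underlying assumption is independence: if reviewers share context or prompts, their hallucinations can correlate and the quorum loses its value.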

Spec-driven validation: The spec compliance reviewer checks implementation against specification acceptance criteria. This structural check catches agents that claim to have implemented features they skipped.

4. Chain-of-Thought Poisoning

Threat: Crafted content influences the agent's reasoning across tool boundaries, causing it to make decisions that serve the attacker's goals rather than the user's.

How It Works

  1. Agent reads an external article via /inbox-qualify
  2. The article contains persuasive (but false) technical claims
  3. The agent's reasoning is influenced when it later processes unrelated work
  4. Decisions are subtly biased by the injected framing

Defense Patterns

Context isolation via context: fork: Skills that process untrusted content should use context: fork in their SKILL.md frontmatter. This runs the skill as an isolated sub-agent — its context does not flow back to the parent agent's reasoning.
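In SKILL.md frontmatter this looks roughly like the following; fields other than context: fork are illustrative:

```yaml
---
name: inbox-qualify
description: Qualify untrusted intake articles
context: fork        # run as an isolated sub-agent
allowed-tools: Read
---
```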

Structured frontmatter as parsing boundary: When processing intake items, extract metadata from the YAML frontmatter (structured, validated) rather than from the prose content (unstructured, potentially manipulative). The agent acts on validated fields, not on the article's persuasive prose.
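A minimal sketch of that parsing boundary (field handling is illustrative): metadata drives routing, while the prose body is carried along but never interpreted as instructions.

```python
def split_intake_item(text: str) -> tuple[dict, str]:
    """Separate frontmatter fields (trusted after validation) from prose."""
    parts = text.split("---", 2)
    if len(parts) < 3:
        return {}, text  # malformed: no frontmatter, treat everything as prose
    meta = {}
    for line in parts[1].strip().splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return meta, parts[2].strip()
```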

Session boundaries: Each Claude Code conversation starts with a fresh context. Long-running sessions that accumulate untrusted content are higher risk than short, focused sessions.

Residual risk: Within a single session, the main agent's context accumulates all content it has read. There is no mechanism to "forget" potentially poisoned content mid-session. Mitigation: keep sessions focused, and use fork skills for untrusted content processing.

5. MCP Tool Output Validation

Threat: MCP tools return data from external services (Azure DevOps, GitHub, Taskmaster). This data is untrusted — a compromised or misconfigured external service could return malicious content.

Current Controls

| Control | Mechanism | Limitation |
| --- | --- | --- |
| System prompt | Agents instructed to flag suspicious tool results | Convention-based, not structural |
| PostToolUse hooks | security-lint scans file writes triggered by tool results | Only catches writes, not reasoning influence |
| Version pinning | MCP packages pinned to known versions | Protects against package compromise, not service compromise |

Gap: No Schema Validation

MCP tool outputs are not validated against a schema before the agent processes them. A malicious response from an MCP server could include unexpected fields, oversized payloads, or content designed to influence agent reasoning.

Recommendation: For high-risk MCP tools (those that return user-generated content from external services), consider adding a PostToolUse hook that validates response structure before the agent processes it.
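A possible shape for such a validator; the helper name, allowed keys, and size cap are all assumptions rather than toolkit code:

```python
import json

MAX_BYTES = 64_000                              # assumed payload cap
ALLOWED_KEYS = {"id", "title", "state", "body"} # assumed response shape

def validate_mcp_response(response: dict) -> tuple[bool, str]:
    """Reject responses with unexpected fields or oversized payloads
    before the agent ever processes them."""
    unexpected = set(response) - ALLOWED_KEYS
    if unexpected:
        return False, f"unexpected fields: {sorted(unexpected)}"
    if len(json.dumps(response)) > MAX_BYTES:
        return False, "payload exceeds size limit"
    return True, "ok"
```

Note the limitation: this catches structural anomalies, but allowed fields such as body can still carry injected instructions, so it complements rather than replaces the system-prompt flagging above.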

6. Safe Patterns for Untrusted Content Processing

Intake Pipeline as Security Boundary

The intake pipeline (intake/inbox/ -> /inbox-qualify -> intake/knowledge/ or intake/archive/) serves as the primary security boundary for external content entering the toolkit.

Key controls at this boundary:

  - YAML frontmatter validation (malformed = auto-archive)
  - Slug sanitization (path traversal and shell metacharacters stripped)
  - target_skill validation against the manifest
  - Path traversal blocking for extracted file paths
  - Interactive approval before routing (human in the loop)

Quarantine Pattern

Suspicious content is moved to intake/archive/quarantine/ rather than deleted. This preserves evidence for investigation while preventing the content from being processed. See Incident Response, Section 3.

Tiered Repo Evaluation

External repositories are evaluated through a 4-level safety protocol (documented in Security Governance):

  1. Level 1 (API only): GitHub API metadata — no code downloaded
  2. Level 2 (Shallow clone): Read-only analysis in a temporary directory
  3. Level 3 (Deep analysis): Full clone with pattern extraction
  4. Level 4 (Integration): Intake proposal for toolkit integration

Each level requires explicit user approval before escalation.

Quick Reference

| Threat | Primary Defense | Secondary Defense | Gap |
| --- | --- | --- | --- |
| Prompt injection (articles) | YAML validation + slug sanitization | Interactive approval | Content-based heuristics |
| Prompt injection (SKILL.md) | Frontmatter audit on install | Code review | Runtime tool enforcement |
| Prompt injection (MCP results) | System prompt flagging | PostToolUse hooks | No structural defense |
| Agent tool misuse | allowed-tools + block-writes.sh | bypassPermissions awareness | Runtime enforcement |
| Sub-agent escalation | Convention (max_subagent_tools) | Code review | No runtime enforcement |
| Hallucination | Verification-before-completion | Multi-agent cross-checking | Cannot prevent, only detect |
| Chain-of-thought poisoning | context: fork isolation | Session boundaries | In-session accumulation |
| MCP output injection | System prompt + hooks | Version pinning | No schema validation |