AI & Agent Security

Threats and defense patterns specific to AI-assisted development and agentic workflows. This complements the infrastructure-focused Security Overview with practitioner guidance for building and operating agent systems safely.

1. Prompt Injection Defense

Threat: Malicious content manipulates agent behavior by injecting instructions into data the agent processes. See Security Overview, Section 1.2 for the threat model.

Attack Vectors in the Toolkit

| Vector | Entry Point | Example |
| --- | --- | --- |
| Crafted article | intake/inbox/ | Markdown with hidden instructions in YAML frontmatter |
| Malicious SKILL.md | Plugin installation | SKILL.md with injected prompts in the description field |
| MCP tool results | Any MCP tool call | External service returns data containing instructions |
| Proposal injection | intake/pending/ | Proposal with target_skill pointing to a sensitive skill |

Defense Patterns

Structured parsing over free-text interpretation: Process metadata (YAML frontmatter, structured fields) separately from prose content. The intake pipeline's Step 2a-bis validates YAML structure before the agent interprets the article content.

Input validation as first gate: /inbox-qualify validates YAML frontmatter, sanitizes slugs (strips .., /, \, shell metacharacters), and validates target_skill against the skill manifest. Malformed YAML is auto-classified as archive.
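The sanitization step can be sketched as follows. This is illustrative, not the toolkit's actual code, and the exact metacharacter set is an assumption:

```python
import re

# Sketch of the slug sanitization /inbox-qualify performs: strip
# path-traversal sequences, path separators, and shell metacharacters
# before the slug is ever used to build a filesystem path.
def sanitize_slug(raw: str) -> str:
    slug = raw.replace("..", "")                        # path traversal
    slug = slug.replace("/", "").replace("\\", "")      # path separators
    slug = re.sub(r"[;&|$`<>(){}!*?#~'\"\s]", "", slug) # shell metacharacters
    return slug
```

Stripping rather than rejecting is a design choice: a mangled slug still routes the item, while a rejection would require another round trip with the submitter.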

SKILL.md frontmatter audit: setup.sh reads allowed-tools declarations during installation and warns about skills declaring write-capable tools (Bash, Write, Edit, NotebookEdit). This creates an audit trail.
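The audit amounts to intersecting the declared tools with a write-capable set. A minimal Python sketch (the real check lives in setup.sh; this version only illustrates the logic):

```python
# Tools the install-time audit treats as write-capable.
WRITE_CAPABLE = {"Bash", "Write", "Edit", "NotebookEdit"}

def audit_allowed_tools(skill_md: str) -> list[str]:
    """Return the write-capable tools a SKILL.md declares, for the audit log."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter to audit
    declared: list[str] = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        if line.startswith("allowed-tools:"):
            declared = [t.strip() for t in line.split(":", 1)[1].split(",")]
    return sorted(set(declared) & WRITE_CAPABLE)
```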

Tool result flagging: The system prompt instructs agents to flag suspected prompt injection in MCP tool results. This is a convention-based defense (agents are told to be suspicious of tool output), not a structural one.

2. Agent Tool Misuse Prevention

Threat: Agents execute tools beyond their intended scope, either through misconfiguration or through sub-agent privilege escalation. See Security Overview, Section 1.4 and Agent Architecture Patterns.

Defense Layers

allowed-tools as declaration of intent: Skills declare their tool requirements in SKILL.md frontmatter. This is a convention — the Claude Code runtime does not enforce it — but it enables audit and review.

block-writes.sh for hard enforcement: Read-only agents (review agents, research agents) can use the block-writes.sh hook script to structurally prevent write operations, regardless of what the agent attempts.
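A Python analogue of such a hook, for illustration only (the toolkit ships block-writes.sh): Claude Code PreToolUse hooks receive the pending tool call as JSON on stdin, and an exit code of 2 blocks the call, feeding stderr back to the agent.

```python
import sys

# Tool names this sketch treats as write-capable.
WRITE_TOOLS = {"Bash", "Write", "Edit", "NotebookEdit"}

def decide(payload: dict) -> int:
    """Return the hook exit code: 2 blocks the tool call, 0 allows it."""
    tool = payload.get("tool_name", "")
    if tool in WRITE_TOOLS:
        print(f"blocked write-capable tool: {tool}", file=sys.stderr)
        return 2
    return 0

# In the hook script itself: sys.exit(decide(json.load(sys.stdin)))
```

Because the block happens in the hook, it holds regardless of what the agent's prompt or context says, which is what makes this a structural rather than convention-based defense.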

bypassPermissions cascade risk: When an orchestrator runs with bypassPermissions, that privilege flows to all sub-agents it spawns. This is a known Claude Code behavior. Orchestrators should spawn sub-agents with the minimum required permission mode.

Sub-agent privilege boundaries: A read-only review agent can dispatch a general-purpose sub-agent with full tool access. The runtime does not enforce the parent's restrictions on children. Defense: orchestrators should declare max_subagent_tools in their agent definitions, and code review should catch violations.

MCP server whitelisting: enableAllProjectMcpServers: false in settings prevents project-level .mcp.json files from automatically activating MCP servers. Only explicitly approved servers are available.
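In settings this looks roughly like the following; the server name and the companion enabledMcpjsonServers allow-list are illustrative:

```json
{
  "enableAllProjectMcpServers": false,
  "enabledMcpjsonServers": ["github"]
}
```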

3. Hallucination Risks in Agentic Workflows

Threat: Agents fabricate information about tools, files, or status, leading to incorrect actions or false confidence.

Hallucination Types

| Type | Example | Impact |
| --- | --- | --- |
| Tool call hallucination | Agent invokes a tool that doesn't exist or fabricates parameters | Error at best, data corruption at worst |
| Path hallucination | Agent references files that don't exist | Wasted effort, incorrect recommendations |
| Status hallucination | Agent claims "tests pass" without running them | False confidence, bugs reach production |
| Content hallucination | Agent generates plausible but incorrect code or facts | Subtle bugs, incorrect documentation |

Defense Patterns

Verification before completion: The /vt-c-verification-before-completion skill requires agents to run verification commands and confirm output before claiming success. "Evidence before assertions" — no success claim without proof.
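The pattern reduces to a simple rule: the success claim is derived from captured command output, never asserted independently of it. A hedged sketch (the function name is hypothetical):

```python
import subprocess

def verify_then_claim(cmd: list[str], expected: str) -> str:
    """Run the verification command; only claim success on real evidence."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    evidence = result.stdout + result.stderr
    if result.returncode == 0 and expected in evidence:
        return "verified: " + expected
    return f"NOT verified (exit {result.returncode})"
```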

Convergence checks: The /vt-c-ralph-wiggum-loop skill prevents completion claims without evidence of convergence. It wraps verification in a loop that checks whether output actually matches expectations.

Multi-agent cross-checking: /vt-c-4-review dispatches 6 parallel review agents. If one agent hallucinates a finding, the others are unlikely to produce the same hallucination. Consensus across independent agents reduces hallucination risk.
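The consensus step can be sketched as a quorum filter over the reviewers' findings (illustrative only, not the toolkit's implementation):

```python
from collections import Counter

def consensus_findings(agent_reports: list[list[str]], quorum: int = 2) -> list[str]:
    """Keep only findings reported by at least `quorum` independent agents,
    so a hallucinated finding from a single reviewer is discarded."""
    counts = Counter(f for report in agent_reports for f in set(report))
    return sorted(f for f, n in counts.items() if n >= quorum)
```

The underlying assumption is independence: if reviewers share context or prompts, their hallucinations can correlate and the quorum loses its value.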

Spec-driven validation: The spec compliance reviewer checks implementation against specification acceptance criteria. This structural check catches agents that claim to have implemented features they skipped.

4. Chain-of-Thought Poisoning

Threat: Crafted content influences the agent's reasoning across tool boundaries, causing it to make decisions that serve the attacker's goals rather than the user's.

How It Works

  1. Agent reads an external article via /inbox-qualify
  2. The article contains persuasive (but false) technical claims
  3. The agent's reasoning is influenced when it later processes unrelated work
  4. Decisions are subtly biased by the injected framing

Defense Patterns

Context isolation via context: fork: Skills that process untrusted content should use context: fork in their SKILL.md frontmatter. This runs the skill as an isolated sub-agent — its context does not flow back to the parent agent's reasoning.
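In SKILL.md frontmatter this looks roughly like the following; fields other than context: fork are illustrative:

```yaml
---
name: inbox-qualify
description: Qualify untrusted intake articles
context: fork        # run as an isolated sub-agent
allowed-tools: Read
---
```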

Structured frontmatter as parsing boundary: When processing intake items, extract metadata from the YAML frontmatter (structured, validated) rather than from the prose content (unstructured, potentially manipulative). The agent acts on validated fields, not on the article's persuasive prose.
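A minimal sketch of that parsing boundary (field handling is illustrative): metadata drives routing, while the prose body is carried along but never interpreted as instructions.

```python
def split_intake_item(text: str) -> tuple[dict, str]:
    """Separate frontmatter fields (trusted after validation) from prose."""
    parts = text.split("---", 2)
    if len(parts) < 3:
        return {}, text  # malformed: no frontmatter, treat everything as prose
    meta = {}
    for line in parts[1].strip().splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return meta, parts[2].strip()
```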

Session boundaries: Each Claude Code conversation starts with a fresh context. Long-running sessions that accumulate untrusted content are higher risk than short, focused sessions.

Residual risk: Within a single session, the main agent's context accumulates all content it has read. There is no mechanism to "forget" potentially poisoned content mid-session. Mitigation: keep sessions focused, and use fork skills for untrusted content processing.

5. MCP Tool Output Validation

Threat: MCP tools return data from external services (Azure DevOps, GitHub, Taskmaster). This data is untrusted — a compromised or misconfigured external service could return malicious content.

Current Controls

| Control | Mechanism | Limitation |
| --- | --- | --- |
| System prompt | Agents instructed to flag suspicious tool results | Convention-based, not structural |
| PostToolUse hooks | security-lint scans file writes triggered by tool results | Only catches writes, not reasoning influence |
| Version pinning | MCP packages pinned to known versions | Protects against package compromise, not service compromise |

Gap: No Schema Validation

MCP tool outputs are not validated against a schema before the agent processes them. A malicious response from an MCP server could include unexpected fields, oversized payloads, or content designed to influence agent reasoning.

Recommendation: For high-risk MCP tools (those that return user-generated content from external services), consider adding a PostToolUse hook that validates response structure before the agent processes it.
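A possible shape for such a validator; the helper name, allowed keys, and size cap are all assumptions rather than toolkit code:

```python
import json

MAX_BYTES = 64_000                              # assumed payload cap
ALLOWED_KEYS = {"id", "title", "state", "body"} # assumed response shape

def validate_mcp_response(response: dict) -> tuple[bool, str]:
    """Reject responses with unexpected fields or oversized payloads
    before the agent ever processes them."""
    unexpected = set(response) - ALLOWED_KEYS
    if unexpected:
        return False, f"unexpected fields: {sorted(unexpected)}"
    if len(json.dumps(response)) > MAX_BYTES:
        return False, "payload exceeds size limit"
    return True, "ok"
```

Note the limitation: this catches structural anomalies, but allowed fields such as body can still carry injected instructions, so it complements rather than replaces the system-prompt flagging above.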

6. Safe Patterns for Untrusted Content Processing

Intake Pipeline as Security Boundary

The intake pipeline (intake/inbox/ -> /inbox-qualify -> intake/knowledge/ or intake/archive/) serves as the primary security boundary for external content entering the toolkit.

Key controls at this boundary:

  - YAML frontmatter validation (malformed = auto-archive)
  - Slug sanitization (path traversal and shell metacharacters stripped)
  - target_skill validation against the manifest
  - Path traversal blocking for extracted file paths
  - Interactive approval before routing (human in the loop)

Quarantine Pattern

Suspicious content is moved to intake/archive/quarantine/ rather than deleted. This preserves evidence for investigation while preventing the content from being processed. See Incident Response, Section 3.

Tiered Repo Evaluation

External repositories are evaluated through a 4-level safety protocol (documented in Security Governance):

  1. Level 1 (API only): GitHub API metadata — no code downloaded
  2. Level 2 (Shallow clone): Read-only analysis in a temporary directory
  3. Level 3 (Deep analysis): Full clone with pattern extraction
  4. Level 4 (Integration): Intake proposal for toolkit integration

Each level requires explicit user approval before escalation.

Quick Reference

| Threat | Primary Defense | Secondary Defense | Gap |
| --- | --- | --- | --- |
| Prompt injection (articles) | YAML validation + slug sanitization | Interactive approval | Content-based heuristics |
| Prompt injection (SKILL.md) | Frontmatter audit on install | Code review | Runtime tool enforcement |
| Prompt injection (MCP results) | System prompt flagging | PostToolUse hooks | No structural defense |
| Agent tool misuse | allowed-tools + block-writes.sh | bypassPermissions awareness | Runtime enforcement |
| Sub-agent escalation | Convention (max_subagent_tools) | Code review | No runtime enforcement |
| Hallucination | Verification-before-completion | Multi-agent cross-checking | Cannot prevent, only detect |
| Chain-of-thought poisoning | context: fork isolation | Session boundaries | In-session accumulation |
| MCP output injection | System prompt + hooks | Version pinning | No schema validation |