AI & Agent Security¶
Threats and defense patterns specific to AI-assisted development and agentic workflows. This complements the infrastructure-focused Security Overview with practitioner guidance for building and operating agent systems safely.
1. Prompt Injection Defense¶
Threat: Malicious content manipulates agent behavior by injecting instructions into data the agent processes. See Security Overview, Section 1.2 for the threat model.
Attack Vectors in the Toolkit¶
| Vector | Entry Point | Example |
|---|---|---|
| Crafted article | intake/inbox/ | Markdown with hidden instructions in YAML frontmatter |
| Malicious SKILL.md | Plugin installation | SKILL.md with injected prompts in description field |
| MCP tool results | Any MCP tool call | External service returns data containing instructions |
| Proposal injection | intake/pending/ | Proposal with target_skill pointing to a sensitive skill |
Defense Patterns¶
Structured parsing over free-text interpretation: Process metadata (YAML frontmatter, structured fields) separately from prose content. The intake pipeline's Step 2a-bis validates YAML structure before the agent interprets the article content.
Input validation as first gate: /inbox-qualify validates YAML frontmatter, sanitizes slugs (strips .., /, \, shell metacharacters), and validates target_skill against the skill manifest. Malformed YAML is auto-classified as archive.
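The slug sanitization step can be sketched as a small allowlist filter. This is a hypothetical helper for illustration — the actual /inbox-qualify implementation may differ — but an allowlist is the safer shape: instead of enumerating dangerous characters, it keeps only characters that can never be path-traversal or shell-metacharacter payloads.

```python
import re

def sanitize_slug(raw: str) -> str:
    """Allowlist slug sanitizer (illustrative, not the real /inbox-qualify code).

    Keeps only lowercase alphanumerics, hyphen, and underscore; everything
    else (including '..', '/', '\\', and shell metacharacters) is neutralized.
    """
    slug = raw.strip().lower()
    slug = re.sub(r"[^a-z0-9_-]", "-", slug)       # replace anything risky
    slug = re.sub(r"-{2,}", "-", slug).strip("-")  # collapse and trim separators
    return slug
```

Because nothing outside the allowlist survives, new attack characters do not require updating a denylist.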
SKILL.md frontmatter audit: setup.sh reads allowed-tools declarations during installation and warns about skills declaring write-capable tools (Bash, Write, Edit, NotebookEdit). This creates an audit trail.
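The audit can be sketched in Python as follows. This assumes a flat `allowed-tools: A, B` line inside `---`-delimited frontmatter; the real setup.sh is a shell script and may parse the file differently.

```python
import re

# Write-capable tools flagged during installation (per the audit described above).
WRITE_CAPABLE = {"Bash", "Write", "Edit", "NotebookEdit"}

def audit_allowed_tools(skill_md: str) -> list[str]:
    """Return write-capable tools declared in a SKILL.md frontmatter block.

    Illustrative sketch: recognizes only a flat `allowed-tools:` line inside
    `---`-delimited YAML frontmatter.
    """
    match = re.search(r"^---\n(.*?)\n---", skill_md, re.DOTALL)
    if not match:
        return []
    for line in match.group(1).splitlines():
        if line.startswith("allowed-tools:"):
            declared = {t.strip() for t in line.split(":", 1)[1].split(",")}
            return sorted(declared & WRITE_CAPABLE)
    return []
```

A non-empty return value is what would trigger the installation warning and land in the audit trail.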
Tool result flagging: The system prompt instructs agents to flag suspected prompt injection in MCP tool results. This is a convention-based defense (agents are told to be suspicious of tool output), not a structural one.
2. Agent Tool Misuse Prevention¶
Threat: Agents execute tools beyond their intended scope, either through misconfiguration or through sub-agent privilege escalation. See Security Overview, Section 1.4 and Agent Architecture Patterns.
Defense Layers¶
allowed-tools as declaration of intent: Skills declare their tool requirements in SKILL.md frontmatter. This is a convention — the Claude Code runtime does not enforce it — but it enables audit and review.
block-writes.sh for hard enforcement: Read-only agents (review agents, research agents) can use the block-writes.sh hook script to structurally prevent write operations, regardless of what the agent attempts.
bypassPermissions cascade risk: When an orchestrator runs with bypassPermissions, that privilege flows to all sub-agents it spawns. This is a known Claude Code behavior. Orchestrators should spawn sub-agents with the minimum required permission mode.
Sub-agent privilege boundaries: A read-only review agent can dispatch a general-purpose sub-agent with full tool access. The runtime does not enforce the parent's restrictions on children. Defense: orchestrators should declare max_subagent_tools in their agent definitions, and code review should catch violations.
MCP server whitelisting: enableAllProjectMcpServers: false in settings prevents project-level .mcp.json files from automatically activating MCP servers. Only explicitly approved servers are available.
3. Hallucination Risks in Agentic Workflows¶
Threat: Agents fabricate information about tools, files, or status, leading to incorrect actions or false confidence.
Hallucination Types¶
| Type | Example | Impact |
|---|---|---|
| Tool call hallucination | Agent invokes a tool that doesn't exist or fabricates parameters | Error at best, data corruption at worst |
| Path hallucination | Agent references files that don't exist | Wasted effort, incorrect recommendations |
| Status hallucination | Agent claims "tests pass" without running them | False confidence, bugs reach production |
| Content hallucination | Agent generates plausible but incorrect code or facts | Subtle bugs, incorrect documentation |
Defense Patterns¶
Verification before completion: The /vt-c-verification-before-completion skill requires agents to run verification commands and confirm output before claiming success. "Evidence before assertions" — no success claim without proof.
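The "evidence before assertions" rule can be sketched as a small wrapper: run the verification command, capture its output, and only report success when both the exit code and the output agree. This is an illustration of the pattern, not the skill's actual code.

```python
import subprocess

def verified_success(command: list[str], expected_marker: str) -> bool:
    """Run a verification command and require evidence before claiming success.

    Success needs BOTH a clean exit code AND the expected marker in the
    output -- an agent cannot claim "tests pass" without having run them.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    evidence = result.stdout + result.stderr
    return result.returncode == 0 and expected_marker in evidence
```

The captured output doubles as the proof attached to the success claim.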
Convergence checks: The /vt-c-ralph-wiggum-loop skill prevents completion claims without evidence of convergence. It wraps verification in a loop that checks whether output actually matches expectations.
Multi-agent cross-checking: /vt-c-4-review dispatches 6 parallel review agents. If one agent hallucinates a finding, the others are unlikely to produce the same hallucination. Consensus across independent agents reduces hallucination risk.
Spec-driven validation: The spec compliance reviewer checks implementation against specification acceptance criteria. This structural check catches agents that claim to have implemented features they skipped.
4. Chain-of-Thought Poisoning¶
Threat: Crafted content influences the agent's reasoning across tool boundaries, causing it to make decisions that serve the attacker's goals rather than the user's.
How It Works¶
- Agent reads an external article via /inbox-qualify
- The article contains persuasive (but false) technical claims
- The agent's reasoning is influenced when it later processes unrelated work
- Decisions are subtly biased by the injected framing
Defense Patterns¶
Context isolation via context: fork: Skills that process untrusted content should use context: fork in their SKILL.md frontmatter. This runs the skill as an isolated sub-agent — its context does not flow back to the parent agent's reasoning.
Structured frontmatter as parsing boundary: When processing intake items, extract metadata from YAML frontmatter (structured, validated) rather than from prose content (unstructured, potentially manipulative). The agent processes fields, not arguments.
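The parsing boundary amounts to splitting an intake item into structured fields (which the agent acts on) and prose (which is treated strictly as data). A minimal sketch, assuming flat `key: value` frontmatter rather than full YAML:

```python
def split_intake_item(text: str) -> tuple[dict, str]:
    """Separate structured frontmatter from prose content.

    Minimal sketch: real intake items use full YAML; here only flat
    `key: value` pairs inside a `---` block are recognized.
    """
    if not text.startswith("---\n"):
        return {}, text                       # no frontmatter: all prose
    header, _, body = text[4:].partition("\n---\n")
    fields = {}
    for line in header.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields, body
```

Downstream logic reads only the returned fields; persuasive text in the body never becomes an input to a decision.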
Session boundaries: Each Claude Code conversation starts with a fresh context. Long-running sessions that accumulate untrusted content are higher risk than short, focused sessions.
Residual risk: Within a single session, the main agent's context accumulates all content it has read. There is no mechanism to "forget" potentially poisoned content mid-session. Mitigation: keep sessions focused, and use fork skills for untrusted content processing.
5. MCP Tool Output Validation¶
Threat: MCP tools return data from external services (Azure DevOps, GitHub, Taskmaster). This data is untrusted — a compromised or misconfigured external service could return malicious content.
Current Controls¶
| Control | Mechanism | Limitation |
|---|---|---|
| System prompt | Agents instructed to flag suspicious tool results | Convention-based, not structural |
| PostToolUse hooks | security-lint scans file writes triggered by tool results | Only catches writes, not reasoning influence |
| Version pinning | MCP packages pinned to known versions | Protects against package compromise, not service compromise |
Gap: No Schema Validation¶
MCP tool outputs are not validated against a schema before the agent processes them. A malicious response from an MCP server could include unexpected fields, oversized payloads, or content designed to influence agent reasoning.
Recommendation: For high-risk MCP tools (those that return user-generated content from external services), consider adding a PostToolUse hook that validates response structure before the agent processes it.
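Such a check could be as simple as the sketch below: reject unexpected fields and oversized payloads before the agent reasons over the content. The key set and size cap are hypothetical; a real hook would tailor both per tool.

```python
MAX_PAYLOAD_BYTES = 64_000   # arbitrary illustrative cap

def validate_mcp_response(response: dict, allowed_keys: set[str]) -> list[str]:
    """Return a list of violations for an MCP tool response.

    Hypothetical PostToolUse-style check: flag fields outside the expected
    schema and payloads large enough to suggest stuffing.
    """
    problems = []
    unexpected = set(response) - allowed_keys
    if unexpected:
        problems.append(f"unexpected fields: {sorted(unexpected)}")
    size = len(repr(response).encode())
    if size > MAX_PAYLOAD_BYTES:
        problems.append(f"payload too large: {size} bytes")
    return problems
```

An empty list means the response passes; anything else is grounds to quarantine the result rather than hand it to the agent.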
6. Safe Patterns for Untrusted Content Processing¶
Intake Pipeline as Security Boundary¶
The intake pipeline (intake/inbox/ -> /inbox-qualify -> intake/knowledge/ or intake/archive/) serves as the primary security boundary for external content entering the toolkit.
Key controls at this boundary:
- YAML frontmatter validation (malformed = auto-archive)
- Slug sanitization (path traversal and shell metacharacters stripped)
- target_skill validation against manifest
- Path traversal blocking for extracted file paths
- Interactive approval before routing (human in the loop)
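The path traversal control in the list above can be sketched as a containment check: resolve the candidate path and verify it stays under the intake root. Resolving collapses `..` segments (and symlinks), so a crafted extracted path cannot escape.

```python
from pathlib import Path

def is_contained(base: str, candidate: str) -> bool:
    """True if `candidate`, once resolved, stays inside `base`.

    Illustrative check: resolution collapses `..` and symlinks before
    the containment comparison, so string tricks cannot escape the root.
    """
    base_dir = Path(base).resolve()
    target = (base_dir / candidate).resolve()
    return target == base_dir or base_dir in target.parents
```

Checking the resolved path, rather than the raw string, is what defeats payloads like `a/../../b`.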
Quarantine Pattern¶
Suspicious content is moved to intake/archive/quarantine/ rather than deleted. This preserves evidence for investigation while preventing the content from being processed. See Incident Response, Section 3.
Tiered Repo Evaluation¶
External repositories are evaluated through a 4-level safety protocol (documented in Security Governance):
- Level 1 (API only): GitHub API metadata — no code downloaded
- Level 2 (Shallow clone): Read-only analysis in a temporary directory
- Level 3 (Deep analysis): Full clone with pattern extraction
- Level 4 (Integration): Intake proposal for toolkit integration
Each level requires explicit user approval before escalation.
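The approval gate between levels can be reduced to a tiny state transition: escalation only ever moves one level at a time, and only with an explicit approval flag. A minimal sketch of that invariant (level descriptions paraphrased from the protocol above):

```python
LEVELS = {
    1: "API only: GitHub API metadata, no code downloaded",
    2: "Shallow clone: read-only analysis in a temporary directory",
    3: "Deep analysis: full clone with pattern extraction",
    4: "Integration: intake proposal for toolkit integration",
}

def escalate(current: int, approved: bool) -> int:
    """Advance one evaluation level, but only with explicit user approval."""
    if not approved:
        raise PermissionError(f"escalation beyond level {current} not approved")
    return min(current + 1, 4)
```

Encoding the gate in code (rather than convention) means a skipped approval raises instead of silently cloning.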
Quick Reference¶
| Threat | Primary Defense | Secondary Defense | Gap |
|---|---|---|---|
| Prompt injection (articles) | YAML validation + slug sanitization | Interactive approval | Content-based heuristics |
| Prompt injection (SKILL.md) | Frontmatter audit on install | Code review | Runtime tool enforcement |
| Prompt injection (MCP results) | System prompt flagging | PostToolUse hooks | No structural defense |
| Agent tool misuse | allowed-tools + block-writes.sh | bypassPermissions awareness | Runtime enforcement |
| Sub-agent escalation | Convention (max_subagent_tools) | Code review | No runtime enforcement |
| Hallucination | Verification-before-completion | Multi-agent cross-checking | Cannot prevent, only detect |
| Chain-of-thought poisoning | context: fork isolation | Session boundaries | In-session accumulation |
| MCP output injection | System prompt + hooks | Version pinning | No schema validation |
Related Documents¶
- Security Overview — infrastructure threat model, controls matrix
- Hardening Guide — deployment hardening checklist
- Incident Response — response procedures per incident type
- Security Governance — deny rules, drift audit, repo evaluation
- Agent Architecture Patterns — hook hierarchy, policy islands, privilege cascade