Defensive Agent Safety: Best Practices for Engineers Who Worry Their Agents Could Be Turned Against Them
Here's the fear I keep hearing from engineers building defensive AI agents: what if the thing I built to protect my organization becomes the most effective weapon against it? It's not paranoia. A defensive agent with broad permissions — read access to logs, write access to response playbooks, credential access to quarantine endpoints — is a high-value target. If an attacker can manipulate it, they get all of that for free.
I'm still learning this space, and I don't have all the answers. But I've been collecting the best thinking I can find on agent safety, and I want to share the patterns that resonate most with me as a practitioner.
The Core Threat Model: Your Agent as an Attack Surface
Start by threat-modeling your own agent. Ask the uncomfortable questions:
- What data can this agent read? If an attacker controls any of that data, they may be able to inject instructions into the agent's context.
- What actions can this agent take? Every capability is a potential weapon if the agent is manipulated.
- What credentials does this agent hold? These are the crown jewels. Treat them as if they're already compromised in your threat model.
- How does this agent handle unexpected or adversarial input? Most agents haven't been hardened against adversarial prompts the way web applications are hardened against SQL injection.
Indirect prompt injection is the primary vector here. The OWASP LLM Top 10 rates it as a top risk, and for good reason: an agent processing attacker-controlled content (a phishing email, a malicious log entry, a poisoned document in the vector store) may execute instructions embedded in that content. This isn't a model alignment problem — it's an input validation problem, and it deserves the same engineering rigor as any other injection class.
Containment Principles
The single most effective safety measure is minimizing what an agent can do. This sounds obvious, but in practice, agents tend to accumulate permissions over time as new capabilities are added. Fight this tendency aggressively:
- Least privilege, always: Give the agent only the permissions it needs for its current task. If the agent is doing triage, it doesn't need write access. If it's writing a report, it doesn't need access to production credentials.
- Ephemeral credentials: Don't give agents long-lived API keys. Use short-lived tokens with automatic rotation. If an agent is compromised, the damage window is bounded by the credential lifetime.
- Read-before-write gates: For agents that do take action (quarantine, block, remediate), add a human-in-the-loop gate before consequential write actions. The agent proposes; a human approves. This is slower, but the blast radius of a manipulated agent is dramatically reduced.
- Tool surface minimization: Every tool you expose to an agent is an attack surface. Audit your tool list regularly and remove tools that aren't being used in production workflows.
Input Validation and Output Filtering
Treat agent inputs and outputs like untrusted data flowing through a web application:
- Sanitize retrieved content: If your agent retrieves data from external sources (emails, web pages, documents), strip or escape anything that looks like an instruction or a prompt. This is imperfect — there's no perfect sanitizer for natural language — but it meaningfully raises the difficulty of injection attacks.
- Structural prompts over free-form prompts: Where possible, use structured inputs (JSON, YAML, templated fields) rather than free-form text that the agent reasons over holistically. Structure reduces the surface for embedded instructions.
- Output classifiers: Before an agent's output is acted upon or displayed, run it through a classifier that flags anomalies: unexpected commands, unusual formatting, references to external URLs that weren't in the original task. A lightweight classifier won't catch everything, but it adds a layer.
Observability as a Safety Control
You can't defend what you can't see. Agentic systems are particularly hard to observe because the reasoning happens inside the model — you see the inputs and outputs, but not the chain of thought (unless you're logging it explicitly). Build observability in from the start:
- Log every tool call with its inputs and outputs, the agent's stated reasoning (if available), and a timestamp.
- Set up anomaly alerts for unusual tool-call sequences — e.g., an agent that normally reads logs suddenly attempting to call an external API.
- Maintain an immutable audit trail. If an agent is ever implicated in an incident, you need to reconstruct exactly what it did.
- Treat agent logs as security-sensitive artifacts. They may contain credential fragments, sensitive data retrieved during a task, or evidence of manipulation attempts.
Ethics and Organizational Responsibility
This is the part I find hardest to write, because it's not purely technical. Defensive agents operate on behalf of an organization, and that organization is responsible for what those agents do — even if a human didn't directly authorize a specific action. Some principles I think are worth holding:
- Document the agent's decision authority: Before deploying, write down exactly what this agent is authorized to do, what it is not authorized to do, and under what conditions its authority can be expanded. Treat this like a policy document, not an engineering note.
- Plan for failure modes: What happens if the agent takes a wrong action? Is there a rollback procedure? Is the organization prepared to communicate what happened to affected parties?
- Avoid deploying agents in adversarial environments without extensive testing: An agent tested only in controlled lab conditions may behave very differently when it encounters real attacker behavior. Red-team the agent before giving it production access.
- Revisit permissions as the threat landscape evolves: A permission that was safe when you first deployed may not be safe six months later when new attack patterns emerge. Agent permissions should be re-reviewed on a regular schedule.
What I'm Still Figuring Out
I don't think anyone has fully solved the problem of adversarially robust agentic AI yet. The academic literature on prompt injection is still relatively young. The tooling for agent observability is immature. The organizational processes for governing autonomous AI actions are being invented in real time.
What I'm confident about: the engineers who build these controls proactively — before an incident forces the conversation — will be better positioned than those who treat safety as an afterthought. The security community has learned this lesson the hard way with web applications, cloud infrastructure, and API security. The agentic AI layer is the next frontier, and the window to build it right is now.
References & Further Reading
- OWASP. LLM Top 10 for Large Language Model Applications — LLM02: Prompt Injection. https://owasp.org/www-project-top-10-for-large-language-model-applications/
- Greshake, Kai et al. Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv, 2023. https://arxiv.org/abs/2302.12173
- Anthropic. Claude's Model Specification — Avoiding Harm. https://www.anthropic.com/model-spec
- MITRE ATLAS. Adversarial Threat Landscape for AI Systems. https://atlas.mitre.org/
- NIST. AI Risk Management Framework (AI RMF 1.0). https://www.nist.gov/system/files/documents/2023/01/26/AI%20RMF%201.0.pdf
- Perez, Fábio and Ian Ribeiro. Ignore Previous Prompt: Attack Techniques For Language Models. arXiv, 2022. https://arxiv.org/abs/2211.09527
- Google DeepMind. Frontier Safety Framework. https://deepmind.google/discover/blog/introducing-the-frontier-safety-framework/
- OpenAI. Preparedness Framework. https://openai.com/safety/preparedness