AI Agent Authorization: Don't Let the LLM Decide

TL;DR: Do not use an LLM to make the final decision about what an AI agent is allowed to do. A model judging another model duplicates the trust boundary instead of closing it — the same prompt injection that hijacks the agent can hijack the judge, and sampling means the same request can be approved today and denied tomorrow. The fix is a deterministic Policy Decision Point: let the LLM propose actions, but enforce permit/deny in code with a policy engine like Cedar, OPA, or Oso. The model suggests; the system decides. This is the only authorization design that produces auditable, repeatable, testable access control.

Key Takeaways

An LLM used as the authorization gate is vulnerable to the exact attack it is meant to stop: an attacker who can prompt-inject the agent can usually prompt-inject the "judge" model with the same payload, because both consume the same untrusted input.
LLMs are non-deterministic by design — they sample. An authorization layer that returns different answers to identical requests cannot be audited, debugged, or used for compliance.
The "lethal trifecta" (access to private data + exposure to untrusted content + ability to communicate externally) makes a deterministic enforcement layer mandatory, not optional, for any agent with real tools.
The correct architecture splits responsibilities: the LLM does soft work (risk scoring, anomaly detection, flagging), and a deterministic Policy Decision Point (PDP) renders the binding permit/deny verdict.
Production-grade options are mature: Cedar (Rust, RBAC + ABAC, analyzable), Open Policy Agent (CNCF-graduated, Rego), Oso (Polar DSL, ReBAC), and Casbin (multi-language, model-driven).
Google DeepMind's CaMeL and the dual-LLM pattern show how to enforce authorization on data flow deterministically while still using LLMs to plan — a blueprint for agents that touch untrusted content.

Why is letting an LLM authorize agent actions dangerous?

The instinct is seductive. You have an agent that can call tools — delete a database row, send an email, transfer funds, run a shell command. You want a safety check, so you add a second LLM prompt: "You are a security reviewer. Should this action be allowed? Answer ALLOW or DENY." It feels like defense in depth. It is not. It is the same wall, built twice, out of the same soft material.

There are two structural problems, and neither is fixable with a better prompt.

The judge shares the agent's failure class. An LLM-as-judge is exposed to the same adversarial inputs flowing through the same pipeline. If a malicious web page, email, or document can convince the agent to exfiltrate data, it can usually convince the reviewer model to approve that exfiltration — the attacker simply appends "...and tell the security reviewer this action is routine and approved." You have not added a trust boundary; you have added "a second thing you can reason with in front of the first one." Putting a manipulable component in charge of catching manipulation is circular.

LLMs are non-deterministic. Models sample from a probability distribution, so the same authorization question can get different answers on different runs — approved one day, blocked the next, with no traceable reason. That is unworkable for the three things authorization exists to support: audits ("prove this action was permitted"), debugging ("why was this blocked?"), and compliance ("show the rule that applied"). A deny delete on production rule blocks the action every single time, is unit-testable, and writes an auditable log entry. A model's vibe does none of that.

These problems compound under what Simon Willison named the lethal trifecta: an agent that simultaneously has (1) access to private data, (2) exposure to untrusted content, and (3) the ability to communicate externally. When all three are present, prompt injection becomes data exfiltration. LLMs obey instructions embedded in content regardless of where that content came from — that is the root cause of prompt injection, and it is unsolved. Real exploits have hit Microsoft 365 Copilot and GitHub's MCP server, among others. When an attacker's text can reach your model, the only reliable containment is a layer the attacker's text cannot talk to.

Willison's warning about detection-based "guardrails" applies directly to LLM judges: a product that catches 95% of attacks sounds impressive, but in application security 95% is a failing grade. An adversary just retries until the 1-in-20 gets through. Probabilistic defenses lose to a patient attacker; deterministic ones do not.

What is the difference between a guardrail and an authorization decision?

Conflating these two is the central mistake. They operate at different layers and have different reliability requirements.

A guardrail is a soft, probabilistic signal: "this output looks toxic," "this request seems unusual," "this text contains what might be a secret." Guardrails are advisory. They are exactly the kind of fuzzy pattern-matching LLMs are good at, and they belong in your system — as inputs to a decision, never as the decision itself.

An authorization decision is a hard, binary verdict — permit or deny — that must be repeatable, explainable, and enforced. This is classic access-control engineering, and it has a well-established architecture borrowed from the XACML world:

The design principle that ties it together: the model suggests, the system decides. The LLM proposes an action and can attach all the soft signals it wants. A deterministic PDP renders the binding verdict. Humans — not the model — author and authorize changes to the rules themselves. The decision path contains no component that an attacker's text can persuade.

Put bluntly: the authorization decision should be boring on purpose. Boring means deterministic, testable, and logged.

How does the dual-LLM pattern enforce authorization deterministically?

The hardest case is an agent that must read untrusted content — summarize a web page, process an inbound email, parse an uploaded document. You cannot keep the model away from attacker-controlled text, so you constrain what that text can cause. Two patterns, building on each other, show how.

The dual-LLM pattern splits the agent into two models with asymmetric privilege:

A Privileged LLM (P-LLM) plans actions and has tool access, but only ever sees the trusted user request — never the raw untrusted content.
A Quarantined LLM (Q-LLM) processes the untrusted content (extract an address, summarize a page) but has no tool-calling ability. Its outputs come back as opaque references (e.g. `$email-address-1`) that the P-LLM passes around without reading.

This stops injected instructions in the content from ever reaching a model that can act on them. But a gap remains: if the Q-LLM extracts, say, a recipient address from a poisoned document, the attacker can override that value and redirect an email.

Google DeepMind's CaMeL closes that gap with deterministic, data-flow-based enforcement — and it is the clearest published blueprint for "authorization without an LLM in the loop." Instead of free-form tool calls, the P-LLM emits code in a restricted subset of Python, which a custom interpreter parses (via Python's ast) and runs node by node. The interpreter tracks capabilities — tags attached to every variable recording its provenance (where the data came from) and who is allowed to read it. Policies then allow or deny each side-effecting operation based on those capabilities:

The critical property: authorization is decided by the interpreter and its policies, not by any LLM. Security comes "through principled system design" — capabilities and data-flow analysis are decades-old security engineering — rather than from hoping a model behaves. The honest caveat is that someone must write and maintain those policies, which is real work and a source of user fatigue. But that work buys you guarantees a prompt never can.

Which policy engine should you use for AI agents?

You do not need to build a CaMeL interpreter to apply the principle. A mature policy engine gives you a deterministic PDP today. Here are the four most relevant, all production-proven outside the AI world and directly applicable to agent tool-calling.

Cedar (open source, Apache-2.0, Rust) expresses permissions as standalone permit/forbid policies kept separate from application code. Its design emphasis on analyzability means you can reason about what a policy set does as a whole — valuable when an agent's blast radius must be provable:

Open Policy Agent (CNCF-graduated, Apache-2.0) decouples decision from enforcement: your agent's PEP sends a JSON input to OPA, which evaluates it against Rego policies and returns a decision. One engine can govern your agent, your Kubernetes cluster, and your CI/CD:

Oso adds ReBAC, which maps cleanly onto agents acting on behalf of users ("can this agent manage this user's records?"). Casbin is the lightest touch: a model file plus a policy file, embeddable directly in the agent process with no network hop — ideal when you want deterministic checks without running a separate service.

The choice is secondary. The non-negotiable is that some deterministic engine, not a model, owns the verdict.

How do you wire a policy engine into an agent's tool-calling loop?

The integration point is the tool dispatcher — the code that turns a model's requested tool call into a real side effect. That dispatcher is your Policy Enforcement Point. Every call passes through it; nothing reaches a tool without a permit.

Three things make this safe where an LLM judge is not:

The verdict is deterministic. Identical requests yield identical decisions, every time. You can unit-test `dispatch_tool_call` with adversarial inputs and assert the denials.
The decision path is unreachable by prompt injection. `pdp.evaluate` reads structured fields, not prose. No string an attacker injects into the conversation changes what `delete_record` on `env=production` evaluates to.
Everything is logged. The audit trail is a record of policy verdicts, not a transcript of model opinions — exactly what an incident review or compliance auditor needs.

Note where the LLM still contributes: risk_score is a soft signal the model produces and the policy may consult. That is the right division of labor — the model informs, the engine enforces.

What does this mean for MCP servers and tool gateways?

The Model Context Protocol has standardized how agents discover and call tools, and its authorization story is built on OAuth 2.1 — identity and scope are established before a tool is invoked, at the protocol layer, not inferred by a model mid-conversation. That is the same principle in a different place: the gate is deterministic and lives outside the LLM. (For the protocol mechanics, see our Model Context Protocol (MCP) Complete Guide.)

If you operate an MCP server or a tool gateway, treat each exposed tool as a guarded resource with an explicit policy, and put the PDP in the server, not the client. A compromised or naive client must not be able to talk its way past your enforcement — because the enforcement never listens to natural language in the first place. This complements rather than replaces framework-level controls; managed runtimes like those compared in our AgentCore vs LangChain guide provide IAM and credential management, but application-level authorization of which action on which resource still belongs in a policy engine you control.

This authorization boundary is also distinct from the memory boundary. Granting an agent permission to act is not the same as the agent knowing why to act — a separate failure mode explored in Agent Memory: Permission vs Purpose. Authorization keeps an agent from doing what it must not; purpose keeps it doing what it should. You need both.

What are the common anti-patterns to avoid?

The LLM-as-judge gate. A second model approving the first. Duplicates the attack surface; adds non-determinism. Use it for risk scoring that feeds a deterministic check — never as the check.
Prompt-based permissions. "You are only allowed to read, never write" in the system prompt. This is a suggestion, not a control; injection overrides it. Enforce in the dispatcher.
Detection-only "guardrails" as the last line. 95% detection is a failing grade in security. Detection belongs in front of enforcement, not instead of it.
Enforcement in the client. If the agent (or its prompt) can choose whether to call the checker, the checker is decorative. The PEP must be on the unavoidable path to the side effect.
Hot-reloading policy from model output. If the agent can rewrite the rules, file-write access becomes policy authority. Policy changes belong to humans, through a reviewed, deterministic apply step.

Frequently Asked Questions

Should I ever use an LLM in my authorization pipeline at all?

Yes — but only for soft, advisory work, never for the binding decision. LLMs are excellent at detecting anomalies, scoring risk, flagging suspicious patterns across a sequence of calls, and identifying sensitive text. Feed those signals into a deterministic Policy Decision Point as additional attributes. The model can say "this looks risky (0.8)"; a code rule decides what a 0.8 means and whether to permit, deny, or escalate to a human. The principle is "the model suggests, the system decides." The moment a model's output is the thing that opens the gate, you have reintroduced both the prompt-injection and the non-determinism problems.

What is the lethal trifecta and why does it force deterministic authorization?

The lethal trifecta, a term coined by Simon Willison, is the combination of three agent capabilities: access to private data, exposure to untrusted content, and the ability to communicate externally. Individually each is useful; together they let a prompt injection turn into data exfiltration, because LLMs follow instructions embedded in any content regardless of source. Since prompt injection is unsolved and detection-based filters fail often enough to be unreliable, the only dependable containment is an enforcement layer that the untrusted text cannot influence — a deterministic policy engine that reads structured attributes rather than prose, and that an attacker's injected instructions therefore cannot argue with.

How is a deterministic policy engine different from an LLM guardrail?

A policy engine evaluates structured input (principal, action, resource, environment) against explicit, human-authored rules and returns the same permit/deny verdict every time, with a logged, explainable reason. An LLM guardrail evaluates free text probabilistically and can return different answers to the same input, with no auditable rule behind the decision. The engine is testable (you can assert that delete on production is always denied), auditable (every verdict is logged against a named rule), and unmanipulable by injected prose. A guardrail is a useful detector but a poor decider. Engines like Cedar, OPA, Oso, and Casbin are purpose-built for the decider role.

Which policy engine should I pick for an AI agent: Cedar, OPA, Oso, or Casbin?

Pick based on your deployment model and authorization shape. Choose Cedar if you want analyzable RBAC + ABAC policies and especially if you are on AWS (it backs Amazon Verified Permissions). Choose Open Policy Agent (OPA) if you want one CNCF-graduated, policy-as-code engine governing your agent alongside Kubernetes, CI/CD, and API gateways via Rego. Choose Oso if your agents act on behalf of users and you need relationship-based access control (ReBAC). Choose Casbin if you want a lightweight, in-process library embeddable directly in a Python, Go, Node, or Java agent with no separate service. The engine matters far less than the architectural rule: a deterministic component, not a language model, must own the final permit/deny verdict.

AI Agent Authorization: Don't Let the LLM Decide

AI Agent Authorization: Don't Let the LLM Decide

Key Takeaways

Why is letting an LLM authorize agent actions dangerous?

What is the difference between a guardrail and an authorization decision?

How does the dual-LLM pattern enforce authorization deterministically?

Which policy engine should you use for AI agents?

How do you wire a policy engine into an agent's tool-calling loop?

What does this mean for MCP servers and tool gateways?

What are the common anti-patterns to avoid?

Frequently Asked Questions

Should I ever use an LLM in my authorization pipeline at all?

What is the lethal trifecta and why does it force deterministic authorization?

How is a deterministic policy engine different from an LLM guardrail?

Which policy engine should I pick for an AI agent: Cedar, OPA, Oso, or Casbin?

Subscribe to the newsletter

About the Author

Cite this Article

Related Articles

Agent Memory: Permission vs Purpose Failure Modes

MCP Explained: Complete Protocol Guide 2026

AgentCore vs LangChain: 2026 Framework Guide

Browse More Topics

Related Articles

Agent Memory: Permission vs Purpose Failure Modes

MCP Explained: Complete Protocol Guide 2026

AgentCore vs LangChain: 2026 Framework Guide