Prompt Firewall

The Prompt Firewall is a route-attached inspection layer that scans request payloads for sensitive text and metadata before any provider call. It is intentionally explicit and intentionally limited: a deterministic detection layer for well-defined risk patterns, not a semantic prompt-injection defense.

The Prompt Firewall is one layer of defense within the governed execution path. Use it alongside the project policy engine and provider-side guardrails; it is not a substitute for either.

Strengthen-only model

The Prompt Firewall enforces a non-weakenable platform baseline. Every project starts with the full set of platform-baseline rules in block mode. From there, projects on Business and Enterprise plans can add detection rules and tighten detection profiles. No project can remove or relax baseline rules.

This is a deliberate, locked posture. Three properties hold for every project on every supported route:

  • The baseline is enforced. Platform-baseline rules always run in block mode for every project on every supported route.
  • Weakening is silently corrected. Attempts to set a baseline rule to allow, or to set its enabled flag to false, are silently rewritten back to the baseline. Strengthening configuration that conflicts with the baseline is normalized at write time, so the deployed state is always at or above the floor.
  • Strengthening is additive. Custom rules and stricter profiles add coverage on top of the baseline. They never replace baseline rules.

The “Option C” label captures this in one phrase: customers can strengthen, never weaken below the platform baseline. Strengthening configuration is a separate authoring surface, gated by plan tier; the baseline is universal.
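The strengthen-only behavior can be pictured as a normalization step applied when configuration is written. The following is a minimal sketch under stated assumptions, not Keel's implementation: the rule ids, the `action`/`enabled` settings shape, and the `normalize_overrides` helper are all hypothetical and chosen only to illustrate the three properties above.

```python
from copy import deepcopy

# Hypothetical baseline rule ids, for illustration only.
# Real rule identifiers are runtime-internal and not part of the public contract.
BASELINE_RULE_IDS = {"baseline_credentials", "baseline_identifiers"}


def normalize_overrides(requested_rules):
    """Return rule settings with every baseline rule forced back to the floor."""
    normalized = deepcopy(requested_rules)
    for rule_id in BASELINE_RULE_IDS:
        # Baseline rules always end up enabled and in block mode,
        # whatever the caller asked for (or omitted).
        normalized[rule_id] = {"action": "block", "enabled": True}
    return normalized


requested = {
    "baseline_credentials": {"action": "allow", "enabled": True},   # weakening attempt
    "baseline_identifiers": {"action": "block", "enabled": False},  # weakening attempt
    "internal_codename": {"action": "block", "enabled": True},      # additive custom rule
}
deployed = normalize_overrides(requested)
```

In this sketch both weakening attempts are rewritten to block and enabled, while the custom rule passes through unchanged, so the deployed state never falls below the baseline.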

Per-surface coverage matrix

The Prompt Firewall runs on most public execution surfaces. Coverage is route-scoped — a route either runs the firewall or it does not — and the field categories inspected vary by surface.

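| Surface | Firewall runs | Message content | Tool / function arguments | Multimodal references | Metadata fields | Filename / URL fields |
| --- | --- | --- | --- | --- | --- | --- |
| POST /v1/executions | Yes | Yes | Yes | Yes | Yes | Yes |
| POST /v1/execute | Yes | Yes | Yes | Provider-dependent | Provider-dependent | Provider-dependent |
| POST /v1/proxy/openai | Yes | Yes | Yes | Yes | Yes | Yes |
| POST /v1/proxy/anthropic | Yes | Yes | Yes | Limited | Yes | No |
| POST /v1/proxy/google | Yes | Yes | Yes | Yes | Yes | Yes |
| POST /v1/proxy/xai | Yes | Yes | Limited | No | No | No |
| POST /v1/proxy/meta | Yes | Yes | Limited | Limited | No | No |
| POST /v1/permits | No | | | | | |
| POST /v1/jobs (submission) | Deferred | | | | | |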

Reading the matrix:

  • Yes: the field category is inspected on this surface against every applicable rule.
  • No: the field category is not inspected on this surface today.
  • Limited: coverage exists but is narrower than on the broader-coverage routes.
  • Provider-dependent: POST /v1/execute resolves provider/model first, then runs the corresponding provider-specific extractor; coverage matches the resolved provider’s row.
  • Deferred: POST /v1/jobs does not run the firewall on submission. Governance and firewall evaluation happen later in the worker execution path.
  • No (permits): POST /v1/permits is the canonical decision seam, not a provider-dispatch surface. Permit evaluation runs the policy engine but does not run the prompt firewall.
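For client-side reasoning about which field categories are inspected on a given route, the matrix can be encoded as a small lookup. This is an illustrative sketch only: the category keys and the `coverage_for` helper are hypothetical names, and the handling of POST /v1/execute assumes that only the cells marked Provider-dependent follow the resolved provider's row.

```python
# Category order mirrors the matrix columns after "Firewall runs".
CATEGORIES = ("message_content", "tool_arguments", "multimodal", "metadata", "filename_url")

# One row per proxy surface, copied from the matrix.
PROVIDER_ROWS = {
    "openai": ("Yes", "Yes", "Yes", "Yes", "Yes"),
    "anthropic": ("Yes", "Yes", "Limited", "Yes", "No"),
    "google": ("Yes", "Yes", "Yes", "Yes", "Yes"),
    "xai": ("Yes", "Limited", "No", "No", "No"),
    "meta": ("Yes", "Limited", "Limited", "No", "No"),
}


def coverage_for(surface, resolved_provider=None):
    """Return {category: level}, or None when the firewall does not run on the surface."""
    if surface == "POST /v1/executions":
        row = ("Yes",) * 5
    elif surface == "POST /v1/execute":
        # Message content and tool arguments are "Yes" in the matrix;
        # the remaining cells follow the provider that was resolved.
        row = ("Yes", "Yes") + PROVIDER_ROWS[resolved_provider][2:]
    elif surface.startswith("POST /v1/proxy/"):
        row = PROVIDER_ROWS.get(surface.rsplit("/", 1)[-1])
        if row is None:
            return None
    else:
        # Permits, job submission, and any route outside the matrix:
        # treat as not running firewall inspection at this point.
        return None
    return dict(zip(CATEGORIES, row))
```

For example, `coverage_for("POST /v1/execute", "xai")` reports multimodal references as "No", while `coverage_for("POST /v1/permits")` returns `None`.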

What gets inspected

Inspection happens on category-level field classes drawn from the request payload. The categories are common to every route that runs the firewall; specific extractor field paths are part of the runtime, not the public contract.

The categories Keel inspects on covered routes:

  • System, developer, and user message content — the conversation text the model would otherwise see directly.
  • Tool and function arguments — text or JSON sent as tool descriptions, tool-use inputs, function-call arguments, or function-response payloads.
  • Supported multimodal references — declared text descriptors and asset metadata attached to images, audio, or other multimodal inputs (captions, alt text, transcripts, asset descriptors).
  • Request metadata text — text-bearing values supplied alongside the conversation (for example, free-form metadata fields that ride along with the request).
  • Filename and URL references — filename and URL fields that surface in the request payload, with extracted-domain detection on URLs.
  • Provider-shaped passthrough fields — text-bearing values supplied to forward provider-native arguments or options.

Coverage of each category by surface is shown in the matrix above. Newly introduced text-bearing payload fields may not be inspected until coverage is extended for that route — see Scope and Limits § Content inspection.

Platform baseline rule categories

The platform baseline is enforced at block for every project on every supported route. The detection categories cover well-defined risk patterns and shapes:

| Category | What it covers |
| --- | --- |
| Credential leakage | Strings shaped like API keys, bearer tokens, custom API-key headers, and PEM-style private key blocks |
| Identifier exposure | Government identifier patterns (e.g. US SSN-shaped sequences) and payment-card-shaped strings |
| Prompt-attack shapes | Common pasteable phrasings used to redirect, override, or exfiltrate system instructions and context secrets |
| Sensitive resource references | Filenames suggesting sensitive content and URL/domain shapes flagged as risky |

These are deterministic pattern detectors, not semantic classifiers. They cover specific shapes the platform commits to blocking; they do not interpret intent. Specific rule identifiers and per-pattern coverage are part of the runtime, not part of the public contract — exact rule lists evolve as we add coverage.
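To make the difference between a deterministic shape detector and a semantic classifier concrete, here is a small sketch. The rule ids and regular expressions below are assumptions written for this example; they are not the platform's actual rules, which are not published.

```python
import re

# Illustrative shape detectors only; not Keel's real baseline rules.
DETECTORS = {
    "ssn_shaped": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "pem_private_key": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "bearer_token": re.compile(r"\bbearer\s+[A-Za-z0-9._~+/-]{20,}", re.IGNORECASE),
}


def scan(text):
    """Return the ids of every detector whose pattern appears in the text."""
    return [rule_id for rule_id, pattern in DETECTORS.items() if pattern.search(text)]


hits = scan("my ssn is 123-45-6789")
misses = scan("my social security number is one two three, four five")
```

The first call matches because the text has the exact shape the pattern describes. The second call returns an empty list even though a human reader sees the same intent, which is the trade-off a pattern-based layer makes.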

Strengthening: custom rules and profiles

Projects on Business plans and above can strengthen their firewall posture beyond the platform baseline. Strengthening is configured under prompt_firewall_strengthening on the project policy overrides and authored through the project policy surface:

PATCH /v1/projects/{project_id}/policy

This is a policy-authoring surface authenticated with a project owner’s user token, not a runtime surface authenticated with a project API key.

Custom rules

Custom rules are additional regex-based detectors that run alongside the platform baseline. Each custom rule has:

  • id — lowercase alphanumeric and underscores. Must not collide with platform-baseline rule ids.
  • name — a human-readable label.
  • pattern — a regular expression compiled at validation time.
  • action — always block. Strengthening can only add blocks; it cannot author allow rules.

Custom rules are validated at write time. An invalid pattern, an id collision with a baseline rule, or any non-block action is rejected with a configuration error rather than silently dropped.
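A client can run the same kinds of checks locally before sending a rule, to catch obvious mistakes early. The sketch below is an assumption-laden approximation: `validate_custom_rule` is a hypothetical helper, baseline rule ids are not published so they are passed in by the caller, treating `name` and `pattern` as required is a guess, and Python's `re` syntax may differ from the regex dialect the server compiles.

```python
import re

ID_RE = re.compile(r"[a-z0-9_]+")


def validate_custom_rule(rule, baseline_ids=frozenset()):
    """Return a list of problems with a custom rule; an empty list means it looks valid."""
    errors = []
    rule_id = str(rule.get("id", ""))
    if not ID_RE.fullmatch(rule_id):
        errors.append("id must be lowercase alphanumeric and underscores")
    elif rule_id in baseline_ids:
        errors.append("id collides with a platform-baseline rule id")
    if not rule.get("name"):
        errors.append("name is required")
    pattern = rule.get("pattern")
    if not isinstance(pattern, str) or not pattern:
        errors.append("pattern is required")
    else:
        try:
            re.compile(pattern)  # local approximation of validation-time compilation
        except re.error as exc:
            errors.append(f"pattern does not compile: {exc}")
    if rule.get("action") != "block":
        errors.append("action must be 'block'")
    return errors


rule = {
    "id": "internal_codename",
    "name": "Block internal codename",
    "pattern": r"(?i)project\s+sunrise",
    "action": "block",
}
problems = validate_custom_rule(rule)
```

A local pass does not guarantee the server will accept the rule; the write-time validation remains authoritative.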

Profiles

profile selects a named strictness level for the firewall:

  • baseline — the default. Platform-baseline rules at block.
  • strict — a forward-compatible extension point for additional baseline tightening, currently equivalent to baseline at runtime.

Existing strict configurations will continue to validate cleanly when additional strict-mode coverage ships under that profile, so selecting strict today does not require a later configuration change.

Worked example

Strengthen a project with one custom rule and the strict profile:

    curl -X PATCH https://api.keelapi.com/v1/projects/<project_id>/policy \
      -H "Authorization: Bearer <user_token>" \
      -H "Content-Type: application/json" \
      -d '{
        "prompt_firewall_strengthening": {
          "profile": "strict",
          "custom_rules": [
            {
              "id": "internal_codename",
              "name": "Block internal codename",
              "pattern": "(?i)project\\s+sunrise",
              "action": "block"
            }
          ]
        }
      }'
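The same request can be assembled from Python's standard library. This is a sketch that only builds the request object; the `build_policy_patch` helper, the `proj_123` project id, and the token value are placeholders invented for illustration, and sending the request is left to the caller.

```python
import json
import urllib.request

API_BASE = "https://api.keelapi.com"  # host taken from the curl example


def build_policy_patch(project_id, user_token, strengthening):
    """Build (but do not send) the PATCH request that authors strengthening config."""
    body = json.dumps({"prompt_firewall_strengthening": strengthening}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/v1/projects/{project_id}/policy",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {user_token}",  # project owner's user token
            "Content-Type": "application/json",
        },
    )


request = build_policy_patch(
    "proj_123",                 # placeholder project id
    "user-token-placeholder",   # placeholder user token
    {
        "profile": "strict",
        "custom_rules": [
            {
                "id": "internal_codename",
                "name": "Block internal codename",
                "pattern": r"(?i)project\s+sunrise",
                "action": "block",
            }
        ],
    },
)
# To send it: urllib.request.urlopen(request)
```

Using a raw string for the pattern and letting `json.dumps` handle escaping avoids hand-writing the doubled backslash that the JSON body needs.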

A subsequent execution that contains project sunrise in any inspected field is blocked. The deny permit carries the matched rule:

    {
      "decision": "deny",
      "reason_code": "prompt_firewall_blocked",
      "policy_id": "prompt_firewall_v1",
      "policy_version": "1.0.0",
      "deny_details": {
        "matched_rule_ids": ["internal_codename"],
        "field_path": "messages[0].content",
        "pattern_ids": ["internal_codename"],
        "occurrence_counts": {"internal_codename": 1}
      }
    }

Custom rules and platform baseline rules are evaluated in the same pass. A custom-rule match produces the same deny outcome shape and the same deny-permit record as a platform-rule match.

Decision behavior on a firewall block

When a firewall rule matches, Keel persists a deny permit and stops the request before any provider call is made. The deny permit carries:

  • reason_code = "prompt_firewall_blocked"
  • policy_id = "prompt_firewall_v1"
  • policy_version = "1.0.0"
  • a deny_details object with matched_rule_ids, field_path, pattern_ids, and per-rule occurrence_counts

No provider dispatch occurs. The execution route also fails closed if blocked firewall state somehow reaches the dispatch boundary, so a firewall block is a hard stop regardless of where the block was detected in the route.
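On the caller's side, a firewall block can be told apart from other denials by its reason code. The helper below is a hypothetical sketch that assumes the deny permit has already been parsed into a dictionary with the fields listed above; how the permit is wrapped in each route's error envelope is not covered here.

```python
def summarize_firewall_block(permit):
    """Return a short summary if the permit is a firewall block, otherwise None."""
    if permit.get("decision") != "deny":
        return None
    if permit.get("reason_code") != "prompt_firewall_blocked":
        return None
    details = permit.get("deny_details", {})
    counts = details.get("occurrence_counts", {})
    return {
        "field_path": details.get("field_path"),
        # Pair each matched rule with how many times it occurred.
        "rules": [
            (rule_id, counts.get(rule_id, 0))
            for rule_id in details.get("matched_rule_ids", [])
        ],
    }


permit = {
    "decision": "deny",
    "reason_code": "prompt_firewall_blocked",
    "policy_id": "prompt_firewall_v1",
    "policy_version": "1.0.0",
    "deny_details": {
        "matched_rule_ids": ["internal_codename"],
        "field_path": "messages[0].content",
        "pattern_ids": ["internal_codename"],
        "occurrence_counts": {"internal_codename": 1},
    },
}
summary = summarize_firewall_block(permit)
```

Because a block is a hard stop, a caller would typically surface the field path and rule ids to the user rather than retry the same payload.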

When the firewall passes, meaning no configured rule matched the inspected text, Keel does not persist a separate firewall-pass record. Passing inspections are logged but do not produce a replayable success event. Only blocks appear in Timeline Replay as standalone firewall events.

prompt_firewall_blocked is part of the firewall denial reason-code system, distinct from the permit reason-code lexicon documented in Errors › Permit reason codes. See Errors for per-surface error envelope shapes and how firewall denials surface across permit, executions, execute, and proxy routes.

Operational boundaries

A few operational details worth knowing when reasoning about firewall behavior in production:

  • Idempotency replay. Successful idempotency replays on replay-capable proxy routes can return without a fresh firewall pass, because the replayed response is the prior response. POST /v1/executions and POST /v1/execute use a different reservation path from POST /v1/proxy/*.
  • Allow events are not persisted. Only firewall blocks are persisted as standalone events. A passing inspection does not produce a separate evidentiary record beyond the permit and the execution record themselves.
  • Async submission is deferred. POST /v1/jobs does not run the firewall on submission. The firewall runs in the worker execution path when the job dispatches.

What this surface does and does not claim

  • The firewall is an inbound, deterministic detection layer for well-defined risk patterns. It is not a semantic prompt-injection defense and does not score open-ended adversarial content.
  • The firewall does not OCR images, transcribe audio or video, or analyze pixels. Non-text media is only covered when upstream callers already supply extracted text fields.
  • The firewall does not score model outputs or scan provider responses.
  • Route coverage is explicit, not universal. Routes outside the per-surface matrix above should be treated as not running firewall inspection.
  • The firewall is one defense layer. It does not replace the project policy engine, output filtering on your side, or provider-side safety controls.

For the broader content-inspection scope and how the firewall fits into the full non-claim model, see Scope and Limits § Content inspection.

Plan tier availability

| Capability | Plan |
| --- | --- |
| Platform-baseline rules at block on every supported route | Every plan |
| Strengthening (custom rules and profiles) | Business and Enterprise |

Below the strengthening gate, projects still receive full platform-baseline protection. The gate affects custom-rule authoring, not baseline coverage.

For the full per-tier feature matrix, see Plans & Entitlements.

  • Security — the broader security boundary
  • Errors — error envelope shape and firewall denial reason codes
  • Permits — the deny-permit shape that firewall blocks emit
  • Executions — the execution surface that runs the firewall
  • Plans & Entitlements — full plan-tier entitlement matrix