Beyond Injection Detection: A Positive-Security Prompt Firewall that Closes the Scope and PHI Gap SOTA Classifiers Miss in Healthcare


Large language models embedded in autonomous agents process trusted instructions and untrusted data in one context window, leaving them open to direct and indirect prompt injection. In healthcare this is not hypothetical: a 2025 JAMA Network Open study found commercial medical LLMs followed injected instructions in 94.4% of simulated patient encounters, including life-threatening recommendations [21]. Yet the clinically decisive problem we quantify here is different. Most real clinical threats, such as protected health information (PHI) exfiltration, cross-patient access, bulk export, and out-of-scope advice, are fluent, legitimate-looking requests that carry no attack signal, so even a state-of-the-art injection detector passes them. Existing runtime guardrails trade safety against latency: model-based auditors are accurate but add hundreds of milliseconds of Python inference, while lexical filters are fast but blind to obfuscated or semantically disguised payloads. We present QFIRE, an inline, provider-agnostic prompt firewall implemented as a single self-contained Rust toolchain (proxy, CLI, and benchmark harness). QFIRE combines three mechanisms: (i) positive-security scope constraints, which restrict a model call to a declared natural-language purpose and block out-of-scope drift even when no overt attack token is present; (ii) an asynchronous detector graph that runs N rules and their detector nodes concurrently, cheapest checks first; and (iii) a de-obfuscation pass that decodes Base64/hex/ROT13, folds homoglyphs and leetspeak, and strips zero-width characters before detection. QFIRE ships 106 versioned firewall rules and a dedicated HIPAA Safe-Harbor (18-identifier) PHI panel, and runs a local DeBERTa-v3 injection classifier via embedded ONNX Runtime.
On 1,968 public prompt-injection and jailbreak prompts, QFIRE's deterministic hybrid attains F1 0.86, statistically tied with Meta's state-of-the-art PromptGuard-2 (0.86) and above protectai DeBERTa-v3 (0.83); lexical baselines lag (0.16-0.50). Our central result is on QFIRE-HealthBench, a new 2,000-prompt healthcare benchmark we build and release with real garak and Microsoft PyRIT [25] payloads. There the same PromptGuard-2 recovers only 0.40 recall (DeBERTa-v3 0.57), because most clinical threats carry no injection signal; QFIRE's combined scope+PHI chain reaches 0.83 recall (F1 0.87) at a calibrated 0.08 false-positive rate. Generic injection detection, even state-of-the-art, is therefore necessary but not sufficient for healthcare agents. A bare LLM judge also closes most of this static-corpus gap (F1 0.90); QFIRE's contribution beyond static accuracy is auditable determinism, bounded latency, and adaptive robustness, where the bare judge falls to 34-59% recall (§5.5). End-to-end, placing QFIRE in front of a tool-using agent over a mock-EHR sandbox cuts the agent's harmful-action rate from 0.38 to 0.00 at a 0.13 benign-utility cost. All code, rules, corpora snapshots, and scripts are released, and every table regenerates from a single make paper target against local models with no paid API keys.
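
To make mechanism (iii) concrete, the following is a minimal, standard-library-only Rust sketch of a de-obfuscation pass. It is not QFIRE's implementation: every function name here (`strip_zero_width`, `fold_leet`, `rot13`, `decode_hex`, `candidate_views`, `flags`) is hypothetical, the leetspeak table is a small illustrative subset, and Base64 decoding and homoglyph folding are left out to keep the example dependency-free. The idea it illustrates is that a detector scans several normalized views of a prompt rather than the raw text alone.

```rust
/// Remove zero-width and BOM characters that can invisibly split a keyword.
fn strip_zero_width(s: &str) -> String {
    s.chars()
        .filter(|c| !matches!(*c, '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{2060}' | '\u{FEFF}'))
        .collect()
}

/// Fold a few common leetspeak substitutions to letters and lowercase the rest.
/// (Illustrative subset only; a real table would be larger.)
fn fold_leet(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            '0' => 'o',
            '1' => 'i',
            '3' => 'e',
            '4' | '@' => 'a',
            '5' | '$' => 's',
            '7' => 't',
            other => other.to_ascii_lowercase(),
        })
        .collect()
}

/// ROT13 over ASCII letters; every other character passes through unchanged.
fn rot13(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            'a'..='z' => ((c as u8 - b'a' + 13) % 26 + b'a') as char,
            'A'..='Z' => ((c as u8 - b'A' + 13) % 26 + b'A') as char,
            other => other,
        })
        .collect()
}

/// Decode a string made only of hex digit pairs into UTF-8 text.
/// Returns None when the input is not valid hex or not valid UTF-8.
fn decode_hex(s: &str) -> Option<String> {
    let t = s.trim();
    if t.is_empty() || t.len() % 2 != 0 || !t.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let bytes: Option<Vec<u8>> = (0..t.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&t[i..i + 2], 16).ok())
        .collect();
    String::from_utf8(bytes?).ok()
}

/// Build the normalized views a downstream detector would scan.
fn candidate_views(input: &str) -> Vec<String> {
    let clean = strip_zero_width(input);
    let mut views = vec![clean.clone(), fold_leet(&clean), rot13(&clean).to_lowercase()];
    if let Some(decoded) = decode_hex(&clean) {
        views.push(decoded.to_lowercase());
    }
    views
}

/// Toy lexical detector: true if any view contains the (lowercase) needle.
fn flags(input: &str, needle: &str) -> bool {
    candidate_views(input).iter().any(|v| v.contains(needle))
}
```

Under these assumptions, `flags("1gn\u{200B}0r3 previous instructions", "ignore")` returns `true`, while a benign prompt such as `"what is the adult dose of ibuprofen"` is not flagged. The folded views are used only for detection; the original prompt would be forwarded unmodified, since leetspeak folding also rewrites legitimate digits such as dosages.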

Schwoebel, J. et al. · CC-BY 4.0