PII, PHI, and Secrets in AI Agent Outputs: How to Detect and Block Sensitive Data Leakage in Real Time

Your AI agent doesn't know the difference between a helpful answer and a compliance violation. Social Security numbers, medical record numbers, API keys — if it's in the context, it's in the response. Here's how to stop it.

The Problem: AI Agents Treat All Data Equally

When a customer service agent summarizes a Jira ticket, it summarizes everything in the ticket — including the customer's Social Security number that someone pasted into the description field. When a healthcare agent retrieves patient records to answer a clinician's question, it may include the Medical Record Number, ICD-10 codes, and prescription details in its response. When a developer assistant reads a configuration file, it might surface the AWS access key that was committed alongside the actual config values.

AI models have no native concept of data sensitivity. To a language model, a Social Security number is just a string of digits. An API key is just a token. A patient's diagnosis code is just context for a better answer. The model optimizes for helpfulness, and the most helpful response often includes the most specific data available — which is exactly the data that should never leave the system.

This isn't a theoretical risk. Sensitive information disclosure ranks near the top of both the OWASP Top 10 for LLM Applications (LLM02 in the 2025 revision, LLM06 in the original list) and the OWASP Agentic Top 10. IBM's 2025 Cost of a Data Breach Report found that breaches involving shadow AI cost an average of $4.63 million per incident. And as agents gain access to databases, CRMs, file systems, and communication tools, every integration point becomes a potential leakage channel.

Why Prompt Instructions Don't Solve This

The instinct is to add a system prompt instruction: "Do not include sensitive information in your output." This approach fails for three reasons.

Models don't reliably follow negative instructions. Red-teaming results across the major AI labs consistently show that models trained for helpfulness will leak data given the right prompt — even with explicit safety training. A sufficiently creative prompt can extract information that the system prompt told the model to withhold.

Models can't classify sensitivity. A system prompt that says "don't include PII" requires the model to know what PII looks like in every format, across every jurisdiction, in every context. Models aren't trained as data classifiers. They'll catch an obviously formatted SSN some of the time and miss it in free text most of the time.

The leakage often happens in tool outputs, not model generation. When an agent calls a database query tool or reads a file, the raw data passes through the model's context window. Even if the model's generated response omits the sensitive data, the tool output itself — visible in logs, debugging interfaces, and agent-to-agent messages — may contain it.

Sensitive data detection must happen at the infrastructure layer, not the model layer. Every message — inbound prompts, outbound responses, tool calls, and tool outputs — must be scanned before it moves to the next step.

The Four Categories of Sensitive Data AI Agents Leak

1. Personally Identifiable Information (PII)

The broadest category and the one with the most regulatory exposure. PII includes Social Security numbers, email addresses, phone numbers, credit card numbers, driver's license numbers, passport numbers, and any other data that can identify a specific individual.

PII leakage through AI agents typically happens in three scenarios. First, a user provides PII in their input (a customer pasting their SSN into a support chat), and the agent echoes it back or includes it in a summary. Second, the agent retrieves PII from a connected data source (a CRM, database, or file system) and includes it in its response. Third, PII appears in RAG-retrieved documents and flows through the model's context into its output.

The challenge with PII detection is format variability. A Social Security number might appear as "123-45-6789," "123 45 6789," or "SSN: 123456789." A phone number might be formatted with country codes, parentheses, dots, or dashes. Credit card numbers may or may not include spaces. Effective detection requires matching these patterns across all common format variations — not just the canonical format.
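
To make format tolerance concrete, here is a minimal Python sketch of variant-aware matching. The patterns and samples are simplified illustrations, not a production pattern library.

```python
import re

# Simplified illustrations: each pattern tolerates the common separator
# variants rather than assuming a single canonical format.
SSN_RE = re.compile(r"\b\d{3}[-\s]?\d{2}[-\s]?\d{4}\b")      # 123-45-6789, 123 45 6789, 123456789
PHONE_RE = re.compile(r"(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")            # 13-16 digits, spaces/dashes optional

# Note: real-world patterns collide (order numbers, ZIP+4 extensions look
# like SSNs); downstream confidence scoring disambiguates such matches.
for sample in ["SSN: 123456789", "(555) 867-5309", "4111 1111 1111 1111"]:
    for label, pattern in [("SSN", SSN_RE), ("PHONE", PHONE_RE), ("CARD", CARD_RE)]:
        if pattern.search(sample):
            print(f"{label} candidate: {sample!r}")
```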

2. Protected Health Information (PHI)

PHI is PII's higher-stakes sibling in healthcare contexts. Under HIPAA, PHI includes any individually identifiable health information — Medical Record Numbers (MRNs), ICD-10 diagnosis codes, CPT procedure codes, National Provider Identifier (NPI) numbers, prescription details including dosages, dates of service, and any combination of a health condition with a patient identifier.

PHI leakage is particularly dangerous in AI agents because healthcare is one of the sectors adopting agentic AI most aggressively. Clinical decision support agents, patient communication assistants, prior authorization bots, and research summarization tools all handle PHI routinely. And HIPAA penalties for unauthorized disclosure are severe — up to $2.1 million per violation category per year, with criminal penalties for knowing violations.

PHI detection is more complex than PII detection because many PHI elements are domain-specific. An ICD-10 code like "J18.9" looks like a random alphanumeric string unless the detection system knows the ICD-10 format. An MRN is typically a facility-specific numeric identifier with no universal format. NPI numbers follow a specific 10-digit pattern with a check digit. Detection requires healthcare-specific pattern libraries, not just general PII patterns.
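
A hedged Python sketch of what healthcare-specific validation looks like. The ICD-10 pattern below is a simplified approximation of the code format (real validation checks against the published code set); the NPI check uses the publicly documented Luhn algorithm over the number prefixed with "80840".

```python
import re

# Simplified ICD-10-CM shape, e.g. "J18.9". Not a substitute for
# validating against the actual code set.
ICD10_RE = re.compile(r"\b[A-Z]\d{2}(?:\.[A-Z0-9]{1,4})?\b")

def luhn_checksum(digits: str) -> int:
    """Standard Luhn mod-10 checksum over a digit string."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10

def is_valid_npi(candidate: str) -> bool:
    # NPIs are 10 digits; the check digit is computed with the Luhn
    # algorithm applied to the number prefixed with "80840".
    if not re.fullmatch(r"\d{10}", candidate):
        return False
    return luhn_checksum("80840" + candidate) == 0

print(is_valid_npi("1234567893"))  # True: the commonly used test NPI
print(is_valid_npi("1234567890"))  # False: check digit fails
```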

3. Secrets and Credentials

API keys, access tokens, private keys, JWTs, and passwords represent a different risk profile than PII or PHI. While PII leakage creates regulatory exposure, secrets leakage creates immediate operational exposure — a leaked AWS access key can be exploited within minutes by automated credential scanners.

The landscape of credential formats is vast and constantly expanding. AWS access keys follow a specific pattern (beginning with "AKIA"). GitHub tokens have their own format. OpenAI, Anthropic, and Stripe keys each have distinctive prefixes. Private keys have well-known header/footer markers. JWTs have a characteristic three-segment base64 structure. Slack tokens, Discord tokens, and generic API keys each have recognizable patterns.
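
Here is a minimal sketch of pattern-based secrets matching, using a handful of publicly documented formats. Real scanners track hundreds of provider-specific patterns and update them continuously as formats change.

```python
import re

# Illustrative patterns only. The prefixes shown (AKIA, ghp_, the JWT
# "eyJ" header) are publicly documented formats.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token":   re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private_key":    re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    "jwt":            re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\b"),
}

def find_secrets(text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_string) pairs found in text."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group()))
    return hits
```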

Secrets leakage through AI agents happens most commonly when agents have access to configuration files, environment variables, code repositories, or deployment systems. A developer assistant that reads a .env file, a DevOps agent that queries infrastructure configuration, or a code review agent that processes commits containing hardcoded credentials can all surface secrets in their outputs.

4. Token and Credential Fragments

Beyond complete credentials, agents can leak partial credential information that, in combination, enables account compromise. The last four digits of a credit card paired with a billing zip code. A password hint combined with a username. A partial API key that narrows the brute-force space.

This category is the hardest to detect because the individual fragments may not match any known sensitive data pattern. Detection requires understanding not just what the data looks like, but what combinations of data points constitute a security risk — which is fundamentally a context problem, not a pattern-matching problem.
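
One way to approximate combination risk is to track co-occurrence of fragment types within a single message. The fragment patterns and risky pairs below are illustrative assumptions, not a complete model of combination risk.

```python
import re

# None of these fragments is sensitive alone; certain co-occurrences
# within one message are. Patterns are illustrative simplifications.
FRAGMENTS = {
    "card_last4": re.compile(r"\bending (?:in )?\d{4}\b", re.IGNORECASE),
    "zip_code":   re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    "username":   re.compile(r"\busername[:=]\s*\S+", re.IGNORECASE),
    "pw_hint":    re.compile(r"\bpassword hint\b", re.IGNORECASE),
}

# Hypothetical pairs that together enable card verification or takeover.
RISKY_COMBINATIONS = [
    {"card_last4", "zip_code"},
    {"username", "pw_hint"},
]

def combination_risk(text: str) -> list[set[str]]:
    present = {name for name, p in FRAGMENTS.items() if p.search(text)}
    return [combo for combo in RISKY_COMBINATIONS if combo <= present]

msg = "Card ending in 4242, billing zip 94103"
print(combination_risk(msg))  # [{'card_last4', 'zip_code'}] (set order may vary)
```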

Architecture Principles for Sensitive Data Detection

Scan Everything, Not Just Model Output

The most common deployment mistake is scanning only the agent's final response. This misses three critical leakage channels. Tool call arguments may contain sensitive data from the user's input. Tool responses carry raw data from connected systems before the model processes it. Agent-to-agent messages in multi-agent systems pass data between contexts without any user-facing output to scan.

Effective sensitive data detection must evaluate every message in the agent pipeline — inbound, outbound, tool calls, tool responses, and inter-agent communication.
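
In code, this principle reduces to routing every message type through the same scanning chokepoint. The Message envelope and enforce function below are an illustrative sketch, not any particular framework's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Message:
    channel: str   # "prompt", "response", "tool_call", "tool_output", "agent_to_agent"
    content: str

def enforce(message: Message, scan: Callable[[str], list]) -> Message:
    # The same scan runs on every channel: tool outputs and inter-agent
    # messages get no exemption just because they are not user-facing.
    findings = scan(message.content)
    if findings:
        raise PermissionError(f"sensitive data in {message.channel}: {findings}")
    return message
```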

Combine Structured and Keyword Detection

Structured detection catches data that follows a known format — SSN patterns, credit card Luhn checks, ICD-10 code formats, JWT structures. Keyword detection catches contextual references that indicate sensitive data even when the format is ambiguous — terms like "diagnosis," "medical record," "prescription," "dosage," "patient ID" adjacent to data that might otherwise look innocuous.

Neither approach alone provides adequate coverage. Structured detection misses sensitive data in free text. Keyword detection generates false positives without structured validation. The combination is what produces both coverage and precision.
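
A sketch of how the two signals might be fused: a weak structural match (an MRN-length number) scores low on its own and high when a sensitivity keyword appears nearby. The window size and scores are illustrative knobs, not calibrated values.

```python
import re

MRN_RE = re.compile(r"\b\d{6,10}\b")     # facility-specific format: weak alone
KEYWORDS = re.compile(r"\b(?:mrn|medical record|patient id|diagnosis)\b", re.IGNORECASE)

def score_mrn_candidates(text: str, window: int = 40) -> list[tuple[str, float]]:
    results = []
    for m in MRN_RE.finditer(text):
        # Look at the text surrounding the structural match.
        context = text[max(0, m.start() - window): m.end() + window]
        score = 0.3                       # bare structural match: low confidence
        if KEYWORDS.search(context):
            score = 0.9                   # structure plus keyword: high confidence
        results.append((m.group(), score))
    return results

print(score_mrn_candidates("MRN: 84721093"))          # [('84721093', 0.9)]
print(score_mrn_candidates("Order 84721093 shipped")) # [('84721093', 0.3)]
```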

Apply Configurable Confidence Thresholds

Not every pattern match is equally likely to be genuine sensitive data. A nine-digit number might be an SSN — or it might be a zip code with a four-digit extension, an order number, or a random identifier. A string matching an API key pattern might be an actual credential — or it might be a test fixture, a documentation example, or a hash.

The detection system needs to produce confidence scores, and the policy layer needs to support different enforcement actions at different thresholds. High-confidence matches (a Luhn-valid credit card number, a JWT with valid structure, an AWS key with the correct prefix) warrant immediate blocking. Lower-confidence matches warrant logging or flagging for human review.
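
A sketch of threshold-based enforcement, reusing the Luhn helper shown in the NPI example above. The specific thresholds and action names are illustrative policy choices.

```python
def luhn_checksum(digits: str) -> int:
    """Standard Luhn mod-10 checksum (same helper as the NPI example)."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10

def card_confidence(candidate: str) -> float:
    digits = "".join(ch for ch in candidate if ch.isdigit())
    if not 13 <= len(digits) <= 19:
        return 0.0
    # Luhn-valid numbers are high confidence; Luhn failures are more
    # likely order numbers or other identifiers.
    return 0.95 if luhn_checksum(digits) == 0 else 0.2

def action_for(confidence: float) -> str:
    if confidence >= 0.9:
        return "block"
    if confidence >= 0.5:
        return "flag_for_review"
    return "log"

print(action_for(card_confidence("4111 1111 1111 1111")))  # block (Luhn-valid test number)
print(action_for(card_confidence("1234 5678 9012 3456")))  # log (fails Luhn)
```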

Enforce in Hardware Isolation

If sensitive data detection runs in the same process as the agent, a compromised agent can bypass or disable it. Evaluating sensitive data policies inside a Trusted Execution Environment makes the detection logic tamper-resistant — the agent can't modify or circumvent the scanning, even if it's been fully compromised by a prompt injection attack.

This is particularly important for secrets detection, where the stakes of a bypass are immediate. A leaked API key that isn't caught because the agent disabled its own detection layer can be exploited before any human reviews the logs.

The Compliance Map

Sensitive data detection maps to the most heavily audited compliance requirements in enterprise security:

  • HIPAA Security Rule (§164.312): Technical safeguards for electronic PHI, including access controls, audit controls, and transmission security.
  • GDPR Articles 25 & 32: Data protection by design and appropriate technical measures to ensure security of processing.
  • PCI DSS Requirement 3: Protect stored cardholder data — applies to any agent that processes or transmits payment card information.
  • SOC 2 (CC6.1, CC6.7): Logical access controls and restrictions on data transmission to authorized parties.
  • CCPA/CPRA: Reasonable security procedures and practices to protect personal information. California's ADMT transparency requirements take effect in 2026.
  • OWASP LLM Top 10 (LLM02:2025): Sensitive Information Disclosure — the direct mapping for this entire category.
  • NIST AI RMF (Govern, Map, Manage): Risk management functions covering AI-specific data protection.

How Spellguard Handles This

Spellguard's policy engine scans every message in the agent pipeline — inbound, outbound, tool calls, and tool responses — for PII, PHI, secrets, and credential patterns. Detection runs in real time inside a Trusted Execution Environment, ensuring the scanning can't be bypassed even by a compromised agent.

PII detection covers the highest-risk identifier categories out of the box. PHI detection is purpose-built for healthcare contexts, covering medical record numbers, diagnosis codes, procedure codes, provider identifiers, and prescription information with configurable confidence thresholds. Secrets detection spans the major cloud provider, SaaS platform, and generic credential formats, including API keys, access tokens, private keys, and JWTs.

All sensitive data policies ship enabled on the free tier. For organizations that need to add industry-specific patterns, adjust confidence thresholds, configure redaction versus blocking behavior, or route alerts to existing SIEM infrastructure, the policy SDK supports full customization.

Sign up for free to start detecting sensitive data in your agent pipelines today, or book a demo to see how Spellguard catches the PII, PHI, and secrets your current stack misses.

This is Part 3 of a 9-part series on AI agent security policies. Next up: Toxic & Harmful Content Filtering — how to detect toxicity, NSFW content, and crisis-level messages in AI agent interactions before they reach your users.
