PII Detection in Python: Regex vs NER vs Middleware Compared

April 4, 2026 · 9 min read

If you're sending user data to an LLM, storing free-text logs, or building anything that touches personal information, you need a strategy for PII detection. In Python, you have three realistic approaches: regular expressions, named entity recognition (NER), and a dedicated middleware layer.

Each one has real tradeoffs. This post walks through all three with working code, covers where each breaks down, and gives you a framework for choosing the right tool for your situation.

Why PII Detection Matters Right Now

Regulations are specific and the penalties are real. Under GDPR Article 5(1)(c), data minimization is a core principle: you should only process personal data that is adequate, relevant, and limited to what is necessary. Sending a user's full name, email, and medical history to a third-party API when you only need a sentiment score is a direct violation of that principle.

HIPAA's Privacy Rule (45 CFR §164.514) lists 18 categories of identifiers that must be removed before data is considered de-identified under the Safe Harbor method. These include names, geographic subdivisions smaller than a state, all elements of dates (other than the year) tied to an individual, ages over 89, phone numbers, email addresses, and IP addresses, among others.

The EU AI Act, whose obligations for high-risk AI systems are scheduled to apply from August 2026, adds further requirements around data governance for systems used in sensitive contexts like hiring, credit, and healthcare. If your Python application feeds data into one of those pipelines, detect and redact PII before the text ever leaves your environment.

Approach 1: Regex-Based PII Detection in Python

Regex is the fastest approach to get running. It works well for structured PII: email addresses, phone numbers, Social Security Numbers, credit card numbers, IP addresses, and similar patterns.

import re

PII_PATTERNS = {
    "email": r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+",
    "phone": r"\b(\+?1[\s\-.]?)?\(?\d{3}\)?[\s\-.]?\d{3}[\s\-.]?\d{4}\b",
    "ssn": r"\b\d{3}[-\s]?\d{2}[-\s]?\d{4}\b",
    "credit_card": r"\b(?:\d[ \-]?){13,16}\b",
    "ip_address": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
}

def detect_pii_regex(text):
    findings = []
    for label, pattern in PII_PATTERNS.items():
        for match in re.finditer(pattern, text):
            findings.append({
                "type": label,
                "value": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    return findings

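def redact_pii_regex(text):
    # Patterns are applied one after another in dict order, so each later
    # pattern runs over text that earlier patterns have already redacted.
    for label, pattern in PII_PATTERNS.items():
        text = re.sub(pattern, f"[{label.upper()}]", text)
    return text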

sample = "Contact jane.doe@example.com or call 415-555-0192 before 5pm."
print(redact_pii_regex(sample))
# Contact [EMAIL] or call [PHONE] before 5pm.

Regex is deterministic, fast, and requires no model loading. It is also brittle. It will not catch "my friend Sarah lives in Austin" as a person-location pair. It cannot detect names, organizations, or contextual references. And edge cases in phone number formatting will trip you up constantly.
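To make the phone-formatting point concrete, here is a small sketch of one such edge case. A pattern anchored with `\b`, like the phone entry in PII_PATTERNS, cannot start a match at an opening parenthesis, because a space followed by "(" is not a word boundary. The names PHONE_WORD_BOUNDARY and PHONE_LOOKAROUND are hypothetical, and the lookaround variant is only one possible fix under that assumption, not a complete phone matcher.

```python
import re

# Same shape as the "phone" entry in PII_PATTERNS: anchored with \b on both sides.
PHONE_WORD_BOUNDARY = r"\b(\+?1[\s\-.]?)?\(?\d{3}\)?[\s\-.]?\d{3}[\s\-.]?\d{4}\b"

# Hypothetical variant: lookarounds replace \b, so the match may begin at
# "(" or "+" as long as it is not glued to a preceding word character.
PHONE_LOOKAROUND = (
    r"(?<!\w)(?:\+?1[\s\-.]?)?\(?\d{3}\)?[\s\-.]?\d{3}[\s\-.]?\d{4}(?!\w)"
)

sample = "Call (800) 555-1234 now."

before = re.sub(PHONE_WORD_BOUNDARY, "[PHONE]", sample)
after = re.sub(PHONE_LOOKAROUND, "[PHONE]", sample)

print(before)  # Call ([PHONE] now.   <- stray "(" left behind
print(after)   # Call [PHONE] now.
```

The variant still accepts unbalanced input such as "(800 555-1234", which is one reason a dedicated phone-parsing library is often a better fit once formats get international.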

Use regex when: You're dealing with well-structured fields, you need zero latency overhead, or you're writing a quick pre-filter before a heavier pipeline.
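A regex pre-filter is more useful when its matches are validated cheaply before anything downstream acts on them. As a sketch of that idea, the credit card pattern from PII_PATTERNS can be paired with a Luhn checksum so that arbitrary 16-digit order numbers are not flagged as cards. The function names luhn_valid and find_card_numbers are hypothetical, and the 13 to 19 digit bound is an assumption about typical card lengths.

```python
import re

# Candidate pattern taken from the "credit_card" entry in PII_PATTERNS.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ \-]?){13,16}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:  # assumed range of card number lengths
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return regex candidates that also pass the Luhn check."""
    return [
        match.group().strip(" -")  # the pattern can swallow a trailing separator
        for match in CARD_CANDIDATE.finditer(text)
        if luhn_valid(match.group())
    ]


text = "Card 4111 1111 1111 1111 on file, order id 1234 5678 9012 3456."
print(find_card_numbers(text))  # ['4111 1111 1111 1111']
```

The checksum only rules out numbers that cannot be cards; a Luhn-valid string is still just a candidate.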

Approach 2: NER-Based PII Detection in Python

Named Entity Recognition models can detect unstructured PII that regex cannot. Person names, organization names, locations, dates, and more can all be tagged by a trained NER model. The two most common options in Python are spaCy and Microsoft's Presidio (which uses spaCy under the hood).

Using spaCy directly

import spacy

nlp = spacy.load("en_core_web_lg")

def detect_pii_ner(text):
    doc = nlp(text)
    findings = []
    for ent in doc.ents:
        # CARDINAL tags any bare number, so expect noisy matches from it.
        if ent.label_ in ("PERSON", "ORG", "GPE", "DATE", "CARDINAL"):
            findings.append({
                "type": ent.label_,
                "value": ent.text,
                "start": ent.start_char,
                "end": ent.end_char,
            })
    return findings

sample = "Dr. Emily Chen from Acme Corp submitted her report on March 3rd."
for f in detect_pii_ner(sample):
    print(f)

Using Presidio for PII redaction in Python

Presidio combines NER with pattern recognizers and gives you a cleaner API for PII redaction. It is the most complete open-source option for this use case.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Call John Smith at john@example.com or (800) 555-1234."

results = analyzer.analyze(text=text, language="en")
redacted = anonymizer.anonymize(text=text, analyzer_results=results)

print(redacted.text)
# Call <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.

Presidio gives you confidence scores, supports custom recognizers, and handles a wide range of entity types. The tradeoff is model loading time (several seconds on cold start) and memory footprint. Presidio's default configuration loads spaCy's en_core_web_lg model, which takes around 750MB of RAM. That matters if you're deploying in a constrained environment or handling high-throughput workloads.

Presidio vs regex is not really a competition. They solve different parts of the problem. Most production systems use both: regex for structured patterns, NER for everything else.
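When regex and NER run side by side, their findings can overlap, and redacting by offset only works if the spans are disjoint. The sketch below assumes findings shaped like the dicts returned by detect_pii_regex and detect_pii_ner (type, value, start, end). The helpers merge_findings, redact_spans, and span are hypothetical, and the overlap policy (earliest start wins, longer span wins on a tie) is one simple choice among several.

```python
def merge_findings(*finding_lists):
    """Combine findings from several detectors and drop overlapping spans."""
    candidates = [f for findings in finding_lists for f in findings]
    # Earliest start first; for equal starts, the longer span comes first.
    candidates.sort(key=lambda f: (f["start"], -(f["end"] - f["start"])))
    merged = []
    for finding in candidates:
        if merged and finding["start"] < merged[-1]["end"]:
            continue  # overlaps a span that was already kept
        merged.append(finding)
    return merged


def redact_spans(text, findings):
    """Replace each span with a [TYPE] tag, working right to left."""
    # Going right to left keeps the offsets of earlier spans valid.
    for f in sorted(findings, key=lambda f: f["start"], reverse=True):
        text = text[: f["start"]] + f"[{f['type'].upper()}]" + text[f["end"]:]
    return text


def span(text, value, label):
    """Build a finding dict for the first occurrence of `value` (demo helper)."""
    start = text.index(value)
    return {"type": label, "value": value, "start": start, "end": start + len(value)}


text = "Call John Smith at john@example.com."

# Hand-written stand-ins for detector output, so the sketch runs without spaCy.
regex_findings = [span(text, "john@example.com", "email")]
ner_findings = [
    span(text, "John Smith", "PERSON"),
    span(text, "john", "PERSON"),  # a plausible NER hit inside the email
]

merged = merge_findings(regex_findings, ner_findings)
print(redact_spans(text, merged))  # Call [PERSON] at [EMAIL].
```

In a real pipeline the two lists would come from the regex and NER detectors directly, and a policy that weighs detector confidence may suit better than span length.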

Approach 3: Middleware for PII Detection

Both regex and NER require you to own the detection pipeline. You handle model versioning, false positive tuning, entity type coverage, and ongoing maintenance. For teams that want to detect PII in text without building and operating that stack, a middleware layer is the practical alternative.

The pattern is simple: your application sends text to a redaction endpoint before forwarding it to any downstream service. The middleware strips or replaces PII and returns clean text. You get detection plus redaction in one step, with no models to manage.

Here is what that looks like with curl:

curl -X POST https://veil-api.com/v1/redact \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Please contact Sarah at sarah@example.com or 212-555-0198.",
    "entities": ["EMAIL", "PHONE", "PERSON"]
  }'

And the equivalent in Python using requests:

import requests

response = requests.post(
    "https://veil-api.com/v1/redact",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "text": "Please contact Sarah at sarah@example.com or 212-555-0198.",
        "entities": ["EMAIL", "PHONE", "PERSON"],
    },
    timeout=10,  # seconds; without this, requests can wait indefinitely
)
response.raise_for_status()  # fail loudly rather than parse an error body

data = response.json()
print(data["redacted_text"])
# Please contact [PERSON] at [EMAIL] or [PHONE].

The advantage here is that you don't run spaCy in your application process, you don't manage model updates, and you get a consistent API contract regardless of what entity types you need to detect.

The consideration is that your text does leave your process, so you need to trust the middleware provider and understand their data handling guarantees. Veil, for example, does not log or store the text it processes.
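Because redaction now depends on a network call, it is also worth deciding what happens when that call fails. A minimal sketch of a fail-closed wrapper follows: if the redaction request errors out or returns an unexpected body, the raw text is never passed on. The endpoint, request body, and redacted_text field are taken from the example above; RedactionError, http_post_json, redact_or_fail, and the injectable post parameter are hypothetical names, and urllib is used in place of requests only to keep the sketch dependency-free.

```python
import json
import urllib.request

REDACT_URL = "https://veil-api.com/v1/redact"  # endpoint from the example above


class RedactionError(RuntimeError):
    """Raised when text could not be redacted and must not be forwarded."""


def http_post_json(url, payload, api_key, timeout=5.0):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def redact_or_fail(text, api_key, post=http_post_json):
    """Return redacted text, or raise RedactionError instead of leaking input."""
    payload = {"text": text, "entities": ["EMAIL", "PHONE", "PERSON"]}
    try:
        data = post(REDACT_URL, payload, api_key)
        return data["redacted_text"]
    except (OSError, ValueError, KeyError, TypeError) as exc:
        # OSError covers network errors and timeouts, ValueError covers bad
        # JSON, KeyError and TypeError cover an unexpected response shape.
        raise RedactionError("redaction failed; refusing to forward raw text") from exc
```

Callers would then forward only the return value of redact_or_fail to the downstream LLM and treat RedactionError as a reason to stop the request. Passing a stub as post makes the failure path easy to exercise in tests.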

Comparison: Which Approach Should You Use?

| Approach | Structured PII | Unstructured PII | Latency | Maintenance |
| --- | --- | --- | --- | --- |
| Regex | Good | Poor | Very low | Low |
| spaCy NER | Moderate | Good | Medium | High |
| Presidio | Good | Good | Medium | High |
| Middleware API | Good | Good | Network RTT | None |

For most teams building LLM-powered applications, the practical recommendation is: use regex as a first pass for high-confidence structured patterns