How to Redact PII in Python
Regex, Presidio and Logging Filters
Three practical tiers for removing personal data from Python applications — from a fast compiled regex pipeline to NLP-based detection to logging infrastructure that redacts before anything is written.
The fastest way to redact PII in Python is a compiled regex pipeline that replaces emails, IPs, phone numbers, bearer tokens and card numbers as each string passes through. For free-text fields where PII has no predictable format, Microsoft Presidio adds NLP-based detection that catches the names and addresses that patterns miss.
The Three-Tier Approach
No single technique covers all PII. The right architecture layers three complementary approaches depending on where the data originates and how structured it is.
Tier 1: Regex. Fast, zero dependencies. Handles predictable formats: emails, IPv4, phone numbers, Bearer tokens, JWTs, credit card patterns.
Best for: structured log fields, API responses, config values
Tier 2: Presidio. NLP-based entity recognition. Detects names, addresses and free-text PII that regex cannot match without context.
Best for: user-supplied text, form submissions, free-text fields
Tier 3: logging.Filter. Intercepts log records before a handler writes them. Redacts at the logging layer, no matter where in the codebase the log call originates.
Best for: third-party library output, Django/SQLAlchemy logging
Tier 1 — Regex Redaction
Regex is the right first layer. Compile all patterns once, when the class is defined, and never inside the method that runs per log line. Then apply them in sequence, one substitution per pattern. The class below covers six common PII types found in Python application logs.
import re
import json
from typing import Any


class PiiRedactor:
    """Compiled regex pipeline for common PII types."""

    # Order matters: card numbers are matched before phone numbers so that a
    # long digit run is not partially consumed by the phone pattern.
    _PATTERNS = [
        # Email addresses (also catches URL-encoded %40 variants)
        (re.compile(
            r'[a-zA-Z0-9._%+\-]+(?:%40|@)[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}',
            re.IGNORECASE
        ), '[EMAIL_REDACTED]'),
        # IPv4 addresses
        (re.compile(
            r'\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}'
            r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b'
        ), '[IP_REDACTED]'),
        # Credit card numbers (14 to 16 digits, optional separators)
        (re.compile(
            r'\b(?:\d[ \-]?){13,15}\d\b'
        ), '[CARD_REDACTED]'),
        # US phone numbers: (555) 867-5309, 555-867-5309, +15558675309
        # The lookbehind stops a match from starting inside a longer number.
        (re.compile(
            r'(?<!\d)(?:\+?1[\s.\-]?)?'
            r'(?:\(?\d{3}\)?[\s.\-]?)'
            r'\d{3}[\s.\-]?\d{4}\b'
        ), '[PHONE_REDACTED]'),
        # Bearer / Authorization tokens
        (re.compile(
            r'(?i)(Bearer\s+)[A-Za-z0-9\-._~+/]+=*'
        ), r'\1[TOKEN_REDACTED]'),
        # JWTs (three base64url segments separated by dots)
        (re.compile(
            r'eyJ[A-Za-z0-9\-_]+\.[A-Za-z0-9\-_]+\.[A-Za-z0-9\-_]*'
        ), '[JWT_REDACTED]'),
    ]

    def redact(self, text: str) -> str:
        """Apply every pattern in sequence to a single string."""
        if not isinstance(text, str):
            return text
        for pattern, replacement in self._PATTERNS:
            text = pattern.sub(replacement, text)
        return text

    def redact_dict(self, obj: Any) -> Any:
        """Recursively redact string values inside dicts, lists and JSON strings."""
        if isinstance(obj, str):
            # Attempt to parse embedded JSON strings
            try:
                parsed = json.loads(obj)
            except (json.JSONDecodeError, TypeError):
                return self.redact(obj)
            # Only containers count as embedded JSON. A bare scalar such as
            # "4111111111111111" parses as an int and must go through redact().
            if isinstance(parsed, (dict, list)):
                return json.dumps(self.redact_dict(parsed))
            return self.redact(obj)
        if isinstance(obj, dict):
            return {k: self.redact_dict(v) for k, v in obj.items()}
        if isinstance(obj, list):
            return [self.redact_dict(item) for item in obj]
        return obj


# Module-level singleton: import and reuse across the codebase
redactor = PiiRedactor()
Using the redactor:
from pii_redactor import redactor
# Plain string
clean = redactor.redact("User john@example.com signed in from 203.0.113.42")
# "User [EMAIL_REDACTED] signed in from [IP_REDACTED]"
# Nested dict (e.g. a parsed request body)
payload = {
"user": "jane@corp.com",
"ip": "198.51.100.7",
"meta": {"phone": "555-867-5309", "card": "4111 1111 1111 1111"}
}
clean_payload = redactor.redact_dict(payload)
# {"user": "[EMAIL_REDACTED]", "ip": "[IP_REDACTED]",
# "meta": {"phone": "[PHONE_REDACTED]", "card": "[CARD_REDACTED]"}}
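The card pattern in Tier 1 matches any run of 14 to 16 digits, so order IDs or tracking numbers of that length would be masked too. One possible refinement, not part of the original class, is to redact a candidate only when it passes the Luhn checksum that payment card numbers satisfy. The sketch below is a standalone illustration; `luhn_valid` and `redact_cards` are hypothetical helper names.

```python
import re

# Same shape as the Tier 1 card pattern: 14 to 16 digits, optional separators.
CARD_CANDIDATE = re.compile(r'\b(?:\d[ \-]?){13,15}\d\b')


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def redact_cards(text: str) -> str:
    """Replace only those card-shaped digit runs that pass the Luhn check."""
    def _replace(match: re.Match) -> str:
        found = match.group()
        return '[CARD_REDACTED]' if luhn_valid(found) else found
    return CARD_CANDIDATE.sub(_replace, text)


print(redact_cards("Charged card 4111 1111 1111 1111"))
print(redact_cards("Order 1234 5678 9012 3456 shipped"))
```

The trade-off is that a mistyped card number fails the checksum and stays in the text, so this stricter variant suits cases where false positives cost more than an occasional miss.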
Tier 2 — Microsoft Presidio
Regex fails on names, addresses and any PII without a structural pattern. Presidio uses spaCy's NLP pipeline under the hood to understand context — so it recognises "John Smith" as a person entity even mid-sentence. Install both packages:
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
# Entity types Presidio supports out of the box
ENTITIES = [
"PERSON",
"EMAIL_ADDRESS",
"PHONE_NUMBER",
"CREDIT_CARD",
"IP_ADDRESS",
"US_SSN",
"LOCATION",
]
def presidio_redact(text: str) -> str:
    results = analyzer.analyze(text=text, entities=ENTITIES, language="en")
    if not results:
        return text
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
    return anonymized.text

# Example
raw = "Dr. Sarah Connor, SSN 078-05-1120, lives at 742 Evergreen Terrace."
print(presidio_redact(raw))
# Illustrative output: "Dr. <PERSON>, SSN <US_SSN>, lives at <LOCATION>."
# The exact spans depend on the spaCy model and the Presidio version, and some
# recognizers reject well-known sample values, so verify against your own data.
| Approach | Speed | Dependencies | Best For |
|---|---|---|---|
| Regex | Very fast (~1M lines/s) | None (stdlib only) | Emails, IPs, tokens, credit cards |
| Presidio | Slower (~100–500 lines/s) | spaCy plus the en_core_web_lg model (several hundred MB) | Names, addresses, free text, SSNs |
In production, run the regex pipeline first (fast, and rarely wrong on structured data), then pass only free-text fields through Presidio. Avoid running Presidio on every log line in a hot path.
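The ordering described above can be sketched as a small function that runs the cheap stage on every field and the expensive stage only on fields declared as free text. To keep the example runnable without Presidio installed, the NLP stage is passed in as a callable; `redact_record`, `regex_redact` and the field names are hypothetical, and the single email pattern stands in for the full Tier 1 pipeline.

```python
import re
from typing import Callable, Dict, Iterable

EMAIL = re.compile(r'[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}')


def regex_redact(text: str) -> str:
    """Stand-in for the Tier 1 pipeline: one cheap pattern for illustration."""
    return EMAIL.sub('[EMAIL_REDACTED]', text)


def redact_record(
    record: Dict[str, str],
    free_text_fields: Iterable[str],
    nlp_redact: Callable[[str], str],
) -> Dict[str, str]:
    """Run the regex stage on every field and the NLP stage on free text only."""
    free_text = set(free_text_fields)
    cleaned = {}
    for key, value in record.items():
        value = regex_redact(value)
        if key in free_text:
            value = nlp_redact(value)  # slow path, taken only where needed
        cleaned[key] = value
    return cleaned
```

In a real service the `nlp_redact` argument would be something like the `presidio_redact` function from Tier 2, while structured fields never pay the NLP cost.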
Tier 3 — logging.Filter
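The previous two tiers require you to call the redactor explicitly. A logging.Filter removes that requirement: attached to a handler, it processes every log record that handler is about to write, regardless of where in the codebase the logger.info() call came from. This is a dependable way to redact output from third-party libraries you don't control. Note that a filter attached to a logger, including the root logger, only applies to records logged directly through that logger, not to records propagated up from child loggers.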
import logging
from pii_redactor import redactor


class LogRedactor(logging.Filter):
    """
    Redacts PII from log records before a handler writes them.
    Attach to every handler to cover all output, including records
    propagated from third-party loggers. A filter attached to a logger
    only sees records logged directly through that logger.
    """

    def filter(self, record: logging.LogRecord) -> bool:
        # Redact the message template (args are merged in later, at format time)
        record.msg = redactor.redact(str(record.msg))
        # Redact any args attached to the record
        if record.args:
            if isinstance(record.args, dict):
                record.args = redactor.redact_dict(record.args)
            else:
                # Strings and containers are redacted; ints and floats are
                # left intact so %d and %f placeholders still format.
                record.args = tuple(
                    redactor.redact_dict(a) if isinstance(a, (str, dict, list)) else a
                    for a in record.args
                )
        return True  # always allow the record through after redaction


# --- Attach to every handler on the root logger (covers propagated records) ---
logging.basicConfig(level=logging.INFO)  # ensures the root logger has a handler
for handler in logging.getLogger().handlers:
    handler.addFilter(LogRedactor())

# --- Or attach to a specific handler only ---
file_handler = logging.FileHandler("app.log")
file_handler.addFilter(LogRedactor())
app_logger = logging.getLogger("myapp")
app_logger.addHandler(file_handler)
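Attaching the filter to every handler on the root logger is the safest default for production services, because records from child loggers propagate to those handlers. Attaching to one specific handler lets you keep unredacted output on a local debug stream while ensuring the file or remote handler never receives raw PII. Because the filter modifies the record in place, any handler that runs after the filtered one also sees the redacted text, so handler order matters in that setup.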
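Where a filter is attached determines which records it sees, and that is worth checking in isolation. Per the logging module documentation, a filter on a logger is consulted only for records created through that logger, while a filter on a handler is consulted for every record the handler receives. The sketch below uses hypothetical names (`EmailFilter`, `ListHandler`, `emit_from_child`) and a single email pattern as a stand-in for LogRedactor.

```python
import logging
import re

EMAIL = re.compile(r'[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}')


class EmailFilter(logging.Filter):
    """Minimal stand-in for LogRedactor: masks emails in the message template."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL.sub('[EMAIL_REDACTED]', str(record.msg))
        return True


class ListHandler(logging.Handler):
    """Collects rendered messages in memory so the result can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def emit_from_child(filter_on: str) -> str:
    """Log through a child logger with the filter on the parent logger or its handler."""
    parent = logging.getLogger(f"demo_{filter_on}")
    parent.setLevel(logging.INFO)
    parent.propagate = False  # keep the demo isolated from the real root logger
    handler = ListHandler()
    parent.addHandler(handler)
    if filter_on == "logger":
        parent.addFilter(EmailFilter())
    else:
        handler.addFilter(EmailFilter())
    # A third-party library logs through its own named logger, like this child.
    logging.getLogger(f"demo_{filter_on}.library").info("user alice@example.com")
    return handler.messages[0]


print(emit_from_child("logger"))   # filter on the parent logger
print(emit_from_child("handler"))  # filter on the parent's handler
```

Only the handler-level variant masks the address, which is why the filter belongs on handlers when the goal is to cover library output.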
Combining All Three in a Production Pipeline
Here is how the three tiers compose into a single application setup:
import logging
from pii_redactor import redactor
from presidio_redactor import presidio_redact
from log_redactor_filter import LogRedactor

# 1. Configure logging, then attach the Tier 3 filter to every root handler at startup.
#    Without a level of INFO or lower, logging.info() below would emit nothing.
logging.basicConfig(level=logging.INFO)
for handler in logging.getLogger().handlers:
    handler.addFilter(LogRedactor())

# 2. Use Tier 1 for structured log fields and API payloads
def log_request(payload: dict) -> None:
    safe = redactor.redact_dict(payload)
    logging.info("Incoming request: %s", safe)

# 3. Use Tier 2 for user-supplied free text before storing or logging it
def process_user_note(note: str) -> str:
    return presidio_redact(note)

# Usage
log_request({"email": "user@example.com", "ip": "203.0.113.5", "body": "hello"})
# Log message: Incoming request: {'email': '[EMAIL_REDACTED]',
# 'ip': '[IP_REDACTED]', 'body': 'hello'}
clean_note = process_user_note("Please contact Jane Doe at 555-123-4567.")
# Illustrative result: "Please contact <PERSON> at <PHONE_NUMBER>."
Testing Your Redactor
Redaction code must be tested. Two classes of assertions matter: PII is gone, and the non-PII structure is preserved (timestamps, log levels, message context all intact).
import json

import pytest
from pii_redactor import PiiRedactor


@pytest.fixture
def r():
    return PiiRedactor()


# --- PII is removed ---
def test_email_redacted(r):
    assert "[EMAIL_REDACTED]" in r.redact("Contact alice@example.com for help")
    assert "alice@example.com" not in r.redact("Contact alice@example.com for help")


def test_url_encoded_email_redacted(r):
    # %40 variant must also be caught
    assert "[EMAIL_REDACTED]" in r.redact("user%40example.com")


def test_ip_redacted(r):
    assert "[IP_REDACTED]" in r.redact("Request from 192.168.0.1 at 12:00")
    assert "192.168.0.1" not in r.redact("Request from 192.168.0.1 at 12:00")


def test_phone_redacted(r):
    assert "[PHONE_REDACTED]" in r.redact("Call us at (555) 867-5309 anytime")


def test_bearer_token_redacted(r):
    raw = "Authorization: Bearer eyABC123def456"
    result = r.redact(raw)
    assert "eyABC123def456" not in result
    assert "Bearer" in result  # label preserved, value gone


def test_jwt_redacted(r):
    jwt = "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJ1c2VyIn0.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c"
    assert "[JWT_REDACTED]" in r.redact(jwt)


def test_credit_card_redacted(r):
    assert "[CARD_REDACTED]" in r.redact("Charged card 4111 1111 1111 1111")


# --- Structure is preserved ---
def test_log_level_preserved(r):
    line = "2026-05-02 INFO User alice@corp.com logged in"
    result = r.redact(line)
    assert "2026-05-02" in result
    assert "INFO" in result
    assert "logged in" in result


def test_nested_dict_redacted(r):
    d = {"email": "x@y.com", "nested": {"ip": "1.2.3.4"}}
    clean = r.redact_dict(d)
    assert clean["email"] == "[EMAIL_REDACTED]"
    assert clean["nested"]["ip"] == "[IP_REDACTED]"


def test_json_string_in_dict_redacted(r):
    payload = json.dumps({"user": "bob@example.org"})
    result = r.redact_dict(payload)
    parsed = json.loads(result)
    assert parsed["user"] == "[EMAIL_REDACTED]"
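The tests above exercise the redactor class only. The logging filter can be tested end to end in a similar way by capturing what a filtered handler would actually write. The following is a minimal sketch using only the standard library; `ArgsAwareFilter`, `CapturingHandler` and `capture` are hypothetical names, and the single email pattern stands in for the real redactor.

```python
import logging
import re
from typing import Callable

EMAIL = re.compile(r'[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}')


class ArgsAwareFilter(logging.Filter):
    """Simplified stand-in for LogRedactor covering the template and string args."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL.sub('[EMAIL_REDACTED]', str(record.msg))
        if isinstance(record.args, tuple):
            record.args = tuple(
                EMAIL.sub('[EMAIL_REDACTED]', a) if isinstance(a, str) else a
                for a in record.args
            )
        return True


class CapturingHandler(logging.Handler):
    """Stores each formatted line instead of writing it anywhere."""

    def __init__(self) -> None:
        super().__init__()
        self.lines = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(self.format(record))


def capture(log_call: Callable[[logging.Logger], None]) -> str:
    """Run log_call against a filtered handler and return the line it wrote."""
    logger = logging.getLogger("redaction_test")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = CapturingHandler()
    handler.addFilter(ArgsAwareFilter())
    logger.addHandler(handler)
    try:
        log_call(logger)
    finally:
        logger.removeHandler(handler)
    return handler.lines[0]


def test_template_and_args_are_redacted():
    line = capture(lambda log: log.info("login %s attempt %d", "bob@example.org", 3))
    assert line == "login [EMAIL_REDACTED] attempt 3"


def test_plain_message_is_redacted():
    line = capture(lambda log: log.warning("reset for carol@example.org"))
    assert "carol@example.org" not in line
```

Asserting on the formatted line rather than on the record fields checks what would reach the sink, which is the property that matters.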
Common Mistakes
Redacting a copy of a string does nothing if the logger still receives the original: handlers and filters work from the raw record.msg and record.args, not from a value you cleaned elsewhere. Redact the value before constructing the log message, and also attach a filter as a safety net.
logger.info("request: %s", payload) keeps the dict in record.args and only renders it to a string when a handler formats the record. A simple str.replace() on the message template therefore never sees the nested values. Use redact_dict() before passing to the logger, and ensure your Filter also handles record.args.
Emails in query strings or URL-encoded request bodies appear as user%40example.com. A pattern that only matches @ will miss these. The regex in Tier 1 above handles both variants explicitly.
Redacting what appears in the log line is not enough if the same data is also written to a structured JSON field, a metrics label or a database audit column. Apply redaction to every sink: logs, metrics, event tracking and storage.
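For the URL-encoded case mentioned above, an alternative to teaching every pattern about percent-escapes is to decode the value first and then apply ordinary patterns. The sketch below assumes the input is a query string or form body; `redact_query_string` is a hypothetical helper and the email pattern is a simplified stand-in for the Tier 1 pipeline.

```python
import re
from urllib.parse import unquote

EMAIL = re.compile(r'[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}')


def redact_query_string(raw: str) -> str:
    """Percent-decode a query string, then mask email addresses in it."""
    decoded = unquote(raw)  # turns %40 into @, %2F into /, and so on
    return EMAIL.sub('[EMAIL_REDACTED]', decoded)


print(redact_query_string("email=user%40example.com&next=%2Fhome"))
```

Note that the returned string is the decoded form, so this fits log output better than values that must be forwarded unchanged. Doubly encoded input (for example %2540) would need a second decoding pass.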
FAQ
When should I use regex vs Microsoft Presidio for PII redaction in Python?
Use regex when PII appears in a predictable machine-readable format: email addresses, IP addresses, phone numbers, bearer tokens, JWTs. Regex is fast enough for high-volume log streams and has zero dependencies. Use Presidio when PII appears in free text written by humans (names, physical addresses, medical descriptions), where there is no reliable structural pattern to match against.
What PII does regex miss that Presidio catches?
Regex cannot reliably detect person names (too many false positives), street addresses in varied formats, organisation names, or medical terms that are contextually sensitive. Presidio uses a spaCy NLP pipeline to understand context, so it can identify "John Smith" as a PERSON entity even when it appears inline in a sentence without any structural marker.
How do I redact PII from nested JSON or dict fields in Python logs?
Implement a recursive redact_dict() method that walks dict values, applying your regex redactor to every string value and recursing into nested dicts and lists. For JSON objects or arrays serialised as strings inside log fields, detect them with a json.loads() try/except, redact the parsed object, then re-serialise. This reaches PII that a redactor working only on top-level fields would never visit.
When should I use a logging.Filter instead of redacting at the call site?
Use logging.Filter when you cannot control every logger.info() or logger.error() call, for example in third-party library code, Django request logging, or SQLAlchemy query logging. Attach the filter to each handler rather than to the root logger itself: logger-level filters are not applied to records propagated from child loggers, whereas a handler-level filter sees every record before that handler writes it to a file, stream or remote sink, regardless of where in the codebase the log call originated.