How to Audit Logs for PII
Grep, Elasticsearch, Automated Scanning
Find what's already in your logs before a regulator does — practical grep one-liners, Kibana queries, and a Python scanner you can run in CI.
Auditing logs for PII means actively searching existing log archives for personal data that shouldn't be there — before a breach, a GDPR audit, or a compliance review forces the issue. Most teams only think about this after an incident. This guide gives you the tools to do it proactively.
Where PII Enters Logs
PII rarely gets logged deliberately. The usual routes in are request/response logging that captures full URLs and bodies, exception handlers that dump method arguments, debug-level statements left enabled in production, and third-party middleware that logs headers — including auth tokens — by default.
Grep One-Liners
Start here — run these against rotated log archives, not just today's live log. Use -r to scan a directory recursively and -l to list matching files rather than print every match. Patterns that rely on Perl-style syntax (\s, \d, lookaheads, non-capturing groups) use grep -P, which requires GNU grep.
# Email addresses
grep -rE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' /var/log/app/ \
--include='*.log' -l
# Auth tokens in headers or query strings (needs grep -P for \s)
grep -rP '(Bearer|token|api[_-]?key)[=:\s][A-Za-z0-9._\-]{20,}' /var/log/app/ -l
# Visa/Mastercard credit card numbers (basic pattern; needs grep -P for (?:…))
grep -rP '\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})\b' /var/log/app/ -l
# UK National Insurance numbers
grep -rE '\b[A-Z]{2}[0-9]{6}[A-D]\b' /var/log/app/ -l
# US Social Security Numbers
grep -rE '\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b' /var/log/app/ -l
# Phone numbers (loose international pattern; needs grep -P for the lookahead)
grep -rP '\+?[\d\s\-().]{10,15}(?=\s|$|")' /var/log/app/ -l
# Passwords in auth error messages (adjust for your log format; needs grep -P for \s)
grep -rP '(password|passwd|pwd)[=:\s]["'"'"']?[^\s"'"'"']{6,}' /var/log/app/ -l
# Count matches by file
grep -rcE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' /var/log/app/ \
| grep -v ':0$' | sort -t: -k2 -rn | head -20
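The one-liners above can be wrapped into a single per-pattern report. A minimal sketch — the directory default and the pattern set here are illustrative assumptions, not a standard tool:

```shell
#!/bin/sh
# Sketch: run several PII regexes over a log directory and report hit counts.
# LOG_DIR defaults to /var/log/app; pass a different directory as $1.
LOG_DIR="${1:-/var/log/app}"

scan() {
  name="$1"; pattern="$2"
  # -o prints each match on its own line, -h drops filenames, so wc -l counts matches
  count=$(grep -rEho "$pattern" "$LOG_DIR" --include='*.log' 2>/dev/null | wc -l)
  echo "$name: $count match(es)"
}

scan "email"  '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
scan "ssn"    '\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b'
scan "bearer" 'Bearer [A-Za-z0-9._-]{20,}'
```

Counting matches rather than listing them keeps the report itself free of PII, which matters if it ends up in CI output.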
Elasticsearch / Kibana Queries
If your logs are in an ELK stack, use KQL in Kibana Discover or the Elasticsearch query API:
## KQL in Kibana Discover
# Email-shaped strings anywhere in the message field
message: *@*.*
# Specific field that should never contain email
error.message: *@*.*
# SQL errors (often contain query parameters)
message: "ERROR" AND message: "syntax" AND message: *@*.*
## Elasticsearch Query DSL (via API)
POST /logs-*/_search
{
  "query": {
    "query_string": {
      "query": "*@*.*",
      "default_field": "message"
    }
  },
  "size": 100,
  "_source": ["@timestamp", "message", "service"]
}
# Count email-shaped matches across all log indices (@ is URL-encoded as %40)
GET /logs-*/_count?q=message:*%40*.*
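The wildcard query is deliberately loose — "message: *@*.*" also matches things like version strings or cache keys containing "@". One way to cut the noise is to re-check each returned document client-side with a strict regex. A sketch, assuming the standard search-response shape (a "hits.hits" list of documents with "_source" and "_id"):

```python
import re

# Strict email regex; the KQL/query_string wildcard over-matches,
# so confirm each returned document before reporting it.
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def confirm_email_hits(hits):
    """Keep only hits whose `message` field truly contains an email.

    `hits` is the `hits.hits` list from an Elasticsearch search response;
    the field name "message" matches the queries above.
    """
    confirmed = []
    for hit in hits:
        message = hit.get("_source", {}).get("message", "")
        match = EMAIL_RE.search(message)
        if match:
            confirmed.append({"id": hit.get("_id"), "email": match.group(0)})
    return confirmed
```

Feed it the response from the _search call above; wildcard false positives such as "v1@2.x" fail the strict regex and drop out.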
Python Automated Scanner
Run this in CI against a sample of production logs — flag any matches and fail the build or send an alert:
import re, sys, json
from pathlib import Path

PATTERNS = {
    "email": re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'),
    "bearer_token": re.compile(r'Bearer\s+[A-Za-z0-9._\-]{20,}'),
    "visa_card": re.compile(r'\b4[0-9]{12}(?:[0-9]{3})?\b'),
    "ssn": re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
    "api_key": re.compile(r'(api[_-]?key|secret)[=:\s][A-Za-z0-9]{16,}', re.I),
}

def scan_file(path: Path) -> list[dict]:
    findings = []
    with path.open("r", errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            for pii_type, pattern in PATTERNS.items():
                if pattern.search(line):
                    findings.append({
                        "file": str(path),
                        "line": lineno,
                        "type": pii_type,
                        "preview": line[:120].strip(),
                    })
    return findings

if __name__ == "__main__":
    log_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("./logs")
    all_hits = []
    for log_file in log_dir.rglob("*.log"):
        all_hits.extend(scan_file(log_file))
    if all_hits:
        print(json.dumps(all_hits, indent=2))
        print(f"\n{len(all_hits)} potential PII matches found.", file=sys.stderr)
        sys.exit(1)  # non-zero exit fails CI
    else:
        print("No PII patterns detected.")
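One caveat: the scanner's "preview" field copies the raw matched line into its JSON report, which just moves the PII into your CI logs. A small helper — the function name and parameters here are my own, not part of the scanner above — can mask the matched span before it is recorded:

```python
import re

def masked_preview(line, match, keep=2, width=120):
    """Replace a regex match in `line` with a partially-masked form.

    Keeps the first `keep` characters of the match so a finding stays
    identifiable without reproducing the PII in the report itself.
    """
    start, end = match.span()
    token = match.group(0)
    masked = token[:keep] + "*" * (len(token) - keep)
    return (line[:start] + masked + line[end:])[:width].strip()
```

To wire it in, capture the match object in scan_file (m = pattern.search(line)) and build the preview with masked_preview(line, m) instead of slicing the raw line.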
Audit Checklist
- Grep rotated log archives, not just today's live log.
- Run the same patterns against your log aggregator (ELK or equivalent).
- Add the automated scanner to CI against a sample of production logs.
- For each finding, fix the logging call that leaked the data — deleting the log line alone doesn't stop the next one.
- Redact PII before sharing any log file outside the team.
Redact PII Before Sharing Logs
Found PII in a log file you need to share? The Log Sanitizer strips emails, tokens, IPs, and card numbers entirely in your browser — nothing is uploaded.