Log Security Guide

How to Audit Logs for PII
Grep, Elasticsearch, Automated Scanning

Find what's already in your logs before a regulator does — practical grep one-liners, Kibana queries, and a Python scanner you can run in CI.

9 min read·Updated May 2026

Auditing logs for PII means actively searching existing log archives for personal data that shouldn't be there — before a breach, a GDPR audit, or a compliance review forces the issue. Most teams only think about this after an incident. This guide gives you the tools to do it proactively.

Where PII Enters Logs

Auth error logging
Failed login attempts logged as: "Invalid credentials for user@example.com" — the email goes straight into the log.
SQL error messages
ORMs and query builders often echo the full query in error messages, including WHERE email = 'user@example.com' or bound parameters.
Request body logging
DEBUG-level middleware that logs the full request body captures registration forms, profile updates, and payment fields.
URL parameters
Password reset links (/reset?token=abc&email=user@example.com), email verification links, and OAuth redirects carry PII in the URL, and access logs record the full URL.
Stack traces
Exception messages that include user input — "Invalid date: 1980-13-45 for user John Smith" — propagate PII up the stack.
Third-party SDK logs
Stripe, Twilio, and analytics SDKs often emit verbose logs at DEBUG level that include phone numbers, emails, and addresses.
GraphQL query logging
Logging full GraphQL queries exposes field arguments: query { user(email: "x@x.com") { name } }.
Webhook payloads
Stripe, GitHub, and other providers send full event payloads; logging them verbatim captures any card details, billing addresses, or other PII they contain.
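
The first entry in the list, auth error logging, can be reproduced and checked in a unit test. The sketch below captures what a handler writes to the log and searches it with the same email pattern used later in this guide. The logger name `pii_demo`, both handler functions, and the `user_id` value are hypothetical stand-ins, not part of any real service.

```python
import io
import logging
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def capture_log_output(log_call) -> str:
    """Run log_call(logger) and return everything it wrote to the log."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    logger = logging.getLogger("pii_demo")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep demo output out of the root logger
    logger.addHandler(handler)
    try:
        log_call(logger)
    finally:
        logger.removeHandler(handler)
    return buffer.getvalue()


def leaky_login_failure(logger, email: str) -> None:
    # The pattern from the list above: the submitted email lands in the log.
    logger.warning("Invalid credentials for %s", email)


def safer_login_failure(logger, user_id: int) -> None:
    # Logs an opaque internal ID instead (hypothetical field for illustration).
    logger.warning("Invalid credentials for user_id=%s", user_id)


leaky = capture_log_output(lambda lg: leaky_login_failure(lg, "user@example.com"))
safer = capture_log_output(lambda lg: safer_login_failure(lg, 4821))

print(bool(EMAIL_RE.search(leaky)))  # True: the email reached the log
print(bool(EMAIL_RE.search(safer)))  # False
```

The same capture-and-search approach could be pointed at any of the other sources in the list, as long as the code path can be triggered from a test.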

Grep One-Liners

Start here — run against rotated log archives, not just today's live log. Use -r to scan a directory recursively and -l to list files rather than print every match.

# Email addresses
grep -rE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' /var/log/app/ \
  --include='*.log' -l

# Auth tokens in headers or query strings
grep -rE '(Bearer|token|api[_-]?key)[=:[:space:]][A-Za-z0-9._-]{20,}' /var/log/app/ -l

# Visa/Mastercard credit card numbers (basic pattern)
grep -rE '\b(4[0-9]{12}([0-9]{3})?|5[1-5][0-9]{14})\b' /var/log/app/ -l

# UK National Insurance numbers
grep -rE '\b[A-Z]{2}[0-9]{6}[A-D]\b' /var/log/app/ -l

# US Social Security Numbers
grep -rE '\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b' /var/log/app/ -l

# Phone numbers (loose international pattern; needs GNU grep -P for \d and lookahead)
grep -rP '\+?[\d\s\-().]{10,15}(?=\s|$|")' /var/log/app/ -l

# Passwords in auth error messages (adjust for your log format)
grep -rE '(password|passwd|pwd)[=:[:space:]]["'"'"']?[^[:space:]"'"'"']{6,}' /var/log/app/ -l

# Count matches by file
grep -rcE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' /var/log/app/ \
  | grep -v ':0$' | sort -t: -k2 -rn | head -20
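
Rotated archives are often gzip-compressed, and plain grep -r reads the compressed bytes rather than the log text. As a rough sketch of the same per-file count for both plain and .gz files, assuming rotated files keep ".log" somewhere in their name (app.log, app.log.1, app.log.2.gz), something like the following could work. The function names are illustrative.

```python
import gzip
import re
from collections import Counter
from pathlib import Path

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def open_log(path: Path):
    """Open a plain or gzip-compressed log file as text."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt", errors="replace")
    return path.open("r", errors="replace")


def count_email_lines(log_dir: Path) -> Counter:
    """Count lines containing an email-shaped string, per file."""
    counts = Counter()
    for path in log_dir.rglob("*"):
        # Assumption: rotated files keep ".log" in the name (app.log.2.gz).
        if not path.is_file() or ".log" not in path.name:
            continue
        with open_log(path) as f:
            hits = sum(1 for line in f if EMAIL_RE.search(line))
        if hits:
            counts[str(path)] = hits
    return counts
```

Calling `count_email_lines(Path("/var/log/app")).most_common(20)` gives the 20 files with the most matching lines, which mirrors the count-by-file one-liner above.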

Elasticsearch / Kibana Queries

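If your logs are in an ELK stack, use KQL in Kibana Discover or the Elasticsearch query API. Wildcards match against indexed terms, so on a text field analyzed with the standard analyzer (which splits email addresses at the @) these patterns may miss matches; if your mapping has a keyword sub-field such as message.keyword, query that instead: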

## KQL in Kibana Discover

# Email-shaped strings anywhere in the message field
message: *@*.*

# Specific field that should never contain email
error.message: *@*.*

# SQL errors (often contain query parameters)
message: "ERROR" AND message: "syntax" AND message: *@*.*

## Elasticsearch Query DSL (via API)
POST /logs-*/_search
{
  "query": {
    "query_string": {
      "query": "*@*.*",
      "default_field": "message"
    }
  },
  "size": 100,
  "_source": ["@timestamp", "message", "service"]
}

# Count email-shaped matches across all logs-* indices (run against one index at a time to compare PII density)
GET /logs-*/_count?q=message:*%40*.*
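
To compare match counts between indices in one request, one option is a search with `size: 0` and a terms aggregation on the `_index` metadata field. The sketch below builds that request body and parses the response. The `run_search` helper assumes an unauthenticated cluster reachable at a `base_url` you supply; a real cluster will usually need credentials and TLS settings that are not shown here.

```python
import json
import urllib.request


def build_density_query(field: str = "message", top_n: int = 20) -> dict:
    """Request body: count email-shaped matches, grouped by index."""
    return {
        "size": 0,  # no hits returned, only the aggregation
        "query": {"query_string": {"query": "*@*.*", "default_field": field}},
        "aggs": {"by_index": {"terms": {"field": "_index", "size": top_n}}},
    }


def matches_per_index(response: dict) -> list[tuple[str, int]]:
    """Extract (index name, matching doc count) pairs from the response."""
    buckets = response["aggregations"]["by_index"]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]


def run_search(base_url: str, index_pattern: str, body: dict) -> dict:
    """POST the body to _search. Assumes no auth (hypothetical setup)."""
    req = urllib.request.Request(
        f"{base_url}/{index_pattern}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A call such as `matches_per_index(run_search("http://localhost:9200", "logs-*", build_density_query()))` would then list the indices with the most matching documents first, since terms buckets are ordered by document count by default.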

Python Automated Scanner

Run this in CI against a sample of production logs — flag any matches and fail the build or send an alert:

import re, sys, json
from pathlib import Path

PATTERNS = {
    "email":        re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'),
    "bearer_token": re.compile(r'Bearer\s+[A-Za-z0-9._\-]{20,}'),
    "visa_card":    re.compile(r'\b4[0-9]{12}(?:[0-9]{3})?\b'),
    "ssn":          re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
    "api_key":      re.compile(r'(api[_-]?key|secret)[=:\s][A-Za-z0-9]{16,}', re.I),
}

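def redact(text: str) -> str:
    """Mask every pattern match so the report does not copy PII into CI output."""
    for pii_type, pattern in PATTERNS.items():
        text = pattern.sub(f"[{pii_type}]", text)
    return text

def scan_file(path: Path) -> list[dict]:
    """Return one finding per (line, PII type) match in the given log file."""
    findings = []
    with path.open("r", errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            for pii_type, pattern in PATTERNS.items():
                if pattern.search(line):
                    findings.append({
                        "file":    str(path),
                        "line":    lineno,
                        "type":    pii_type,
                        # Redact before truncating so a cut-off match cannot leak.
                        "preview": redact(line)[:120].strip(),
                    })
    return findings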

if __name__ == "__main__":
    log_dir  = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("./logs")
    all_hits = []
    for log_file in log_dir.rglob("*.log"):
        all_hits.extend(scan_file(log_file))

    if all_hits:
        print(json.dumps(all_hits, indent=2))
        print(f"\n{len(all_hits)} potential PII matches found.", file=sys.stderr)
        sys.exit(1)  # non-zero exit fails CI
    else:
        print("No PII patterns detected.")
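
The visa_card pattern matches any 13- or 16-digit number starting with 4, which includes plenty of order IDs and counters. One common way to cut false positives is a Luhn checksum on each candidate. This is a minimal sketch of that extra step, not part of the scanner above; `luhn_valid` and `find_card_numbers` are illustrative names.

```python
import re

CARD_RE = re.compile(r"\b4[0-9]{12}(?:[0-9]{3})?\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in number pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(line: str) -> list[str]:
    """Return card-shaped matches in line that also pass the Luhn check."""
    return [m.group(0) for m in CARD_RE.finditer(line) if luhn_valid(m.group(0))]


line = "order_id=4000000000000 card=4111111111111111"
print(find_card_numbers(line))  # ['4111111111111111']
```

A Luhn pass does not prove a number is a real card, and roughly one in ten random digit strings will pass, so treat it as a filter rather than confirmation.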

Audit Checklist

☐ Does your auth service log the submitted username on failure?
Check login error handlers — they often log "Invalid credentials for {email}" before returning 401.
☐ Does your ORM log full queries in dev mode?
Django DEBUG=True, SQLAlchemy echo=True, ActiveRecord logger — these log every query with bound parameters. Ensure dev config doesn't ship to staging/prod.
☐ Is request body logging enabled at any log level?
Search for middleware names: morgan (Node), django-request-logging (Python), Spring's CommonsRequestLoggingFilter. If present, verify it strips sensitive fields.
☐ Are URL parameters logged in access logs?
Nginx and Apache log full URLs including query strings. A /verify?email=x or /reset?token=y endpoint leaks PII in the access log.
☐ Does your Elasticsearch retention policy cover replicas and snapshots as well as indices?
An index lifecycle policy deletes old indices, but every document, PII included, is also copied to replica shards and into any snapshots taken before deletion. Check replica counts and make sure snapshot retention is no longer than index retention.
☐ Have you run the grep scan against log archives, not just live logs?
A bug fixed 6 months ago may have logged PII for months before the fix. Archive logs are often the largest exposure.
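
The access-log item in the checklist can be approximated with a small script. The sketch below assumes the common/combined log format, where the request appears as "METHOD /path?query HTTP/x.y", and reports only the names of sensitive query parameters so the audit output does not repeat their values. The `SENSITIVE_PARAMS` set is a hypothetical starting point to adjust for your own endpoints.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Hypothetical list of parameter names worth flagging; extend as needed.
SENSITIVE_PARAMS = {"email", "token", "password", "api_key"}

# Request portion of a common/combined-format access log line.
REQUEST_RE = re.compile(
    r'"(?:GET|POST|PUT|PATCH|DELETE|HEAD|OPTIONS) (\S+) HTTP/[0-9.]+"'
)


def sensitive_params_in_line(line: str) -> list[str]:
    """Return sorted names of sensitive query parameters found in the line."""
    match = REQUEST_RE.search(line)
    if not match:
        return []
    query = urlsplit(match.group(1)).query
    names = {name.lower() for name, _ in parse_qsl(query, keep_blank_values=True)}
    return sorted(names & SENSITIVE_PARAMS)


sample = (
    '203.0.113.7 - - [10/May/2026:12:00:00 +0000] '
    '"GET /reset?token=abc&email=user%40example.com HTTP/1.1" 200 512'
)
print(sensitive_params_in_line(sample))  # ['email', 'token']
```

Running this over each line of an Nginx or Apache access log and tallying the results by path gives a quick list of endpoints that put PII in the URL.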

Redact PII Before Sharing Logs

Found PII in a log file you need to share? The Log Sanitizer strips emails, tokens, IPs, and card numbers entirely in your browser — nothing is uploaded.
