Log Security Guide

What Is Log Aggregation — and Why PII Is a Risk

Centralising logs from dozens of services is how modern teams stay observable. It's also how personal data ends up indexed, replicated, and accessible to everyone with Kibana access.

8 min read · Updated May 2026

Log aggregation is the practice of collecting log output from every service, container, and server in your system and routing it to a single centralised store — so instead of SSH-ing into each machine and tailing files, you run one query and see everything. Tools like the ELK Stack, Splunk, Datadog, and Grafana Loki make this straightforward. The compliance problem is that logs travelling to those stores often carry user emails, session tokens, and payment data that were never meant to be indexed, retained for 90 days, and visible to the whole engineering team.

What Log Aggregation Actually Does

Before aggregation, the typical debugging experience is: something goes wrong in production, a developer SSHes into the affected server, runs tail -f /var/log/app/error.log, and tries to piece together what happened from one machine's perspective. In a microservices architecture with 20 services across 40 containers that auto-scale, this approach becomes impossible. By the time you've found the right container it may have already been destroyed.

Log aggregation solves this by installing a lightweight agent on every host (or as a sidecar in every container) that tails your log files or consumes from stdout and ships each log line to a central service. That central service indexes the data so you can search across all services simultaneously: show me every ERROR from the payment service in the last 10 minutes that also has a matching request ID from the auth service.

Beyond debugging, a central log store enables alerting (fire a PagerDuty notification when error rate crosses a threshold), dashboarding, SLA monitoring, security auditing, and capacity planning — all from a single query interface.

The critical implication for PII

Central indexing means every field in every log line is fully searchable. A user.email field that your application logs at DEBUG level isn't just written to a file on one server — it's indexed, replicated across cluster nodes, retained for your configured period, and queryable by any developer with dashboard access. That's a fundamentally different risk profile than a local file.

How the ELK Stack Works

The ELK Stack — Elasticsearch, Logstash, and Kibana — is the most widely deployed open-source log aggregation solution. Understanding its architecture makes the PII risk concrete.

Filebeat runs as a lightweight agent on each host. It tails log files or reads from journald and ships each event as JSON over a persistent connection. Filebeat is stateful: it remembers its position in each file so it doesn't re-send on restart.

Logstash is the processing layer. It receives events from Filebeat (or directly from applications), applies parsing rules (grok patterns, JSON parsing, field extraction), enriches events (add a geo-IP field from an IP address, add a hostname from the container metadata), and routes to one or more outputs.

Elasticsearch is the indexed store. Every log event is stored as a document in an index (usually one per day: logs-2026.05.02). Fields are indexed using inverted indexes — the same technology behind search engines — which is what makes full-text and field-specific queries across billions of events fast.
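
To see why indexed storage changes the risk profile, here is a toy illustration of an inverted index. It is not Elasticsearch's actual implementation, but it shows the consequence: any value that gets tokenised, including an email address, becomes a direct lookup rather than a file scan.

inverted_index_sketch.py
# Toy inverted index: token -> set of document IDs. Elasticsearch's real
# data structures are far more sophisticated, but the searchability
# consequence for PII is the same.
docs = {
    1: "login_failed user=alice@example.com ip=203.0.113.42",
    2: "login_ok user=bob@example.com ip=198.51.100.7",
}

inverted: dict[str, set[int]] = {}
for doc_id, text in docs.items():
    for token in text.replace("=", " ").split():
        inverted.setdefault(token, set()).add(doc_id)

print(inverted["alice@example.com"])  # {1}: a direct lookup, no scanning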

Kibana provides the query UI and dashboards. A developer opens Kibana, picks a time range, and runs a query like level:ERROR AND service:checkout. Kibana translates this into an Elasticsearch query and renders the matching documents.
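
Under the hood, that query becomes a JSON request against the Elasticsearch search API. The sketch below shows roughly what gets sent; the index pattern, field names, and unauthenticated localhost URL are assumptions about your setup, not universal defaults.

es_query_sketch.py
import json
import urllib.request

# Roughly the query behind "level:ERROR AND service:checkout" over the
# last 15 minutes. Field names and index pattern are illustrative.
query = {
    "query": {
        "bool": {
            "filter": [
                {"match": {"level": "ERROR"}},
                {"match": {"service": "checkout"}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    },
    "size": 50,
}

req = urllib.request.Request(
    "http://localhost:9200/logs-*/_search",
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for hit in json.load(resp)["hits"]["hits"]:
        print(hit["_source"])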

A minimal Filebeat configuration to ship application logs looks like this:

filebeat.yml
filebeat.inputs:
  - type: filestream
    id: app-logs
    paths:
      - /var/log/myapp/*.log
    parsers:
      - ndjson:
          target: ""
          add_error_key: true

processors:
  - add_host_metadata: ~
  - add_docker_metadata: ~

output.logstash:
  hosts: ["logstash:5044"]

# Every field in every log line ships verbatim.
# If your app logs user.email at DEBUG, it arrives
# in Elasticsearch with full-text indexing.

The key thing to notice: Filebeat ships everything verbatim. There is no filtering here. Whatever your application writes to /var/log/myapp/ lands in Elasticsearch with full indexing. Logstash can apply redaction filters, but only if you configure them — and most teams don't, until after a compliance audit.

Other Common Log Aggregation Tools

ELK is one option. The ecosystem has several others, each with different cost, operational overhead, and compliance implications:

Tool | Cost model | Query language | Self-hosted?
ELK Stack | Free OSS; Elastic Cloud from ~$95/mo. Infrastructure costs dominate at scale. | KQL / Lucene | Yes — primary mode
Splunk | Enterprise licence; expensive. Priced per GB/day ingested. Large orgs pay six figures/year. | SPL (Splunk Processing Language) | Yes (Splunk Enterprise)
Datadog | SaaS, pay-per-GB. No infrastructure. 15-day retention default; longer costs more. | Datadog search syntax | No — SaaS only
Grafana Loki | Free OSS; Grafana Cloud has a generous free tier. Very cheap at scale — only indexes labels, not full text. | LogQL | Yes — common choice

AWS CloudWatch Logs deserves a mention as a fifth option for teams already on AWS: zero infrastructure, automatic integration with Lambda, ECS, and EC2, and retention configurable per log group. Its query language (CloudWatch Logs Insights) is less powerful than SPL or full-text Elasticsearch, but for AWS-native workloads it's the path of least resistance.
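
On CloudWatch, retention is one API call per log group. A minimal sketch using boto3 follows; the log group name is an assumption, and the call needs AWS credentials with permission to set retention policies.

cloudwatch_retention_sketch.py
import boto3

logs = boto3.client("logs")

# Cap retention at 30 days for an application log group (name is illustrative).
# Without a retention policy, CloudWatch log groups keep events forever.
logs.put_retention_policy(
    logGroupName="/myapp/production/app",
    retentionInDays=30,
)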

From a PII perspective, the SaaS options (Datadog, Grafana Cloud) add an additional dimension: logs leave your infrastructure and are processed by a third-party's servers. This requires a signed Data Processing Agreement (DPA) and, for EU data, confirmation that data is stored within the EEA or covered by Standard Contractual Clauses.

The PII Problem in Log Aggregators

Here's the specific compliance failure mode, step by step:

  1. Your auth service logs a failed login at INFO level, including the attempted email address: INFO login_failed user=alice@example.com ip=203.0.113.42
  2. Filebeat ships this line to Logstash, which enriches it with hostname and forwards to Elasticsearch.
  3. Elasticsearch indexes the document. The user field is analysed and stored in an inverted index. alice@example.com is now instantly searchable.
  4. The document is replicated to two additional Elasticsearch nodes for redundancy.
  5. Your index lifecycle policy retains this index for 90 days before deletion.
  6. Every developer with Kibana access can now search user:alice@example.com and see every failed login attempt, from which IPs, at what times.

Under GDPR Article 5, personal data must be "adequate, relevant and limited to what is necessary" (data minimisation) and "kept in a form which permits identification of data subjects for no longer than is necessary" (storage limitation). A debug-level email address indexed in Elasticsearch for 90 days and accessible to the whole engineering team fails both requirements simultaneously.

The severity scales with log volume. An application that logs 10,000 requests per second generates roughly 860 million log events per day. If 1% of those events inadvertently include a user email, that's 8.6 million personal data records created every day — each indexed, replicated, and retained.
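
The arithmetic is worth writing down with your own traffic figures:

pii_volume_estimate.py
# Back-of-envelope estimate of how many personal data records a leaky
# logging statement creates per day. Plug in your own numbers.
requests_per_second = 10_000
events_per_day = requests_per_second * 60 * 60 * 24     # 864,000,000
email_leak_rate = 0.01                                   # 1% of events carry an email
pii_records_per_day = int(events_per_day * email_leak_rate)

print(f"{events_per_day:,} events/day, {pii_records_per_day:,} PII records/day")
# 864,000,000 events/day, 8,640,000 PII records/day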

What Typically Lands in Logs Unintentionally

Auth tokens in URL parameters
Password reset links, email verification tokens, and OAuth callbacks carry tokens as query params. These appear verbatim in access logs: GET /verify?token=eyJhbGci... (see the scrubbing sketch after this list).

Request bodies containing passwords
Frameworks that log incoming requests at DEBUG level capture POST bodies wholesale. A login form sends {"email":"...","password":"..."} — both fields land in the log.

SQL queries with literal user values
ORM debug logging and database slow-query logs record the full query text including parameter values: WHERE email = 'alice@...' AND...

Stack traces with email addresses
When an exception is thrown mid-operation, error reporters and debug-mode exception handlers often capture local variables alongside the stack trace. If a user object was in scope, its fields appear in the report.

Headers containing session cookies
Debugging an HTTP client failure by logging the full request and response headers dumps Cookie: session=s8q3k9p2; auth=eyJ... into the log.

User-agent + IP address combinations
Individually neither field is catastrophic. Combined with a timestamp they form a fingerprint that uniquely identifies a device and therefore a person — which is personal data under GDPR.

Error messages echoing user input
Validation errors that echo the invalid value — Invalid email address: ali ce@example.com — log the user's attempted input along with context that makes it identifiable.

Payment webhook payloads
Stripe, PayPal, and similar providers send webhook JSON that includes billing name, email, partial card numbers, and address. Logging the raw payload "for debugging" stores all of it.
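
Several of these leaks are URLs with secrets in the query string. The sketch below shows the kind of helper worth calling before any URL reaches a log statement; the parameter denylist is an illustrative starting point, not an exhaustive one.

scrub_url.py
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters whose values should never be logged; extend for your app.
SENSITIVE_PARAMS = {"token", "code", "state", "session", "password", "api_key"}

def scrub_url(url: str) -> str:
    """Replace sensitive query-string values before the URL is logged."""
    parts = urlsplit(url)
    scrubbed = [
        (key, "[REDACTED]" if key.lower() in SENSITIVE_PARAMS else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(scrubbed, safe="[]/")))

print(scrub_url("/verify?token=eyJhbGci...&next=/account"))
# /verify?token=[REDACTED]&next=/account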

How to Fix It Before Logs Leave the Service

Sanitising at the application level — before the log line is written to stdout or a file — is the most reliable approach. By the time Filebeat ships the line to Logstash, the PII is already committed to the log file on disk. Fix the source, not the pipe.

Python: logging.Filter at the handler level

Subclassing logging.Filter and attaching it to every handler on the root logger guarantees that every log record, including those from third-party libraries you don't control, passes through your redaction logic before it is written out. Note that a filter attached to the root logger itself is not applied to records propagated from child loggers, which is why the filter belongs on the handlers. This pattern is covered in detail in the how to redact PII in Python guide.

pii_filter.py
import logging
import re

EMAIL_RE    = re.compile(r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}')
IP_RE       = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')
TOKEN_RE    = re.compile(r'Bearer\s+[A-Za-z0-9\-_=]+\.[A-Za-z0-9\-_=]+\.[A-Za-z0-9\-_.+/=]+')

class PiiRedactFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        msg = EMAIL_RE.sub('[REDACTED_EMAIL]', msg)
        msg = IP_RE.sub('[REDACTED_IP]', msg)
        msg = TOKEN_RE.sub('[REDACTED_TOKEN]', msg)
        # Replace the pre-formatted message so handlers see the redacted text
        record.msg  = msg
        record.args = ()
        return True  # always allow the record through

# Attach once at startup, to every handler on the root logger. Handler-level
# filters run for records propagated from third-party loggers; a filter on
# the root logger itself would not.
for handler in logging.getLogger().handlers:
    handler.addFilter(PiiRedactFilter())
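
A quick way to verify the wiring is to configure a basic handler and emit a PII-bearing line from a logger name you don't control. This is a minimal check, assuming the filter module above is saved as pii_filter.py; the vendor logger name is illustrative.

verify_filter.py
import logging

from pii_filter import PiiRedactFilter  # the filter class defined above

logging.basicConfig(level=logging.INFO)  # creates a StreamHandler on the root logger
for handler in logging.getLogger().handlers:
    handler.addFilter(PiiRedactFilter())

# A logger name we don't control, e.g. inside a vendor SDK
logging.getLogger("vendor.sdk.auth").info(
    "login_failed user=alice@example.com ip=203.0.113.42"
)
# Output: INFO:vendor.sdk.auth:login_failed user=[REDACTED_EMAIL] ip=[REDACTED_IP]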

Structured logging with an explicit field allowlist

Unstructured string logging — logger.info(f"User {user.email} logged in") — makes redaction harder because you're pattern-matching free text. Structured logging with structlog lets you log named fields explicitly, which makes it trivial to audit what goes to your aggregator:

structured_logging.py
import structlog
import hashlib

def hash_identifier(value: str) -> str:
    """One-way hash for correlation without storing the raw identifier."""
    return hashlib.sha256(value.encode()).hexdigest()[:16]

log = structlog.get_logger()

# BAD — the raw email goes into the structured event dict,
# gets indexed as a field in Elasticsearch
log.info("login_failed", email=user.email, ip=request.ip)

# GOOD — log a stable correlatable identifier, not the PII itself
log.info(
    "login_failed",
    user_hash=hash_identifier(user.email),  # correlatable, not reversible
    ip_subnet=request.ip.rsplit('.', 1)[0] + '.0',  # subnet, not host
    event_count=1,
)

# NEVER log request bodies — log field names only
log.debug(
    "request_received",
    method=request.method,
    path=request.path,
    content_type=request.content_type,
    body_fields=list(request.json.keys()),  # ["email","password"] — not values
)

How to Fix It at the Aggregator Level

Aggregator-level filtering is a safety net, not a substitute for source-level sanitisation. But it matters, because you don't control every piece of software writing logs — third-party libraries, framework internals, and vendor SDKs may log PII regardless of your own code hygiene.

Logstash: mutate/gsub filter

logstash.conf (filter section)
filter {
  mutate {
    # Redact email addresses in the message field
    gsub => [
      "message", "[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}",
      "[REDACTED_EMAIL]"
    ]
  }

  # Remove entire fields that should never reach Elasticsearch
  mutate {
    remove_field => ["[request][headers][cookie]",
                     "[request][headers][authorization]",
                     "[request][body]"]
  }
}

Datadog scrubbing rules

In the Datadog Agent configuration, logs_config.processing_rules accepts a mask_sequences rule type that redacts matched patterns before the event is sent to Datadog's servers. This is preferable to relying on Datadog's server-side sensitive data scanner, because the mask happens in the agent — on your infrastructure — before data leaves your network:

datadog.yaml
logs_config:
  processing_rules:
    - type: mask_sequences
      name: redact_emails
      replace_placeholder: "[REDACTED_EMAIL]"
      pattern: "[a-zA-Z0-9._%+\\-]+@[a-zA-Z0-9.\\-]+\\.[a-zA-Z]{2,}"
    - type: mask_sequences
      name: redact_bearer_tokens
      replace_placeholder: "[REDACTED_TOKEN]"
      pattern: "Bearer [A-Za-z0-9\\-_=]+\\.[A-Za-z0-9\\-_=]+\\.[A-Za-z0-9\\-_.+/=]+"

Splunk: field masking at index time

Splunk supports masking events as they are indexed, either with a SEDCMD rule in props.conf or with a regex transform defined in transforms.conf and referenced from props.conf. Like Logstash gsub (and unlike Datadog's agent-side masking), this runs at parse/index time, after the event has left the originating host — meaning the raw event has already crossed your network boundary. For SaaS Splunk Cloud, data is already on Splunk's infrastructure by the time masking runs, which is why application- or forwarder-level masking is always preferable.
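
A minimal props.conf sketch of the SEDCMD approach, assuming a sourcetype named myapp_json (both the sourcetype and the rule name are illustrative):

props.conf
[myapp_json]
# Mask email addresses at index time, before the event is written to the index.
SEDCMD-redact_emails = s/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/[REDACTED_EMAIL]/g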

Aggregator-level filtering is last resort

Logstash, Datadog, and Splunk all provide redaction capabilities. Treat them as a defence-in-depth layer, not a primary control. By the time an event reaches your aggregator, it has already been written to a log file on disk, been read by a shipping agent, and traversed your internal network. Sanitise at source.

PII in Logs Audit Checklist

Use this checklist to assess your current exposure. Each item maps to a specific, searchable risk in your aggregator:

1. Does your logger log request bodies?
Search Kibana for body: or payload: in the last 24 hours. Any hits warrant immediate investigation. Even partial bodies logged for debugging are likely to contain passwords, tokens, or personal fields.

2. Do SQL error messages include query parameter values?
Search for WHERE followed by an email pattern. Django, SQLAlchemy, and ActiveRecord debug modes log full queries with bound values. If you can't disable this in production, add a Logstash filter to strip anything after WHERE.

3. Does your auth service log tokens or session IDs?
A JWT or opaque session token in your logs is functionally equivalent to a stolen credential for whoever can query your aggregator. Search for Bearer, eyJ (JWT prefix), and your session cookie name.

4. Are there INFO-level logs in production that include email addresses?
DEBUG logs are expected to be verbose, but INFO should never contain personal identifiers. Search Kibana for level:INFO AND /@[a-z]/. Any results indicate a logging statement that needs to be changed to log a hashed identifier or nothing at all.

5. Do you have data retention policies configured on your indexes?
Open your Index Lifecycle Management (ILM) policy in Kibana. If you don't have one, all indexes accumulate indefinitely. GDPR Article 5(1)(e) requires a defined retention period — if you can't name one, you're in violation. 30–90 days for access logs and 12 months for security logs are defensible starting points (see the ILM sketch after this checklist).

6. Have you tested with a real PII search in Kibana?
Open Kibana's Discover view, set the time range to 90 days, and search for a real user's email address. Do you get hits? How many? From which services? This is the single most effective audit step — run it quarterly. What you find will drive the rest of your remediation.
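
If you need a starting point for item 5, the sketch below creates a simple 90-day delete policy via the Elasticsearch ILM API. The policy name, retention period, and unauthenticated localhost URL are assumptions; adjust them to your documented retention and attach the policy to your log index templates.

ilm_policy_sketch.py
import json
import urllib.request

# A delete-after-90-days lifecycle policy. min_age should match whatever
# retention period you have actually documented.
policy = {
    "policy": {
        "phases": {
            "delete": {"min_age": "90d", "actions": {"delete": {}}},
        }
    }
}

req = urllib.request.Request(
    "http://localhost:9200/_ilm/policy/logs-90d-retention",  # assumed URL and name
    data=json.dumps(policy).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())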

FAQ

What is log aggregation?

Log aggregation is the practice of collecting log output from multiple services, containers, and servers and routing it to a single centralised store — such as Elasticsearch, Splunk, or Datadog — where logs from all components can be searched together. Instead of SSH-ing into each machine and tailing files, you query one interface and see everything.

Why is PII a compliance risk in log aggregators?

When logs containing personal data land in a centralised aggregator, that data is indexed, replicated across cluster nodes, and accessible to every developer with query access. A single user.email logged at DEBUG level sits in Elasticsearch for your entire retention period — typically 30–90 days — without a specific lawful basis for being there. Under GDPR Article 5 this violates both the data minimisation and storage limitation principles simultaneously.

Is Grafana Loki safer for PII than Elasticsearch?

Loki is cheaper to operate because it only indexes labels, not the full log line text. But this doesn't reduce your PII risk — the raw log lines are still stored and accessible via log streaming queries. The difference is search performance: Loki makes it slower to find PII in raw text (no inverted index), but it's still findable. You still need to sanitise at source.

Can I use Logstash filters to remove all PII from my logs?

Logstash gsub filters can catch structured PII like email addresses and known-format tokens. But PII in free text — a name embedded in an error message, an address echoed in a validation error — won't be caught by regex. Logstash filters are a useful defence-in-depth layer, but the primary fix must be at the application level: don't log the PII in the first place.
