What Is Log Aggregation — and Why PII Is a Risk
Centralising logs from dozens of services is how modern teams stay observable. It's also how personal data ends up indexed, replicated, and accessible to everyone with Kibana access.
What is log aggregation: the practice of collecting log output from every service, container, and server in your system and routing it to a single centralised store — so instead of SSH-ing into each machine and tailing files, you run one query and see everything. Tools like the ELK Stack, Splunk, Datadog, and Grafana Loki make this straightforward. The compliance problem is that logs travelling to those stores often carry user emails, session tokens, and payment data that were never meant to be indexed, retained for 90 days, and visible to the whole engineering team.
What Log Aggregation Actually Does
Before aggregation, the typical debugging experience is: something goes wrong in production, a developer SSHes into the affected server, runs tail -f /var/log/app/error.log, and tries to piece together what happened from one machine's perspective. In a microservices architecture with 20 services across 40 containers that auto-scale, this approach becomes impossible. By the time you've found the right container it may have already been destroyed.
Log aggregation solves this by installing a lightweight agent on every host (or as a sidecar in every container) that tails your log files or consumes from stdout and ships each log line to a central service. That central service indexes the data so you can search across all services simultaneously: show me every ERROR from the payment service in the last 10 minutes that also has a matching request ID from the auth service.
Beyond debugging, a central log store enables alerting (fire a PagerDuty notification when error rate crosses a threshold), dashboarding, SLA monitoring, security auditing, and capacity planning — all from a single query interface.
The critical implication for PII
Central indexing means every field in every log line is fully searchable. A user.email field that your application logs at DEBUG level isn't just written to a file on one server — it's indexed, replicated across cluster nodes, retained for your configured period, and queryable by any developer with dashboard access. That's a fundamentally different risk profile than a local file.
How the ELK Stack Works
The ELK Stack — Elasticsearch, Logstash, and Kibana, typically fed by the Filebeat shipping agent — is the most widely deployed open-source log aggregation solution. Understanding its architecture makes the PII risk concrete.
Filebeat runs as a lightweight agent on each host. It tails log files or reads from journald and ships each event as JSON over a persistent connection. Filebeat is stateful: it remembers its position in each file so it doesn't re-send on restart.
Logstash is the processing layer. It receives events from Filebeat (or directly from applications), applies parsing rules (grok patterns, JSON parsing, field extraction), enriches events (add a geo-IP field from an IP address, add a hostname from the container metadata), and routes to one or more outputs.
Elasticsearch is the indexed store. Every log event is stored as a document in an index (usually one per day: logs-2026.05.02). Fields are indexed using inverted indexes — the same technology behind search engines — which is what makes full-text and field-specific queries across billions of events fast.
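To make "inverted index" concrete, here is a toy sketch in Python of the structure Elasticsearch builds for every analysed field (illustrative only, nothing like the real Lucene implementation):

```python
from collections import defaultdict

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Toy inverted index: token -> set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

logs = {
    1: "INFO login_failed user=alice@example.com",
    2: "ERROR timeout service=checkout",
    3: "INFO login_failed user=alice@example.com",
}
index = build_index(logs)

# One dictionary lookup finds every event mentioning the email,
# with no scan of the raw log lines required.
print(sorted(index["user=alice@example.com"]))  # [1, 3]
```

This is why an indexed field is a different risk than a line in a file: the lookup cost is constant regardless of how many billions of events sit in the store.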
Kibana provides the query UI and dashboards. A developer opens Kibana, picks a time range, and runs a query like level:ERROR AND service:checkout. Kibana translates this into an Elasticsearch query and renders the matching documents.
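Under the hood, Kibana compiles that query into an Elasticsearch query DSL request against the _search endpoint. A simplified sketch of the generated JSON (the real request adds time-range and highlighting clauses):

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "level": "ERROR" } },
        { "term": { "service": "checkout" } }
      ]
    }
  }
}
```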
A minimal Filebeat configuration to ship application logs looks like this:
```yaml
filebeat.inputs:
  - type: filestream
    id: app-logs
    paths:
      - /var/log/myapp/*.log
    parsers:
      - ndjson:
          target: ""
          add_error_key: true

processors:
  - add_host_metadata: ~
  - add_docker_metadata: ~

output.logstash:
  hosts: ["logstash:5044"]

# Every field in every log line ships verbatim.
# If your app logs user.email at DEBUG, it arrives
# in Elasticsearch with full-text indexing.
```
The key thing to notice: Filebeat ships everything verbatim. There is no filtering here. Whatever your application writes to /var/log/myapp/ lands in Elasticsearch with full indexing. Logstash can apply redaction filters, but only if you configure them — and most teams don't, until after a compliance audit.
Other Common Log Aggregation Tools
ELK is one option. The ecosystem has several others, each with different cost, operational overhead, and compliance implications:
| Tool | Cost model | Query language | Self-hosted? |
|---|---|---|---|
| ELK Stack | Free OSS; Elastic Cloud from ~$95/mo. Infrastructure costs dominate at scale. | KQL / Lucene | Yes — primary mode |
| Splunk | Enterprise licence; expensive. Priced per GB/day ingested. Large orgs pay six figures/year. | SPL (Splunk Processing Language) | Yes (Splunk Enterprise) |
| Datadog | SaaS, pay-per-GB. No infrastructure. 15-day retention default; longer costs more. | Datadog search syntax | No — SaaS only |
| Grafana Loki | Free OSS; Grafana Cloud has a generous free tier. Very cheap at scale — only indexes labels, not full text. | LogQL | Yes — common choice |
AWS CloudWatch Logs deserves a mention as a fifth option for teams already on AWS: zero infrastructure, automatic integration with Lambda, ECS, and EC2, and retention configurable per log group. Its query language (CloudWatch Logs Insights) is less powerful than SPL or full-text Elasticsearch, but for AWS-native workloads it's the path of least resistance.
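As a sketch, a CloudWatch Logs Insights query that hunts for email addresses in recent events looks like this (the @timestamp and @message fields are Insights defaults):

```
fields @timestamp, @message
| filter @message like /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+/
| sort @timestamp desc
| limit 50
```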
From a PII perspective, the SaaS options (Datadog, Grafana Cloud) add an additional dimension: logs leave your infrastructure and are processed by a third-party's servers. This requires a signed Data Processing Agreement (DPA) and, for EU data, confirmation that data is stored within the EEA or covered by Standard Contractual Clauses.
The PII Problem in Log Aggregators
Here's the specific compliance failure mode, step by step:
1. Your auth service logs a failed login at INFO level, including the attempted email address: INFO login_failed user=alice@example.com ip=203.0.113.42
2. Filebeat ships this line to Logstash, which enriches it with hostname and forwards to Elasticsearch.
3. Elasticsearch indexes the document. The user field is analysed and stored in an inverted index. alice@example.com is now instantly searchable.
4. The document is replicated to two additional Elasticsearch nodes for redundancy.
5. Your index lifecycle policy retains this index for 90 days before deletion.
6. Every developer with Kibana access can now search user:alice@example.com and see every failed login attempt, from which IPs, at what times.
Under GDPR Article 5, personal data must be "adequate, relevant and limited to what is necessary" (data minimisation) and "kept in a form which permits identification of data subjects for no longer than is necessary" (storage limitation). A debug-level email address indexed in Elasticsearch for 90 days and accessible to the whole engineering team fails both requirements simultaneously.
The severity scales with log volume. An application that logs 10,000 requests per second generates roughly 860 million log events per day. If 1% of those events inadvertently include a user email, that's 8.6 million personal data records created every day — each indexed, replicated, and retained.
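Spelled out, the arithmetic behind those numbers (rates taken from the paragraph above):

```python
requests_per_second = 10_000
events_per_day = requests_per_second * 60 * 60 * 24
print(events_per_day)            # 864000000 ("roughly 860 million")

pii_rate = 0.01                  # 1% of events carry an email
pii_records_per_day = int(events_per_day * pii_rate)
print(pii_records_per_day)       # 8640000 ("8.6 million" per day)

# Over a 90-day retention window, records accumulate:
print(pii_records_per_day * 90)  # 777600000 personal data records live at once
```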
What Typically Lands in Logs Unintentionally
- Tokens in URLs: access logs capture the full request path, so GET /verify?token=eyJhbGci... puts a live token in the log.
- Request bodies logged for debugging: {"email":"...","password":"..."} — both fields land in the log.
- SQL queries with bound values: ORM debug output logs WHERE email = 'alice@...' AND...
- Stack traces: if a user object was in scope, its fields appear in the traceback.
- Dumped request headers: middleware writes Cookie: session=s8q3k9p2; auth=eyJ... into the log.
- Validation error messages: Invalid email address: ali ce@example.com logs the user's attempted input along with context that makes it identifiable.
How to Fix It Before Logs Leave the Service
Sanitising at the application level — before the log line is written to stdout or a file — is the most reliable approach. By the time Filebeat ships the line to Logstash, the PII is already committed to the log file on disk. Fix the source, not the pipe.
Python: logging.Filter at the handler level
Subclassing logging.Filter and attaching it to every handler on the root logger guarantees that every record those handlers emit, including records propagated from third-party libraries you don't control, passes through your redaction logic before it is written anywhere. (A filter attached to the root logger itself would not see propagated records; handler-level filters do.) This pattern is covered in detail in the how to redact PII in Python guide.
```python
import logging
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}')
IP_RE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')
TOKEN_RE = re.compile(r'Bearer\s+[A-Za-z0-9\-_=]+\.[A-Za-z0-9\-_=]+\.[A-Za-z0-9\-_.+/=]+')

class PiiRedactFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        msg = EMAIL_RE.sub('[REDACTED_EMAIL]', msg)
        msg = IP_RE.sub('[REDACTED_IP]', msg)
        msg = TOKEN_RE.sub('[REDACTED_TOKEN]', msg)
        # Replace the pre-formatted message so handlers see the redacted text
        record.msg = msg
        record.args = ()
        return True  # always allow the record through

# Attach once at startup, to every handler: handler-level filters see all
# records the handler processes, including those from third-party loggers
for handler in logging.getLogger().handlers:
    handler.addFilter(PiiRedactFilter())
```
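To verify the approach end to end, a self-contained sketch: attach a compact email-only variant of the filter (named RedactEmails here for illustration) to a handler writing into a StringIO buffer, then inspect what the handler emits:

```python
import io
import logging
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}')

class RedactEmails(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL_RE.sub('[REDACTED_EMAIL]', record.getMessage())
        record.args = ()
        return True

# Route output into a buffer so we can inspect what the handler writes
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.addFilter(RedactEmails())

logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("login_failed user=%s", "alice@example.com")
print(buffer.getvalue().strip())  # login_failed user=[REDACTED_EMAIL]
```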
Structured logging with an explicit field allowlist
Unstructured string logging — logger.info(f"User {user.email} logged in") — makes redaction harder because you're pattern-matching free text. Structured logging with structlog lets you log named fields explicitly, which makes it trivial to audit what goes to your aggregator:
```python
import structlog
import hashlib

def hash_identifier(value: str) -> str:
    """One-way hash for correlation without storing the raw identifier."""
    return hashlib.sha256(value.encode()).hexdigest()[:16]

log = structlog.get_logger()

# BAD — the raw email goes into the structured event dict,
# gets indexed as a field in Elasticsearch
log.info("login_failed", email=user.email, ip=request.ip)

# GOOD — log a stable correlatable identifier, not the PII itself
log.info(
    "login_failed",
    user_hash=hash_identifier(user.email),          # correlatable, not reversible
    ip_subnet=request.ip.rsplit('.', 1)[0] + '.0',  # subnet, not host
    event_count=1,
)

# NEVER log request bodies — log field names only
log.debug(
    "request_received",
    method=request.method,
    path=request.path,
    content_type=request.content_type,
    body_fields=list(request.json.keys()),  # ["email","password"] — not values
)
```
How to Fix It at the Aggregator Level
Aggregator-level filtering is a safety net, not a substitute for source-level sanitisation. But it matters, because you don't control every piece of software writing logs — third-party libraries, framework internals, and vendor SDKs may log PII regardless of your own code hygiene.
Logstash: mutate/gsub filter
```
filter {
  mutate {
    # Redact email addresses in the message field
    gsub => [
      "message", "[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}",
      "[REDACTED_EMAIL]"
    ]
  }
  # Remove entire fields that should never reach Elasticsearch
  mutate {
    remove_field => ["[request][headers][cookie]",
                     "[request][headers][authorization]",
                     "[request][body]"]
  }
}
```
Datadog scrubbing rules
In the Datadog Agent configuration, logs_config.processing_rules accepts a mask_sequences rule type that redacts matched patterns before the event is sent to Datadog's servers. This is preferable to relying on Datadog's server-side sensitive data scanner, because the mask happens in the agent — on your infrastructure — before data leaves your network:
```yaml
logs_config:
  processing_rules:
    - type: mask_sequences
      name: redact_emails
      replace_placeholder: "[REDACTED_EMAIL]"
      pattern: "[a-zA-Z0-9._%+\\-]+@[a-zA-Z0-9.\\-]+\\.[a-zA-Z]{2,}"
    - type: mask_sequences
      name: redact_bearer_tokens
      replace_placeholder: "[REDACTED_TOKEN]"
      pattern: "Bearer [A-Za-z0-9\\-_=]+\\.[A-Za-z0-9\\-_=]+\\.[A-Za-z0-9\\-_.+/=]+"
```
Splunk: field masking at index time
Splunk supports index-time masking via SEDCMD rules in props.conf (or REGEX transforms in transforms.conf) that rewrite events as they are indexed. Like Logstash gsub, and unlike Datadog's agent-side masking, this runs after the raw event has already crossed your network boundary. For SaaS Splunk Cloud, data is already on Splunk's infrastructure by the time masking runs, which is why agent-level masking is always preferable.
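As a sketch, an index-time SEDCMD mask lives in props.conf under the relevant sourcetype (the sourcetype name app_logs here is an assumption):

```
# props.conf
[app_logs]
SEDCMD-redact_email = s/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/[REDACTED_EMAIL]/g
```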
Aggregator-level filtering is last resort
Logstash, Datadog, and Splunk all provide redaction capabilities. Treat them as a defence-in-depth layer, not a primary control. By the time an event reaches your aggregator, it has already been written to a log file on disk, been read by a shipping agent, and traversed your internal network. Sanitise at source.
PII in Logs Audit Checklist
Use this checklist to assess your current exposure. Each item maps to a specific, searchable risk in your aggregator:
- Search your aggregator for body: or payload: in the last 24 hours. Any hits warrant immediate investigation. Even partial bodies logged for debugging are likely to contain passwords, tokens, or personal fields.
- Search for WHERE followed by an email pattern. Django, SQLAlchemy, and ActiveRecord debug modes log full queries with bound values. If you can't disable this in production, add a Logstash filter to strip anything after WHERE.
- Search for Bearer, eyJ (the JWT prefix), and your session cookie name.
- Search level:INFO AND /@[a-z]/. Any results indicate a logging statement that needs to be changed to log a hashed identifier or nothing at all.
FAQ
What is log aggregation?
Log aggregation is the practice of collecting log output from multiple services, containers, and servers and routing it to a single centralised store — such as Elasticsearch, Splunk, or Datadog — where logs from all components can be searched together. Instead of SSH-ing into each machine and tailing files, you query one interface and see everything.
Why is PII a compliance risk in log aggregators?
When logs containing personal data land in a centralised aggregator, that data is indexed, replicated across cluster nodes, and accessible to every developer with query access. A single user.email logged at DEBUG level sits in Elasticsearch for your entire retention period — typically 30–90 days — without a specific lawful basis for being there. Under GDPR Article 5 this violates both the data minimisation and storage limitation principles simultaneously.
Is Grafana Loki safer for PII than Elasticsearch?
Loki is cheaper to operate because it only indexes labels, not the full log line text. But this doesn't reduce your PII risk — the raw log lines are still stored and accessible via log streaming queries. The difference is search performance: Loki makes it slower to find PII in raw text (no inverted index), but it's still findable. You still need to sanitise at source.
Can I use Logstash filters to remove all PII from my logs?
Logstash gsub filters can catch structured PII like email addresses and known-format tokens. But PII in free text — a name embedded in an error message, an address echoed in a validation error — won't be caught by regex. Logstash filters are a useful defence-in-depth layer, but the primary fix must be at the application level: don't log the PII in the first place.
Sanitize Logs Before They Leave Your Machine
Remove emails, IPs, tokens and API keys from log snippets instantly — free, runs entirely in your browser, nothing uploaded.
Open Log Sanitizer — Free →