

Kubernetes Audit-Ready Runtime Forensics in Under 24 Hours
Map pod → node → cloud principal → API action using eBPF plus CloudTrail and export a deterministic evidence bundle fast.
Introduction
Kubernetes incident forensics breaks down when you can’t prove which workload (pod/service account) triggered specific cloud control-plane API calls. The root cause is identity and network indirection: kube RBAC, IRSA/Workload Identity, node roles, NAT egress, and shared credentials blur attribution. Skynet’s approach is standardized execution: deploy an ephemeral, repeatable forensic stack, pull only a strict time window of telemetry, and deterministically map pod → node → cloud principal → API action. The outcome is an audit-ready timeline bundle with validation checks, produced in hours instead of days.
Quick Take
- Pod-to-cloud attribution fails without pod-level runtime provenance and control-plane logs in the same time window.
- eBPF telemetry should be enriched with pod UID, serviceAccount, and image digest to withstand redeploy churn.
- Cloud control-plane logs (AWS CloudTrail, Azure Activity Log) must be queried for STS/role sessions and API calls, not just the final action.
- Deterministic correlation hinges on stable join keys: timestamps, node identity/ENI, source IP, and role session (principalId/sessionName).
- Deliverables should be a signed timeline plus “gaps checks” (missing audit logs, retention limits, clock drift) to avoid false confidence.
Standardized execution: the minimum forensic stack that actually correlates
What you must capture (and why)
To prove pod → cloud API action, you need three planes of evidence captured for the same bounded interval:
- Runtime provenance (process + network) per pod: “what executed” and “what endpoints were contacted,” with Kubernetes identity attached.
- Kubernetes control-plane evidence: API server audit events for authn/authz context, object changes, and serviceAccount usage.
- Cloud control-plane evidence: API activity with the calling principal, session, source IP, and relevant resources.
Deploy an ephemeral eBPF sensor and collectors
Skynet’s runbook-style execution uses a short-lived forensic deployment that can be removed cleanly after evidence export. In practice, this is typically a DaemonSet for eBPF sensors plus minimal log collection.
Example: deploy Cilium Tetragon as an eBPF sensor (cluster-specific manifests omitted intentionally; pin the version you already approve and keep it reproducible):
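A minimal sketch of such a deployment via Helm, assuming the upstream Cilium chart repository; the chart version shown is a placeholder for whichever release you have approved:

```shell
# Install Tetragon (eBPF sensor DaemonSet) from the upstream Cilium Helm repo.
# The --version value is a placeholder; pin the release you have approved.
helm repo add cilium https://helm.cilium.io
helm repo update
helm install tetragon cilium/tetragon \
  --namespace kube-system \
  --version 1.1.0

# Wait for the DaemonSet to be ready on all nodes before the evidence window opens.
kubectl -n kube-system rollout status daemonset/tetragon
```

Removal is equally scripted (`helm uninstall tetragon -n kube-system`), which keeps the deployment ephemeral.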
If you prefer policy-focused runtime visibility, Falco with eBPF can capture exec/connect signals as well. The key requirement is enrichment: pod UID, namespace, serviceAccount, and container image digest.
Time-bounding and evidence minimization
Standardized execution should enforce a tight scope:
- Start time: earliest suspected malicious activity.
- End time: containment action completed (or a fixed “+2h” to catch stragglers).
- Collection: only the fields required for correlation and audit (avoid dumping entire clusters).
This isn’t about being conservative; it’s about producing an evidence bundle that is reviewable and defensible.
Capture pod-level provenance with eBPF (exec + connect) and enrich identity
Required fields for correlation
From the runtime side, you need enough to uniquely identify the workload and the action:
- Event timestamp (high resolution if available)
- Pod UID, namespace, name
- serviceAccount name
- Node name
- Container image digest
- Process path/args (for exec)
- Destination IP/port and protocol (for connect)
Example: filter eBPF logs down to a suspect namespace and emit JSON lines for later joins:
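One way to sketch this with Tetragon’s JSON export and jq; the namespace name is a placeholder, and the exact event field paths are assumptions that depend on your sensor version and tracing policies:

```shell
# Stream Tetragon events from the sensor pods, keep only the suspect namespace,
# and write compact JSON lines for later joins. Field paths may differ by version:
# exec events arrive as process_exec, connect events via kprobe tracing policies.
kubectl logs -n kube-system -l app.kubernetes.io/name=tetragon -c export-stdout \
  | jq -c 'select(
      (.process_exec.process.pod.namespace
        // .process_kprobe.process.pod.namespace) == "suspect-ns")' \
  > runtime.suspect-ns.jsonl
```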
Tie runtime events to node and infrastructure identity
For AWS, you’ll often need to map node → instanceId → ENI(s) → private IP(s). Capture these at collection time so correlation does not depend on later reconstruction.
Example: export a node inventory snapshot (instance ID and provider IDs) from Kubernetes:
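A jq-based sketch; the output path follows the bundle layout used later in this article:

```shell
# Snapshot node identity: name, providerID (which encodes the EC2 instance ID
# on AWS), and the internal IP used for later sourceIPAddress joins.
mkdir -p inventory
kubectl get nodes -o json | jq -c '.items[] | {
    name: .metadata.name,
    providerID: .spec.providerID,
    internalIP: (first(.status.addresses[] | select(.type == "InternalIP")) | .address)
  }' > inventory/nodes.jsonl
```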
If you run EKS with IRSA, also snapshot serviceAccount annotations (role ARN) for the suspect namespaces:
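For example (namespace and output path are placeholders; the annotation key is the standard IRSA one):

```shell
# Record IRSA bindings: the eks.amazonaws.com/role-arn annotation maps a
# serviceAccount to the IAM role its pods can assume via web identity.
kubectl get serviceaccounts -n suspect-ns -o json | jq -c '.items[] | {
    namespace: .metadata.namespace,
    name: .metadata.name,
    roleArn: .metadata.annotations["eks.amazonaws.com/role-arn"]
  }' > inventory/serviceaccounts.suspect-ns.jsonl
```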
Correlate to cloud control-plane activity (CloudTrail) and attribute principals
Query the right CloudTrail events for attribution
For AWS, attribution typically hinges on:
- STS session establishment (AssumeRole, AssumeRoleWithWebIdentity, GetCallerIdentity)
- The target API actions of interest (e.g., PutObject, GetSecretValue, CreateAccessKey, ModifySecurityGroupRules)
Example: pull all role assumption events for the window:
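A sketch using the CloudTrail lookup API; the timestamps are placeholders for your incident window:

```shell
# Pull role-assumption events for the bounded window. lookup-events accepts a
# single lookup attribute per call, so run once per event name of interest.
for ev in AssumeRole AssumeRoleWithWebIdentity; do
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue="$ev" \
    --start-time 2024-05-01T10:00:00Z \
    --end-time   2024-05-01T14:00:00Z \
    --output json > "cloudtrail.sts.$ev.json"
done
```

For large windows, prefer querying the CloudTrail S3 archive (e.g., with Athena); `lookup-events` is rate-limited and capped at 90 days.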
Then pull actions for the suspected principal/role session(s). The most reliable join keys vary by organization, but common pivots include username (assumed-role ARN), principalId, sourceIPAddress, and eventTime.
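For example, pivoting on the role session name via the Username lookup attribute (session name and timestamps are placeholders):

```shell
# Pull activity for one suspected role session. For assumed-role identities the
# Username lookup attribute matches the session-name portion of the ARN.
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=suspect-session-name \
  --start-time 2024-05-01T10:00:00Z \
  --end-time   2024-05-01T14:00:00Z \
  --output json > cloudtrail.actions.by-session.json
```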
Map CloudTrail source to Kubernetes nodes (and then pods)
You generally bridge cloud → cluster via one (or more) of:
- sourceIPAddress (when it’s a node IP / egress IP you can map)
- VPC flow context (if available in your environment)
- ENI identifiers (from infrastructure inventory)
- role session naming conventions (if you set them deterministically)
This is why the node inventory snapshot matters. In many incidents, multiple pods share a node and share egress; you need runtime connect events to isolate which pod initiated outbound traffic at the same timestamps.
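The node-inventory bridge can be sketched as a jq join, assuming the `inventory/nodes.jsonl` and `cloudtrail.actions.by-session.json` files produced earlier; note this only works when pods egress with the node’s internal IP (no NAT gateway in between):

```shell
# For each CloudTrail record, attach the node whose internal IP matches
# sourceIPAddress. Unmatched records get node: null and need other pivots.
jq -c --slurpfile nodes inventory/nodes.jsonl '
  .Events[]
  | (.CloudTrailEvent | fromjson) as $e
  | {ts: $e.eventTime, action: $e.eventName, sourceIP: $e.sourceIPAddress,
     node: ([$nodes[] | select(.internalIP == $e.sourceIPAddress) | .name] | first)}
' cloudtrail.actions.by-session.json > joins.source-ip.jsonl
```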
Build a deterministic timeline bundle (and prove your gaps)
Produce the evidence bundle: timeline + joins + validation
A defensible artifact set includes:
- timeline.jsonl (normalized events)
- joins.jsonl (explicit mapping podUID → principal/session → API action)
- inventory/ (nodes, serviceAccounts/roles, container digests)
- validation.json (gaps checks and collection metadata)
Below is a minimal join script that:
- reads Kubernetes audit log JSON lines (kube-audit.jsonl)
- reads CloudTrail lookup output (cloudtrail.actions.by-session.json)
- emits a joined, time-ordered timeline
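A minimal sketch in Python. The kube audit fields used (requestReceivedTimestamp, user.username, verb, objectRef) are standard audit-event fields, and the CloudTrail parsing matches `lookup-events` output, but verify the shapes against your own exports before relying on this:

```python
#!/usr/bin/env python3
"""Normalize kube audit JSONL and CloudTrail lookup-events output into a
single time-ordered timeline.jsonl (file names follow the bundle layout)."""
import json
import sys
from datetime import datetime


def parse_ts(value):
    # Both sources emit ISO-8601; normalize the trailing "Z" for fromisoformat.
    return datetime.fromisoformat(str(value).replace("Z", "+00:00"))


def load_kube_audit(path):
    events = []
    with open(path) as fh:
        for line in fh:
            if not line.strip():
                continue
            ev = json.loads(line)
            events.append({
                "ts": parse_ts(ev["requestReceivedTimestamp"]),
                "source": "kube-audit",
                "actor": ev.get("user", {}).get("username"),
                "action": ev.get("verb"),
                "detail": ev.get("objectRef", {}),
            })
    return events


def load_cloudtrail(path):
    events = []
    with open(path) as fh:
        doc = json.load(fh)
    for item in doc.get("Events", []):
        # lookup-events wraps the raw record in the CloudTrailEvent JSON string.
        rec = json.loads(item["CloudTrailEvent"])
        events.append({
            "ts": parse_ts(rec["eventTime"]),
            "source": "cloudtrail",
            "actor": rec.get("userIdentity", {}).get("arn"),
            "action": rec.get("eventName"),
            "detail": {"sourceIPAddress": rec.get("sourceIPAddress")},
        })
    return events


def build_timeline(kube_path, ct_path, out_path):
    events = load_kube_audit(kube_path) + load_cloudtrail(ct_path)
    events.sort(key=lambda e: e["ts"])
    with open(out_path, "w") as out:
        for ev in events:
            ev["ts"] = ev["ts"].isoformat()
            out.write(json.dumps(ev) + "\n")


if __name__ == "__main__" and len(sys.argv) == 4:
    build_timeline(sys.argv[1], sys.argv[2], sys.argv[3])
```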
This script is intentionally conservative: it normalizes and orders evidence. Your deterministic correlation step should then add explicit join logic based on your environment (node IP ↔ sourceIP, serviceAccount ↔ role ARN, and session ↔ API calls).
Gaps checks that prevent false conclusions
In addition to the timeline, produce explicit validation artifacts:
- Audit log continuity: are there missing ranges or dropped events?
- Log retention: is the requested window fully retained (cluster + cloud)?
- Clock drift: are node and control-plane timestamps aligned?
- Identity binding completeness: do you have IRSA/Workload Identity bindings for all suspect serviceAccounts?
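A sketch of what such a validation.json might contain; the field names are illustrative, not a standard:

```json
{
  "window": { "start": "2024-05-01T10:00:00Z", "end": "2024-05-01T14:00:00Z" },
  "audit_log_continuity": { "missing_ranges": [], "dropped_event_count": 0 },
  "retention": { "kube_audit_covers_window": true, "cloudtrail_covers_window": true },
  "clock_drift": { "max_observed_skew_ms": 120, "ntp_synced_nodes": 12, "total_nodes": 12 },
  "identity_bindings": { "suspect_serviceaccounts": 3, "bindings_resolved": 3 }
}
```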
Operationalizing in Skynet: runbook-driven correlation in hours
The Pod → Principal Correlation Runbook
Skynet’s standardized execution focuses on repeatability and precision:
- Deploy ephemeral eBPF sensors and bounded collectors
- Snapshot node and identity inventories
- Pull CloudTrail events for a fixed window (including STS establishment)
- Run deterministic correlation jobs and emit a fixed evidence bundle
- Execute validation checks and sign the output
The key is that every step is scripted, time-bounded, and produces machine-verifiable artifacts.
What “done” looks like
Your final deliverable should answer, without hand-waving:
- Which pod UID (and image digest) initiated the activity
- Which serviceAccount and binding path applied (IRSA/Workload Identity/node role/static keys)
- Which cloud principal/session performed which API actions
- When it happened, with a contiguous timeline and declared gaps
Checklist
- [ ] Define the incident time window (start/end) and document the rationale
- [ ] Deploy an ephemeral eBPF sensor DaemonSet (e.g., Cilium Tetragon or Falco) pinned to an approved version
- [ ] Verify exec/connect events are captured and enriched with pod UID + serviceAccount + image digest
- [ ] Export a node inventory snapshot (node name, providerID, internal IP)
- [ ] Export serviceAccount identity bindings (e.g., IRSA role ARN annotations) for suspect namespaces
- [ ] Pull AWS CloudTrail events for STS establishment (AssumeRole/AssumeRoleWithWebIdentity) for the window
- [ ] Pull AWS CloudTrail events for target API actions for the same window and relevant principals
- [ ] Normalize and time-order evidence into a single timeline file
- [ ] Perform deterministic joins (pod → node → principal/session → API action) and emit a join artifact
- [ ] Run gaps checks (audit continuity, retention coverage, clock drift) and record results
- [ ] Package and hash/sign the evidence bundle for handoff and review
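The final packaging step can be sketched as follows, assuming the bundle layout described above and a configured GPG signing key:

```shell
# Package the bundle, record content hashes, and produce a detached signature.
tar -czf evidence-bundle.tar.gz timeline.jsonl joins.jsonl inventory/ validation.json
sha256sum timeline.jsonl joins.jsonl validation.json evidence-bundle.tar.gz \
  > evidence-bundle.sha256
gpg --armor --detach-sign evidence-bundle.tar.gz
```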
FAQ
What if multiple pods share the same node and egress IP?
Use runtime connect events from the eBPF sensor to attribute outbound connections to a specific pod UID at specific timestamps, then bridge to cloud logs via the node identity/egress context. If you rely on egress IP alone, you can narrow to a node but not prove the initiating workload.
Can I do this without Kubernetes audit logs?
You can still build strong attribution using eBPF runtime provenance plus CloudTrail, but you lose key control-plane context (who created/changed objects, token usage patterns, RBAC decisions). Treat missing audit logs as an explicit gap and document the impact in the validation artifact.
How do I handle time skew between nodes and cloud logs?
Capture node time/clock sync status during collection, then apply a measured offset if required. If you can’t quantify skew, avoid tight timestamp joins and instead join on session identifiers (principalId/sessionName) plus broader time windows, while documenting the reduced certainty.
Article written by Yassine Hadji
Cybersecurity Expert at Skynet Consulting
Citation
© 2026 Skynet Consulting. Please cite the source if you reuse excerpts.
Need help securing your infrastructure?
Discover our managed services and let our experts protect your organization.
Contact Us