The audit log that watched the front door while the side door stayed open

Metadata

Title: The audit log that watched the front door while the side door stayed open
Source of the case: HackerOne report #3021451 (AWS)
AWS service(s): CloudTrail, ElastiCache
Risk archetype: false protection (latent audit blind spot)
One-line hook: Can you prove this multi-region, all-events trail actually captures every ElastiCache call?

0. The challenge (what the reader does first)

Scenario given to the reader:

A team relies on CloudTrail as their audit record for the ElastiCache control plane. The trail is multi-region and logging is on. Its event selectors capture all read/write events and include management events. By every visible measure, coverage is complete. The same API actions can be reached through a second, non-production service endpoint — elasticache.us-east-1.api.aws — in addition to the standard one.

Evidence they're handed (and nothing else):

{
  "service": "elasticache",
  "trail": {"IsMultiRegionTrail": true, "IsLogging": true},
  "event_selectors": [{"ReadWriteType": "All", "IncludeManagementEvents": true}],
  "non_production_endpoint": "elasticache.us-east-1.api.aws",
  "events_logged_via_standard_endpoint": true,
  "events_logged_via_non_production_endpoint": false
}

No AWS credentials. No live account. No scripts.

The questions they must answer from the evidence alone:

1. Reading only the trail and selector fields, does coverage look complete, and is that conclusion safe?
2. The trail looks total, yet one path produces no entries: which endpoint silently bypasses the audit log, and how would anyone notice the gap from this evidence?
   - Which calls are recorded (path A, the standard endpoint)?
   - Which calls vanish (path B, the non-production endpoint)?
3. What single compensating rule would have surfaced ElastiCache activity that the trail never sees?

1. The manual problem

To answer by hand you read the trail config and conclude it is healthy: multi-region, logging on, ReadWriteType: All, management events included. There is nothing in those fields to flag. That is exactly the problem — the evidence that would reveal the gap is not in the trail object at all.

The gap lives in the difference between two facts at the bottom of the snapshot: events_logged_via_standard_endpoint is true, events_logged_via_non_production_endpoint is false. To catch it manually you have to already know that ElastiCache exposes a non-production *.api.aws endpoint, that calls through it are not delivered to CloudTrail, and that an attacker who routes mutations through that endpoint leaves no trace. None of that is derivable from reading the trail settings. You are reasoning about the absence of records that a correctly configured trail simply never receives — the hardest kind of finding, because the evidence is a silence.

2. The reasoning wall (capture, don't invent)

| What they hit | What they said / would say |
| --- | --- |
| The trail config is unambiguously green | "Multi-region, all events, management on. There's nothing to fix here." |
| Coverage is asserted by config, not proven by records | "We assumed 'all events' meant every endpoint. It means every event the trail actually receives." |
| The dangerous path produces nothing to look at | "If someone uses the `api.aws` endpoint, there's no log line, and no log line is exactly what we'd never go looking for." |

The insight the reader should reach on their own:

A complete-looking trail config is not proof of complete coverage; an endpoint the trail never sees is a blind spot no setting reveals.

3. Why scanners miss or flatten it

A per-setting scanner reads the trail and reports green: trail enabled, multi-region, all read/write events, management events included. Every box it checks is genuinely checked. What it cannot do is reason that "all events" covers only the events actually delivered to the trail, and that ElastiCache's non-production endpoint routes around that delivery. The scanner evaluates the configuration of the trail; the vulnerability is the existence of a service endpoint whose traffic never reaches the trail. There is no setting that says "this endpoint is uncovered," so a node-by-node check has nothing to flag. It marks the audit log as complete precisely in the case where the audit log has a hole.
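To make the miss concrete, here is a minimal sketch of a per-setting check run against the snapshot from section 0. The function name `per_setting_scan` and the check labels are hypothetical and do not correspond to any particular scanner; the point is only that every check reads the trail and selector fields, while the field that carries the gap is never consulted.

```python
import json

# The evidence snapshot from section 0, unchanged.
SNAPSHOT = json.loads("""
{
  "service": "elasticache",
  "trail": {"IsMultiRegionTrail": true, "IsLogging": true},
  "event_selectors": [{"ReadWriteType": "All", "IncludeManagementEvents": true}],
  "non_production_endpoint": "elasticache.us-east-1.api.aws",
  "events_logged_via_standard_endpoint": true,
  "events_logged_via_non_production_endpoint": false
}
""")


def per_setting_scan(snapshot):
    """Hypothetical per-setting scanner: inspects trail and selector fields only."""
    trail = snapshot["trail"]
    selectors = snapshot["event_selectors"]
    return {
        "trail_logging": trail["IsLogging"],
        "multi_region": trail["IsMultiRegionTrail"],
        "all_read_write": any(s["ReadWriteType"] == "All" for s in selectors),
        "management_events": any(s["IncludeManagementEvents"] for s in selectors),
    }


checks = per_setting_scan(SNAPSHOT)
scanner_green = all(checks.values())

# The scanner never reads this field, so its verdict cannot depend on it.
gap_present = not SNAPSHOT["events_logged_via_non_production_endpoint"]

print("scanner verdict:", "PASS" if scanner_green else "FAIL")
print("uncovered endpoint present:", gap_present)
```

Both values come out true at once: the scan passes and the gap exists, which is the contradiction the rest of the case resolves.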

Pivot point. Everything above is the gap. Everything below is Stave filling it. The reader has now done the work and hit the wall. Only now does the tool appear.

4. The evidence Stave consumes

The same static observation snapshot the reader had: the trail state, the event selectors, the declared non-production endpoint, and the two delivery facts (standard endpoint logged, non-production endpoint not logged).

{
  "service": "elasticache",
  "trail": {"IsMultiRegionTrail": true, "IsLogging": true},
  "event_selectors": [{"ReadWriteType": "All", "IncludeManagementEvents": true}],
  "non_production_endpoint": "elasticache.us-east-1.api.aws",
  "events_logged_via_standard_endpoint": true,
  "events_logged_via_non_production_endpoint": false
}

No new privileges, no live cloud call. The trail config is normalized alongside the per-endpoint delivery facts, so configuration completeness and actual coverage are evaluated as separate properties.
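One way to picture that separation is the sketch below, which splits the snapshot into a configuration property and a per-endpoint delivery map. The `CoverageFacts` type, the `normalize` function, and the `"standard"` label are illustrative assumptions, not Stave's actual data model.

```python
from dataclasses import dataclass

# The evidence snapshot, written as a Python literal.
SNAPSHOT = {
    "service": "elasticache",
    "trail": {"IsMultiRegionTrail": True, "IsLogging": True},
    "event_selectors": [{"ReadWriteType": "All", "IncludeManagementEvents": True}],
    "non_production_endpoint": "elasticache.us-east-1.api.aws",
    "events_logged_via_standard_endpoint": True,
    "events_logged_via_non_production_endpoint": False,
}


@dataclass(frozen=True)
class CoverageFacts:
    """Hypothetical normalized form: config and delivery kept as separate properties."""
    config_complete: bool       # what the trail settings assert
    delivery_by_endpoint: dict  # what each endpoint was observed to deliver


def normalize(snapshot):
    trail = snapshot["trail"]
    selectors = snapshot["event_selectors"]
    config_complete = bool(
        trail["IsLogging"]
        and trail["IsMultiRegionTrail"]
        and any(
            s["ReadWriteType"] == "All" and s["IncludeManagementEvents"]
            for s in selectors
        )
    )
    delivery = {
        # The snapshot does not name the standard endpoint, so a label is used.
        "standard": snapshot["events_logged_via_standard_endpoint"],
        snapshot["non_production_endpoint"]: snapshot[
            "events_logged_via_non_production_endpoint"
        ],
    }
    return CoverageFacts(config_complete, delivery)


facts = normalize(SNAPSHOT)
print(facts)
```

Keeping the two properties in separate fields means a later check cannot treat a true `config_complete` as an answer to the delivery question.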

5. The reasoning Stave performs

Control / invariant: CTL.CLOUDTRAIL.ENABLED.001 — CloudTrail must be enabled and its coverage must reach every endpoint through which the audited service can be reached.
What it evaluates: Is the trail enabled and logging (path A confirms the standard endpoint is recorded)? And is there any endpoint through which the service is reachable whose events are not delivered to the trail (path B, the non-production endpoint with events_logged_via_non_production_endpoint: false)? The control does not stop at the trail being "on"; it checks coverage against the set of reachable endpoints.
Verdict produced: NON_COMPLIANT — the trail is enabled but a reachable endpoint bypasses it, so coverage is not provably complete. The blind spot is named rather than hidden behind the trail's green config.

control: CTL.CLOUDTRAIL.ENABLED.001
asset:   cloudtrail/elasticache control-plane coverage
evidence: trail enabled (multi-region, all events) but events via elasticache.us-east-1.api.aws are not delivered (events_logged_via_non_production_endpoint = false)
verdict: NON_COMPLIANT
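The evaluation described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than Stave's implementation: the function name `evaluate_coverage`, the shape of the returned record, and the `COMPLIANT` label for the passing case are all hypothetical. Only the control ID and the `NON_COMPLIANT` verdict come from the case.

```python
def evaluate_coverage(snapshot):
    """Sketch: the trail must be logging AND every reachable endpoint must deliver."""
    enabled = snapshot["trail"]["IsLogging"]

    # Collect every endpoint whose events are not delivered to the trail.
    uncovered = []
    if not snapshot["events_logged_via_standard_endpoint"]:
        uncovered.append("standard endpoint")
    if not snapshot["events_logged_via_non_production_endpoint"]:
        uncovered.append(snapshot["non_production_endpoint"])

    # "COMPLIANT" is an assumed label; the case only shows NON_COMPLIANT.
    verdict = "COMPLIANT" if enabled and not uncovered else "NON_COMPLIANT"
    return {
        "control": "CTL.CLOUDTRAIL.ENABLED.001",
        "verdict": verdict,
        "uncovered_endpoints": uncovered,
    }


SNAPSHOT = {
    "service": "elasticache",
    "trail": {"IsMultiRegionTrail": True, "IsLogging": True},
    "event_selectors": [{"ReadWriteType": "All", "IncludeManagementEvents": True}],
    "non_production_endpoint": "elasticache.us-east-1.api.aws",
    "events_logged_via_standard_endpoint": True,
    "events_logged_via_non_production_endpoint": False,
}

result = evaluate_coverage(SNAPSHOT)
print(result)

# Counterfactual: if the non-production endpoint did deliver, the verdict flips.
covered = dict(SNAPSHOT, events_logged_via_non_production_endpoint=True)
covered_result = evaluate_coverage(covered)
print(covered_result)
```

The verdict depends on the list of uncovered endpoints, not on the selector fields, so a fully green trail config cannot by itself produce a passing result.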

6. The prevention artifact Stave produces

The trail itself is correctly configured, so the artifact is not a trail change — it is a compensating detective control that observes the path CloudTrail cannot see.

Artifact: A VPC Flow Logs / DNS-resolution monitoring rule that alerts on traffic to the non-production *.api.aws ElastiCache endpoint.
What it forecloses: The latent state from question 2 — ElastiCache activity routed through an endpoint that produces no CloudTrail entries. Calls the trail never sees now produce a network-layer signal instead.

# Compensating detective control: network-layer visibility for the uncovered endpoint
detect:
  source: vpc_flow_logs + route53_resolver_query_logs
  match:
    - dns_query_name: "elasticache.us-east-1.api.aws"     # non-production endpoint
    - destination matches resolved IPs for *.api.aws ElastiCache endpoint
  on_match:
    severity: high
    alert: "ElastiCache call via non-production endpoint not covered by CloudTrail"
    correlate_with: cloudtrail(elasticache)   # flag activity present here but absent in the trail
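The DNS half of the rule above could be prototyped along these lines. The sketch assumes query logs arrive as JSON lines with `query_name`, `srcaddr`, and `query_timestamp` fields, as Route 53 Resolver query logs are generally described, and that `query_name` may carry a trailing dot. The function names and the sample records are made up for illustration.

```python
import json

UNCOVERED_ENDPOINT = "elasticache.us-east-1.api.aws"


def normalize_name(name):
    """Lowercase and drop the trailing dot that fully qualified names carry."""
    return name.rstrip(".").lower()


def find_uncovered_queries(log_lines, endpoint=UNCOVERED_ENDPOINT):
    """Return one alert per DNS query that resolves the uncovered endpoint."""
    target = normalize_name(endpoint)
    alerts = []
    for line in log_lines:
        record = json.loads(line)
        if normalize_name(record.get("query_name", "")) == target:
            alerts.append({
                "severity": "high",
                "alert": "ElastiCache call via non-production endpoint "
                         "not covered by CloudTrail",
                "srcaddr": record.get("srcaddr"),
                "query_timestamp": record.get("query_timestamp"),
            })
    return alerts


# Made-up sample records, for illustration only.
SAMPLE_LINES = [
    '{"query_name": "elasticache.us-east-1.amazonaws.com.", '
    '"srcaddr": "10.0.1.15", "query_timestamp": "2025-01-01T00:00:00Z"}',
    '{"query_name": "elasticache.us-east-1.api.aws.", '
    '"srcaddr": "10.0.1.23", "query_timestamp": "2025-01-01T00:00:05Z"}',
]

alerts = find_uncovered_queries(SAMPLE_LINES)
for alert in alerts:
    print(alert)
```

A DNS lookup shows that something resolved the endpoint, not that an API call succeeded, so a match is a prompt to correlate against the trail rather than proof of a mutation.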

7. What the team no longer does manually

| Before | After Stave |
| --- | --- |
| Read the trail config, see all-green, and call audit coverage complete | One control separates "trail enabled" from "coverage reaches every endpoint" |
| Have no way to notice a path that produces no log lines | The uncovered non-production endpoint is named as a blind spot, not a silence |
| Trust the audit log as a complete record by configuration alone | A compensating network-layer control surfaces activity the trail never receives |

Positioning line for this case

Stave proves that this correctly configured trail still has a hole — ElastiCache calls through a non-production endpoint that generate no audit entry — names the endpoint that bypasses logging, and emits the network-layer detective control that makes the silent path visible.

Reuse checklist

A reader could attempt section 0 with zero Stave knowledge
Stave is not named or shown before the pivot point
Section 2 quotes are real (or honestly plausible), not slogans
Section 3 names the specific thing per-setting tools can't see
Section 6 closes the exact latent state raised in section 0, question 2
The title names the failure, not the product
