The two side doors the audit log was never told about

Metadata

Title: The two side doors the audit log was never told about
Source of the case: HackerOne report #3022516 (AWS)
AWS service(s): CloudTrail, Amazon Forecast
Risk archetype: false protection (vendor-side audit blind spot)
One-line hook: Can you prove this all-events trail records every way the Forecast API can be called?

0. The challenge (what the reader does first)

Scenario given to the reader:

A team treats CloudTrail as the system of record for Amazon Forecast API activity. The trail is multi-region and logging. Its event selectors capture all read/write events and include management events — a textbook complete configuration. Forecast, however, can also be reached through two non-production endpoints (forecast.us-east-1.api.aws and forecast-fips.us-east-1.api.aws), and calls through those are not delivered to the trail.

Evidence they're handed (and nothing else):

{
  "service": "forecast",
  "trail": {"IsMultiRegionTrail": true, "IsLogging": true},
  "event_selectors": [{"ReadWriteType": "All", "IncludeManagementEvents": true}],
  "non_production_endpoints": ["forecast.us-east-1.api.aws", "forecast-fips.us-east-1.api.aws"],
  "events_logged_via_non_production_endpoints": false
}

No AWS credentials. No live account. No scripts.

The questions they must answer from the evidence alone:

Reading only the trail and selector fields, does coverage look complete — and can that conclusion be trusted?
The trail config is complete, yet some paths emit nothing. How many endpoints silently bypass the audit log, and how would anyone detect the gap from this evidence?
Is the customer's trail misconfigured, or is the gap on the service side?
Which calls vanish from the record, and through which endpoints are they routed?
What single compensating rule would surface Forecast activity that the trail never receives?

1. The manual problem

To answer by hand you read the trail config and find nothing wrong: multi-region, logging on, all read/write events, management events included. The fields you would normally inspect are all healthy, which is precisely why this is hard. The flaw is not a mistake the customer made; it is a property of the Forecast service.

The only hint is the last line of the snapshot: events_logged_via_non_production_endpoints: false, against a list of two endpoints. To turn that into a finding manually you must know that Forecast exposes *.api.aws and *-fips.api.aws endpoints, that AWS's own CloudTrail integration for Forecast does not deliver events from those endpoints, and that no amount of customer-side configuration closes the gap. You are reasoning about records that will never exist no matter how the trail is set up — and about whose fault it is, because the natural instinct ("fix the trail") is wrong here.

2. The reasoning wall (capture, don't invent)

What they hit	What they said / would say
Trail config is green and the gap isn't theirs to fix	"There's nothing to change in our config. The trail is right. So where's the hole?"
Two endpoints, not one, are uncovered	"It's not a single edge case — there are two endpoints, including the FIPS one, and neither shows up."
The audit record has a blind spot regardless of effort	"We can't configure our way out of this. We need to watch the network, not the trail."

The insight the reader should reach on their own:

A perfectly configured trail can still be incomplete when the service itself never delivers some of its own calls — and you can only close that gap from outside the trail.

3. Why scanners miss or flatten it

A per-setting scanner reads the trail and reports green: enabled, multi-region, all events, management events included. Each checked box is correct, and the scanner has no reason to look further — there is nothing to remediate in the configuration. What it cannot reason about is that Amazon Forecast exposes two non-production endpoints whose traffic AWS does not feed into CloudTrail, so "all events" silently excludes them. The scanner evaluates the trail; the vulnerability lives in the service's own integration, on the vendor side, with two endpoints (including the FIPS variant) producing no records. There is no setting that encodes "these endpoints are uncovered," so a node-by-node tool reports a complete audit log at exactly the moment two side doors are unwatched.
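To make the blind spot concrete, here is a minimal sketch of the kind of per-setting check described above. It is hypothetical: the function name `per_setting_check` is an illustration, and the snapshot shape is simply the evidence from section 0. It inspects only the trail and selector fields, so it passes the Forecast snapshot even though two endpoints are unlogged.

```python
import json

# The evidence snapshot from section 0, verbatim.
SNAPSHOT = json.loads("""
{
  "service": "forecast",
  "trail": {"IsMultiRegionTrail": true, "IsLogging": true},
  "event_selectors": [{"ReadWriteType": "All", "IncludeManagementEvents": true}],
  "non_production_endpoints": ["forecast.us-east-1.api.aws", "forecast-fips.us-east-1.api.aws"],
  "events_logged_via_non_production_endpoints": false
}
""")


def per_setting_check(snapshot: dict) -> str:
    """Hypothetical per-setting scanner: each trail field is checked in isolation."""
    trail = snapshot["trail"]
    selectors = snapshot["event_selectors"]
    checks = [
        trail.get("IsLogging") is True,
        trail.get("IsMultiRegionTrail") is True,
        any(s.get("ReadWriteType") == "All" for s in selectors),
        any(s.get("IncludeManagementEvents") is True for s in selectors),
    ]
    # The endpoint fields are never read, so nothing here can fail on them.
    return "PASS" if all(checks) else "FAIL"


print(per_setting_check(SNAPSHOT))  # PASS
```

Flipping `events_logged_via_non_production_endpoints` to `true` leaves the result unchanged, which is exactly the problem: the check has no input that encodes the uncovered endpoints.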

Pivot point. Everything above is the gap. Everything below is Stave filling it. The reader has now done the work and hit the wall. Only now does the tool appear.

4. The evidence Stave consumes

The same static observation snapshot the reader had: the trail state, the event selectors, the two declared non-production endpoints, and the delivery fact that events through those endpoints are not logged.

{
  "service": "forecast",
  "trail": {"IsMultiRegionTrail": true, "IsLogging": true},
  "event_selectors": [{"ReadWriteType": "All", "IncludeManagementEvents": true}],
  "non_production_endpoints": ["forecast.us-east-1.api.aws", "forecast-fips.us-east-1.api.aws"],
  "events_logged_via_non_production_endpoints": false
}

No new privileges, no live cloud call. The trail config is normalized alongside the set of reachable endpoints, so a correct configuration and incomplete coverage are evaluated as distinct properties.
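The normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Stave's code: the `CoverageFacts` type and `normalize` function are hypothetical, and the field names are taken from the evidence snapshot. The point is that configuration completeness and endpoint coverage come out as two separate values, so one cannot mask the other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoverageFacts:
    """Hypothetical normalized view: two properties that are judged separately."""
    config_complete: bool                 # customer-controlled trail settings
    uncovered_endpoints: tuple[str, ...]  # endpoints whose events never reach the trail


def normalize(snapshot: dict) -> CoverageFacts:
    trail = snapshot.get("trail", {})
    selectors = snapshot.get("event_selectors", [])
    config_complete = (
        trail.get("IsLogging") is True
        and trail.get("IsMultiRegionTrail") is True
        and any(
            s.get("ReadWriteType") == "All" and s.get("IncludeManagementEvents") is True
            for s in selectors
        )
    )
    endpoints = tuple(snapshot.get("non_production_endpoints", []))
    # Assumed policy: a missing delivery fact counts as "not delivered",
    # so absent evidence never reads as coverage.
    delivered = snapshot.get("events_logged_via_non_production_endpoints", False) is True
    return CoverageFacts(
        config_complete=config_complete,
        uncovered_endpoints=() if delivered else endpoints,
    )


snapshot = {
    "service": "forecast",
    "trail": {"IsMultiRegionTrail": True, "IsLogging": True},
    "event_selectors": [{"ReadWriteType": "All", "IncludeManagementEvents": True}],
    "non_production_endpoints": [
        "forecast.us-east-1.api.aws",
        "forecast-fips.us-east-1.api.aws",
    ],
    "events_logged_via_non_production_endpoints": False,
}
facts = normalize(snapshot)
print(facts.config_complete, len(facts.uncovered_endpoints))  # True 2
```

Treating a missing delivery fact as "not delivered" is a deliberate, assumed design choice: it errs toward reporting a gap rather than silently assuming coverage.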

5. The reasoning Stave performs

Control / invariant: CTL.CLOUDTRAIL.ENABLED.001 — CloudTrail must be enabled and its coverage must reach every endpoint through which the audited service can be reached.
What it evaluates: Is the trail enabled and logging with full selectors? And is there any reachable endpoint whose events are not delivered to the trail? Here two endpoints (forecast.us-east-1.api.aws, forecast-fips.us-east-1.api.aws) report events_logged_via_non_production_endpoints: false. The control attributes the gap correctly: the trail is compliant by configuration, but coverage is not provably complete because the service does not deliver those endpoints' events.
Verdict produced: NON_COMPLIANT — the trail is correctly configured, yet two reachable endpoints bypass it, so the audit record has a vendor-side blind spot. The finding names the gap as service-side rather than a customer misconfiguration.

control: CTL.CLOUDTRAIL.ENABLED.001
asset:   cloudtrail/forecast control-plane coverage
evidence: trail enabled (multi-region, all events) but 2 endpoints uncovered (forecast.us-east-1.api.aws, forecast-fips.us-east-1.api.aws); events_logged_via_non_production_endpoints = false
verdict: NON_COMPLIANT
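The decision in this section can be sketched as a small function over the two normalized facts. This is a hypothetical illustration of the reasoning, not Stave's implementation: the function name `evaluate` and the cause labels are assumptions, while the control ID and asset string are copied from the finding above.

```python
def evaluate(config_complete: bool, uncovered: list[str]) -> dict:
    """Hypothetical evaluator for CTL.CLOUDTRAIL.ENABLED.001.

    Configuration and coverage are judged separately, and each failure
    carries its own attribution, so a green config cannot hide a gap.
    """
    causes = []
    if not config_complete:
        causes.append("customer_misconfiguration")
    if uncovered:
        causes.append("service_side_delivery_gap")
    return {
        "control": "CTL.CLOUDTRAIL.ENABLED.001",
        "asset": "cloudtrail/forecast control-plane coverage",
        "verdict": "NON_COMPLIANT" if causes else "COMPLIANT",
        "causes": causes,
        "uncovered_endpoints": list(uncovered),
    }


finding = evaluate(
    config_complete=True,
    uncovered=["forecast.us-east-1.api.aws", "forecast-fips.us-east-1.api.aws"],
)
print(finding["verdict"], ", ".join(finding["causes"]))
# NON_COMPLIANT service_side_delivery_gap
```

Because causes are collected rather than short-circuited, a trail that is both misconfigured and bypassed reports both problems instead of only the one the customer can fix.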

6. The prevention artifact Stave produces

The trail is correctly configured and the gap is on the service side, so the artifact is a compensating detective control that observes the paths CloudTrail cannot.

Artifact: A VPC Flow Logs / DNS-resolution monitoring rule that alerts on traffic to either non-production Forecast endpoint, including the FIPS variant.
What it forecloses: The latent state from question 2, namely Forecast activity routed through the two non-production endpoints that produce no CloudTrail entries. Calls from networks whose flow and DNS logs you collect, which the trail never receives, now raise a network-layer signal.

# Compensating detective control: network-layer visibility for the uncovered endpoints
detect:
  source: vpc_flow_logs + route53_resolver_query_logs   # flow logs record IPs, not hostnames
  match:
    - dns_query_name in:                        # evaluated against resolver query logs
        - "forecast.us-east-1.api.aws"          # non-production endpoint
        - "forecast-fips.us-east-1.api.aws"     # FIPS non-production endpoint
    - destination matches resolved IPs for those *.api.aws Forecast endpoints   # evaluated against flow logs
  on_match:
    severity: high
    alert: "Forecast call via non-production endpoint not covered by CloudTrail"
    correlate_with: cloudtrail(forecast)   # flag activity present here but absent in the trail
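The DNS half of the rule above can be sketched in Python over Route 53 Resolver query log records, which are JSON objects. This is a minimal sketch under stated assumptions: we assume one record per line with `query_name` (a fully qualified name, usually ending in a dot), `srcaddr`, `vpc_id`, and `query_timestamp` fields, and the function name `uncovered_forecast_queries` is hypothetical. It only sees lookups made through resolvers you log, so calls that originate outside your VPCs will not appear. The flow-log IP match is omitted because it depends on how you track the endpoints' resolved addresses.

```python
import json
from typing import Iterable, Iterator

# The two endpoints named in the evidence; lowercase, no trailing dot.
UNCOVERED_ENDPOINTS = frozenset({
    "forecast.us-east-1.api.aws",
    "forecast-fips.us-east-1.api.aws",
})


def normalize_name(name: str) -> str:
    """DNS names are case-insensitive, and logs may include the root dot."""
    return name.rstrip(".").lower()


def uncovered_forecast_queries(lines: Iterable[str]) -> Iterator[dict]:
    """Yield an alert for each logged lookup of an uncovered endpoint.

    Assumes one JSON object per line with a 'query_name' field.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        name = normalize_name(record.get("query_name", ""))
        if name in UNCOVERED_ENDPOINTS:
            yield {
                "severity": "high",
                "alert": "Forecast call via non-production endpoint not covered by CloudTrail",
                "query_name": name,
                "srcaddr": record.get("srcaddr"),
                "vpc_id": record.get("vpc_id"),
                "query_timestamp": record.get("query_timestamp"),
            }


sample = [
    # Standard regional endpoint (assumed name): not an alert.
    '{"query_name": "forecast.us-east-1.amazonaws.com.", "srcaddr": "10.0.1.5"}',
    # FIPS non-production endpoint, mixed case with trailing dot: alert.
    '{"query_name": "FORECAST-FIPS.us-east-1.api.aws.", "srcaddr": "10.0.2.9"}',
]
for alert in uncovered_forecast_queries(sample):
    print(alert["query_name"], alert["srcaddr"])
# forecast-fips.us-east-1.api.aws 10.0.2.9
```

Normalizing case and the trailing dot before comparison keeps the match from silently missing records that differ only in formatting.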

7. What the team no longer does manually

Before	After Stave
Read the trail config, see all-green, and conclude Forecast auditing is complete	One control separates "trail correctly configured" from "coverage reaches every endpoint"
Have no way to attribute a gap that isn't a customer misconfiguration	The two uncovered endpoints are named as a service-side blind spot, not a config error
Trust the audit log as a complete record despite vendor-side delivery gaps	A compensating network-layer control surfaces activity the trail never receives

Positioning line for this case

Stave proves that this correctly configured trail still misses Forecast calls through two non-production endpoints, names the vendor-side blind spot rather than a customer misconfiguration, and emits the network-layer detective control that makes the silent paths visible.

Reuse checklist

A reader could attempt section 0 with zero Stave knowledge
Stave is not named or shown before the pivot point
Section 2 quotes are real (or honestly plausible), not slogans
Section 3 names the specific thing per-setting tools can't see
Section 6 closes the exact latent state raised in section 0, question 2
The title names the failure, not the product
