
Why Stave is a Risk Reasoning Engine

Stave is not a scanner. It is a deterministic risk reasoning engine that transforms individual configuration findings into a structured argument about compound risk.

From observation to inference

Traditional security tools produce lists: "This bucket is public. That key is unrotated. Logging is disabled." Each finding is independent. The auditor must reason about how they combine.

Stave automates that reasoning. It takes three independent facts (public bucket, PHI tag, no CloudTrail) and infers that together they constitute a total safety envelope failure. That inference step is what separates a reasoning engine from a scanner.

Three-layer risk scoring

Layer 1: Environmental
base_impact × asset_sensitivity × exposure_vector
→ "How bad is this finding given what it protects?"

Layer 2: Compound
environmental × chain_escalation × blast_multiplier
→ "How bad is this combination of failures?"

Layer 3: Attack Stage Summary
Map each MITRE stage to its worst severity
→ "Where are the structural gaps in your defenses?"

The attack stage summary maps MITRE ATT&CK-aligned stages to the worst severity observed in that stage:

  • initial_access — public S3, open security groups, public RDS
  • credential_access — MFA failures, key rotation
  • persistence — IAM self-modification, break-glass
  • exfiltration — encryption controls
  • detection_evasion — CloudTrail, GuardDuty, Config
  • resilience — backups, versioning, Object Lock
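Layer 3 can be sketched in a few lines. Each failing finding is assigned to one of the stages listed above, and the summary keeps the worst severity seen per stage. The severity labels and their ordering, the `Finding` shape, and the control IDs are hypothetical and are here only for illustration. They are not Stave's actual data model.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed severity scale, lowest to highest (hypothetical labels).
SEVERITY_ORDER = ["low", "medium", "high", "critical"]

# Stage names taken from the list above.
STAGES = [
    "initial_access",
    "credential_access",
    "persistence",
    "exfiltration",
    "detection_evasion",
    "resilience",
]


@dataclass(frozen=True)
class Finding:
    control_id: str  # hypothetical control identifier
    stage: str       # one of STAGES
    severity: str    # one of SEVERITY_ORDER


def attack_stage_summary(findings: list[Finding]) -> dict[str, Optional[str]]:
    """Map every stage to the worst severity seen in it, or None if it has no findings."""
    summary: dict[str, Optional[str]] = {stage: None for stage in STAGES}
    for f in findings:
        worst = summary[f.stage]
        if worst is None or SEVERITY_ORDER.index(f.severity) > SEVERITY_ORDER.index(worst):
            summary[f.stage] = f.severity
    return summary
```

Keeping only the worst severity per stage is deliberate. A CISO reading the summary needs to know that initial_access has a critical gap, not how many medium findings sit beside it.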

Environmental scoring

A public S3 bucket in a sandbox account is a bug. The same misconfiguration on a PHI bucket in production is a breach path. Environmental scoring captures this distinction:

Asset sensitivity   Multiplier
phi / cde           3.0
production          2.0
internal            1.0
dev / sandbox       0.5

Exposure vector     Multiplier
public_internet     2.0
cross_account       1.5
vpc_internal        1.0
no_network          0.5
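A minimal sketch of the Layer 1 formula follows. It uses the multipliers from the two tables. The numeric scale of `base_impact` is not specified here, so the value 10 used below is an assumption chosen for easy arithmetic.

```python
# Multipliers copied from the tables above.
ASSET_SENSITIVITY = {
    "phi": 3.0, "cde": 3.0,
    "production": 2.0,
    "internal": 1.0,
    "dev": 0.5, "sandbox": 0.5,
}

EXPOSURE_VECTOR = {
    "public_internet": 2.0,
    "cross_account": 1.5,
    "vpc_internal": 1.0,
    "no_network": 0.5,
}


def environmental_score(base_impact: float, sensitivity: str, exposure: str) -> float:
    """Layer 1: base_impact x asset_sensitivity x exposure_vector."""
    return base_impact * ASSET_SENSITIVITY[sensitivity] * EXPOSURE_VECTOR[exposure]
```

With a hypothetical base impact of 10, the same public-internet misconfiguration produces two different scores:

- In a sandbox it scores 10 × 0.5 × 2.0 = 10.0.
- On a PHI bucket it scores 10 × 3.0 × 2.0 = 60.0.

That sixfold gap is the bug-versus-breach-path distinction described above.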

Chain escalation

Scanners treat findings as a list. Reasoning engines treat findings as a graph. Stave's chain definitions model how one failure weakens another:

1 control failing:  1.0x (no escalation)
2 controls failing: 1.8x
3+ controls failing: 2.5x (bounded asymptote)

The escalation is intentionally bounded rather than purely multiplicative. A bucket that is public, unencrypted, and unlogged is catastrophically worse than one that is only public, but each additional failing control adds less marginal risk than the one before.
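The step values above can be expressed as a small lookup. This is a sketch of the listed factors only. How Stave stores or applies them internally is not shown in this document.

```python
def chain_escalation(failing_controls: int) -> float:
    """Escalation factor for a chain, capped at 2.5x for three or more failures."""
    if failing_controls < 1:
        raise ValueError("a chain needs at least one failing control")
    if failing_controls == 1:
        return 1.0  # no escalation
    if failing_controls == 2:
        return 1.8
    return 2.5  # bounded: a fourth or fifth failure adds nothing more
```

A purely multiplicative rule such as `1.8 ** (n - 1)` would already give about 5.8x for four failures and would keep growing. The cap keeps the score at 2.5x.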

Blast radius multiplier

Some controls, when disabled, make all other violations invisible. CloudTrail being disabled is a medium finding on its own. But it multiplies the risk of every other finding because there is no evidence trail for investigation.

Detection controls (CloudTrail, GuardDuty): 2.5x blast multiplier
Prevention controls (PAB, MFA): 1.0x (default)
Recovery controls (backups, versioning): 1.0x (default)
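Layer 2 can then be sketched by combining three factors: the environmental score, the chain escalation factor, and a blast multiplier. The category names are hypothetical. So is the rule that a chain takes the largest blast multiplier among its failing controls. The document only gives the per-category values.

```python
# Per-category blast multipliers from the list above.
BLAST_MULTIPLIER = {
    "detection": 2.5,   # CloudTrail, GuardDuty
    "prevention": 1.0,  # PAB, MFA
    "recovery": 1.0,    # backups, versioning
}


def blast_multiplier(failing_categories: list[str]) -> float:
    """Largest blast multiplier among the failing controls (assumed rule); 1.0 by default."""
    return max((BLAST_MULTIPLIER.get(c, 1.0) for c in failing_categories), default=1.0)


def compound_score(environmental: float, escalation: float, failing_categories: list[str]) -> float:
    """Layer 2: environmental x chain_escalation x blast_multiplier."""
    return environmental * escalation * blast_multiplier(failing_categories)
```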

Exposure ranking: finding the silent killers

Beyond scoring individual findings and chains, the reasoning engine ranks every finding by exposure to answer the question auditors actually ask: what do I fix first? The ranking surfaces long-lived, high-impact failures that have persisted undetected — the silent killers.

Exposure score combines the base score with how long the failure has been live, its blast radius, and whether it is reachable from outside:

ExposureScore = BaseScore × DurationFactor × BlastMultiplier × ExposureMultiplier

Duration is the differentiator. A public bucket discovered today is a problem; a public bucket that has been public for four years is a breach that already happened. The duration factor steps up sharply with age:

Duration                   Factor   Label
< 30 days                  1.0      Recent
30-89 days                 1.5      Aging
90-364 days                2.0      Stale
365-1642 days              3.0      Long-lived
1643+ days (4.5 years)     5.0      Silent killer
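The duration table is a step function and can be sketched as follows. Two assumptions are made here. First, `days_live` is measured in whole days. Second, `exposure_multiplier` is the exposure-vector multiplier from the environmental tables; the formula names it but does not define its values.

```python
def duration_factor(days_live: int) -> tuple[float, str]:
    """Step function from the duration table, returning (factor, label)."""
    if days_live < 0:
        raise ValueError("days_live must be non-negative")
    if days_live < 30:
        return 1.0, "Recent"
    if days_live < 90:
        return 1.5, "Aging"
    if days_live < 365:
        return 2.0, "Stale"
    if days_live < 1643:
        return 3.0, "Long-lived"
    return 5.0, "Silent killer"


def exposure_score(base_score: float, days_live: int,
                   blast_multiplier: float, exposure_multiplier: float) -> float:
    """ExposureScore = BaseScore x DurationFactor x BlastMultiplier x ExposureMultiplier."""
    factor, _label = duration_factor(days_live)
    return base_score * factor * blast_multiplier * exposure_multiplier
```

Take a hypothetical finding with base score 10 that is reachable from the public internet and has been live for 1,500 days. With the default blast multiplier of 1.0, it scores 10 × 3.0 × 1.0 × 2.0 = 60.0.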

Findings that have persisted undetected for 1,643 days (about 4.5 years) or more are labeled silent killers. The ranking is deterministic: the same inputs always produce the same order, with ties broken by control ID, then asset ID.
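Deterministic ordering with that tie-break is a single sort on a composite key. The `RankedFinding` type and the sample IDs below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankedFinding:
    control_id: str
    asset_id: str
    exposure_score: float


def rank_by_exposure(findings: list[RankedFinding]) -> list[RankedFinding]:
    """Highest exposure first; ties broken by control ID, then asset ID."""
    return sorted(findings, key=lambda f: (-f.exposure_score, f.control_id, f.asset_id))
```

The key is total over the three fields, so it fully determines the result. Any permutation of the same findings therefore produces the same ranking, which is what makes the "fix first" list reproducible between runs.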

Safety chains

Chains are the inference rules of the reasoning engine. They define which controls form a compound risk when co-failing:

Chain                 Controls                                   Threshold  What it means
public_phi_exposure   PUBLIC + ENCRYPT + LOG + CLOUDTRAIL        2          PHI exposed without protection
root_compromise_path  ROOT.MFA + ROOT.ACCESSKEY + POLICY.ADMIN   2          Root account lacks defenses
detection_blindness   CLOUDTRAIL + GUARDDUTY + CONFIG + FLOWLOG  2          Monitoring disabled, all violations invisible

Chains live in chains/*.yaml — auditable, version-controlled, and extensible by users.
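To show how a chain acts as an inference rule, the sketch below evaluates the three chains from the table against a set of failing controls. It makes three assumptions:

- The chains are written as Python objects rather than parsed from `chains/*.yaml`, to stay dependency-free.
- `threshold` means the minimum number of member controls that must fail together.
- The control IDs are the shortened forms shown in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chain:
    chain_id: str
    controls: tuple[str, ...]
    threshold: int  # assumed: minimum number of co-failing member controls


# Chains from the table above.
CHAINS = [
    Chain("public_phi_exposure", ("PUBLIC", "ENCRYPT", "LOG", "CLOUDTRAIL"), 2),
    Chain("root_compromise_path", ("ROOT.MFA", "ROOT.ACCESSKEY", "POLICY.ADMIN"), 2),
    Chain("detection_blindness", ("CLOUDTRAIL", "GUARDDUTY", "CONFIG", "FLOWLOG"), 2),
]


def evaluate_chains(chains: list[Chain], failing: set[str]) -> dict[str, list[str]]:
    """Return each chain that fires, mapped to its failing member controls."""
    fired: dict[str, list[str]] = {}
    for chain in chains:
        members = [c for c in chain.controls if c in failing]
        if len(members) >= chain.threshold:
            fired[chain.chain_id] = members
    return fired
```

CLOUDTRAIL is a member of two chains. A single disabled trail can therefore contribute to both public_phi_exposure and detection_blindness.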

Deterministic reasoning vs probabilistic guessing

This is Stave's sharpest competitive edge:

                    AI-powered tools                   Stave
Method              Probabilistic model (black box)    Deterministic logic (transparent)
Score derivation    "The model says high risk"         "PHI × public × no CloudTrail = 150.0"
Auditability        Trust the algorithm                Read the invariants
Reproducibility     Varies between runs                Identical for same input

When an auditor asks "How did the tool arrive at this Critical score?", Stave provides the logic trace of the reasoning chain. The score is not an opaque model output; it is a logical conclusion drawn from a set of invariants.

Responsibility boundaries

Assessor   → "Did this control pass or fail?" (observation)
RiskEngine → "What does this pattern of failures mean?" (inference)
Reporter   → "How do we explain this reasoning?" (attestation)

The Logic Trace is what makes the reasoning auditable. It's a first-class data structure, not a log string.
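A logic trace as a first-class structure might look roughly like the sketch below. Each step records which rule fired, the factor it contributed, and the running score. The class and field names are hypothetical. The base value of 10 is also an assumption: it is one plausible reading of the "PHI × public × no CloudTrail = 150.0" row in the comparison table, which does not state its base.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TraceStep:
    rule: str       # the invariant or multiplier applied (hypothetical naming)
    factor: float   # multiplier contributed by this step
    running: float  # score after applying this step


@dataclass
class LogicTrace:
    base: float
    steps: list[TraceStep] = field(default_factory=list)

    @property
    def score(self) -> float:
        return self.steps[-1].running if self.steps else self.base

    def apply(self, rule: str, factor: float) -> "LogicTrace":
        """Record one reasoning step and return self so steps can be chained."""
        self.steps.append(TraceStep(rule, factor, self.score * factor))
        return self

    def explain(self) -> str:
        """Render the trace line by line for a reporter or auditor."""
        lines = [f"base: {self.base}"]
        lines += [f"{s.rule}: x{s.factor} -> {s.running}" for s in self.steps]
        return "\n".join(lines)


trace = (LogicTrace(10.0)
         .apply("asset_sensitivity=phi", 3.0)
         .apply("exposure_vector=public_internet", 2.0)
         .apply("blast_multiplier=detection", 2.5))
```

The steps are data rather than a formatted log line. A Reporter can therefore render them in any format, and an auditor can recompute every intermediate value.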

Time is the second axis scanners miss

Compound reasoning is one half of what separates Stave from a scanner; time is the other. A scanner sees existence ("this bucket is public"). Stave sees persistence ("this bucket has been public for 9 days, past your 7-day bound").

Most breaches are not caused by a misconfiguration existing — they are caused by it existing long enough. A public bucket for three hours is often fine; for ninety days it is a breach. A leaked admin token rotated in five minutes costs nothing; left for weeks it becomes attacker persistence. Stave encodes this as a duration bound: an unsafe_duration control fires only when the unsafe state has persisted past the threshold, which filters transient noise from deployments, testing, and in-flight migrations.
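The duration bound reduces to a single comparison. The function name and signature below are hypothetical. The sketch reads "past the threshold" as strictly greater than the bound, which is an assumption.

```python
from datetime import datetime, timedelta, timezone


def unsafe_duration_fires(first_unsafe_at: datetime, now: datetime,
                          threshold_hours: float) -> bool:
    """Fire only once the unsafe state has persisted past the threshold."""
    if now < first_unsafe_at:
        raise ValueError("now precedes first_unsafe_at")
    return now - first_unsafe_at > timedelta(hours=threshold_hours)


t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
# A three-hour exposure stays quiet under a 7-day (168-hour) bound...
quiet = unsafe_duration_fires(t0, t0 + timedelta(hours=3), 168)
# ...while the same bucket still public after 9 days fires.
loud = unsafe_duration_fires(t0, t0 + timedelta(days=9), 168)
```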

Because evaluation runs over a series of immutable, timestamped snapshots rather than a single live query, the historical record itself becomes a capability scanners cannot offer:

Use case                    Scanner (no history)   Stave (with history)
Prove SLA compliance        No                     "Fixed on 2026-01-08, within the 7-day SLA"
Grace periods for deploys   No                     Tracks but does not alert until threshold
Regression detection        No                     Safe Jan 5–Feb 10, then unsafe again
Trend / posture reporting   No                     15 → 12 → 8 → 5 unsafe resources over time
Root-cause correlation      No                     first_unsafe_at lines up with a commit/deploy
Tamper-evident audit trail  No                     Immutable per-timestamp JSON snapshots
"Was it ever safe?"         No                     Grep the snapshot series for the property
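The trend and regression rows can be approximated from a snapshot series. This sketch assumes each snapshot can be reduced to a date plus the set of resource IDs that were unsafe at that time. Stave's actual snapshots are per-timestamp JSON files whose schema is not shown here, so the function names and data shape are hypothetical.

```python
from datetime import date


def posture_trend(snapshots: dict[date, set[str]]) -> list[int]:
    """Count of unsafe resources per snapshot, oldest first."""
    return [len(snapshots[d]) for d in sorted(snapshots)]


def regressions(snapshots: dict[date, set[str]]) -> dict[str, list[date]]:
    """Dates on which a previously unsafe resource became unsafe again after being safe."""
    seen_unsafe: set[str] = set()
    previous: set[str] = set()
    result: dict[str, list[date]] = {}
    for d in sorted(snapshots):
        current = snapshots[d]
        # Resources that are unsafe now but were safe (or absent) in the prior snapshot.
        for resource in sorted(current - previous):
            if resource in seen_unsafe:
                result.setdefault(resource, []).append(d)
        seen_unsafe |= current
        previous = current
    return result
```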

This is why the evidence on a finding is temporal, not point-in-time: each finding carries first_unsafe_at, last_seen_unsafe_at, unsafe_duration_hours, and the threshold_hours it crossed — auditable proof of when and for how long, not just that a problem exists.
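The four evidence fields can be derived from a time-ordered series of observations. The sketch makes three assumptions:

- Each observation is a `(captured_at, is_unsafe)` pair, oldest first.
- Only the current unsafe streak counts.
- The duration runs from the first to the last unsafe snapshot, so the true unsafe window may be somewhat longer than what sampling observed.

The function name is hypothetical. The dictionary keys are the field names listed above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def temporal_evidence(series: list[tuple[datetime, bool]],
                      threshold_hours: float) -> Optional[dict]:
    """Evidence for the current unsafe streak, or None if the latest snapshot is safe."""
    if not series or not series[-1][1]:
        return None
    # Walk backwards to the first snapshot of the current unsafe streak.
    start = len(series) - 1
    while start > 0 and series[start - 1][1]:
        start -= 1
    first_unsafe_at = series[start][0]
    last_seen_unsafe_at = series[-1][0]
    hours = (last_seen_unsafe_at - first_unsafe_at) / timedelta(hours=1)
    return {
        "first_unsafe_at": first_unsafe_at.isoformat(),
        "last_seen_unsafe_at": last_seen_unsafe_at.isoformat(),
        "unsafe_duration_hours": hours,
        "threshold_hours": threshold_hours,
    }
```

Because the streak restarts after any safe snapshot, a bucket that was briefly fixed and then reopened reports its new `first_unsafe_at`. This matches the regression view in the table rather than hiding the earlier fix.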

What this means for users

  • Security engineers: Define safety chains for your environment. A chain is a security expert's mental model encoded as YAML.
  • Compliance teams: The output is a structured argument, not a checklist. Sign off on a proof, not a score. History lets you prove remediation happened within SLA instead of asserting it.
  • CISOs: The attack stage summary tells you where your defenses have structural gaps — in language aligned to MITRE ATT&CK. Posture trends across snapshots show whether security debt is shrinking.