Alerts

Rules, thresholds, escalation, silencing, cooldown.

A rule fires when its expression is true for count consecutive ticks. On fire, nanook dispatches an alert event to the rule's channel. If escalate.after is set and the rule keeps firing past that delay, an escalation event goes out too.

Anatomy

[[alerts]]
name     = "hot_cpu"           # optional id, defaults to expr; used by `firing(...)`
expr     = "cpu.usage > 90%"   # required, nanook-expr
count    = 3                   # consecutive ticks before firing (default 1)
channel  = "ops"               # required, channel id
action   = "log"               # log | webhook | discord | slack | exec (default log)
target   = ""                  # action-specific target (URL, command path, ...)
cooldown = "5m"                # min gap between consecutive fires
escalate = { after = "10m", action = "exec", target = "/usr/local/bin/page" }

How a rule evaluates

  1. The engine indexes every selector in every rule.
  2. On each tick, rules whose selectors received new data are evaluated.
  3. If expr is true, the rule's consecutive counter increments.
  4. When the counter reaches count, the rule fires. A single false tick resets the counter.

count is a sensitivity dial: 1 is twitchy, 12 is patient. Missing or zero count is treated as 1.
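
The counter behaviour in the steps above can be sketched in a few lines of Python. This is an illustration only, not nanook's implementation: RuleState and tick are hypothetical names, and the sketch assumes the rule counts as firing on every tick where the streak is at or past count.

```python
from dataclasses import dataclass


@dataclass
class RuleState:
    """Per-rule streak tracking (hypothetical; not nanook's real types)."""
    count: int = 1   # configured threshold from the rule
    streak: int = 0  # consecutive true ticks seen so far

    def tick(self, expr_is_true: bool) -> bool:
        """Feed one evaluation; return True while the streak is at or past count."""
        if not expr_is_true:
            self.streak = 0  # a single false tick resets the counter
            return False
        self.streak += 1
        # A missing or zero count is treated as 1.
        threshold = self.count if self.count > 0 else 1
        return self.streak >= threshold


rule = RuleState(count=3)
results = [rule.tick(v) for v in [True, True, False, True, True, True]]
print(results)  # [False, False, False, False, False, True]
```

The false tick in the third position throws away the first two true ticks, so the rule only reaches its threshold on the sixth evaluation.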

Cooldown

cooldown is the minimum gap between consecutive fires of the same rule. Without it, a rule whose predicate stays true fires on every tick.

[[alerts]]
expr     = "disk.usage > 95%"
cooldown = "30m"
channel  = "ops"

After a fire, the rule stays muted for 30 minutes even if the predicate is still true. The counter keeps ticking, but dispatch is suppressed.
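
A minimal sketch of that gating, assuming dispatch is allowed again once exactly cooldown has elapsed since the last dispatched fire (the boundary behaviour is an assumption). CooldownGate and should_dispatch are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


class CooldownGate:
    """Suppresses dispatch until `cooldown` has elapsed (hypothetical helper)."""

    def __init__(self, cooldown: timedelta) -> None:
        self.cooldown = cooldown
        self.last_dispatch: Optional[datetime] = None

    def should_dispatch(self, now: datetime) -> bool:
        """Call on each tick where the rule is firing."""
        if self.last_dispatch is not None and now - self.last_dispatch < self.cooldown:
            return False  # still inside the cooldown window: muted
        self.last_dispatch = now
        return True


gate = CooldownGate(timedelta(minutes=30))
start = datetime(2024, 1, 1, 12, 0, 0)
decisions = [gate.should_dispatch(start + timedelta(minutes=m)) for m in (0, 10, 29, 30, 45)]
print(decisions)  # [True, False, False, True, False]
```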

Escalation

If a rule stays firing for escalate.after past its first fire, the engine dispatches an escalate event on the same channel, carrying escalate.action / escalate.target for a separate handler:

[[alerts]]
expr     = 'api::http.status is "false"'
count    = 2
channel  = "ops"
action   = "log"
escalate = { after = "5m", action = "exec", target = "/usr/local/bin/page-oncall" }

A single channel handler receives both fire and escalation. To split them onto two channels (casual + paging), declare a second rule with the same expression, longer count, and the paging channel.
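
The escalate.after timing can be pictured as a check against the time of the first fire. The sketch below is an assumption-laden illustration: escalation_due is a hypothetical function, and sending the escalation at most once per firing episode is assumed rather than documented here.

```python
from datetime import datetime, timedelta
from typing import Optional


def escalation_due(first_fire: Optional[datetime], now: datetime,
                   after: timedelta, escalated: bool) -> bool:
    """True when an escalate event should be dispatched on this tick."""
    if first_fire is None:  # rule is not currently firing
        return False
    if escalated:           # assumption: escalate at most once per episode
        return False
    return now - first_fire >= after


first_fire = datetime(2024, 1, 1, 12, 0, 0)
after = timedelta(minutes=5)
print(escalation_due(first_fire, first_fire + timedelta(minutes=4), after, False))  # False
print(escalation_due(first_fire, first_fire + timedelta(minutes=5), after, False))  # True
print(escalation_due(first_fire, first_fire + timedelta(minutes=9), after, True))   # False
```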

Cross-rule references

A rule expression can reference another rule's firing state via the firing("name") subquery. Combine with && for gates, || for fallbacks, or mix with metric thresholds. See nanook-expr · Subqueries.

[[alerts]]
name    = "hot_cpu"
expr    = "cpu.usage > 90%"
count   = 3
channel = "ops"

# child fires only while hot_cpu is firing AND mem is also high
[[alerts]]
expr    = 'firing("hot_cpu") && mem.usage > 80%'
count   = 1
channel = "ops"

References resolve by rule name. Because name defaults to the rule expression, an unnamed rule can be referenced by its predicate text. An unknown name is a hard error at load: nanook check surfaces a nanook::engine::unknown_rule_ref diagnostic anchored to the offending firing(...) call. The dependency graph is derived automatically from the parsed expressions.
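
To make the resolution rule concrete, here is a rough Python sketch that collects firing("...") references and rejects unknown names. It is a simplification under stated assumptions: the real engine works from the parsed expression, whereas this sketch uses a regular expression, and rule_name and dependency_graph are hypothetical helpers.

```python
import re

# Simplification: a regex stands in for the real expression parser.
FIRING_REF = re.compile(r'firing\("([^"]+)"\)')


def rule_name(rule: dict) -> str:
    """A rule's name defaults to its expression text."""
    return rule.get("name") or rule["expr"]


def dependency_graph(rules: list) -> dict:
    """Map each rule name to the rule names its expression references."""
    known = {rule_name(r) for r in rules}
    graph = {}
    for rule in rules:
        refs = FIRING_REF.findall(rule["expr"])
        unknown = [ref for ref in refs if ref not in known]
        if unknown:
            # Stands in for the load-time unknown-reference error.
            raise ValueError(f"unknown rule reference(s): {unknown}")
        graph[rule_name(rule)] = refs
    return graph


rules = [
    {"name": "hot_cpu", "expr": "cpu.usage > 90%"},
    {"expr": 'firing("hot_cpu") && mem.usage > 80%'},
]
graph = dependency_graph(rules)
print(graph)
```

The unnamed second rule ends up keyed by its own expression text, which is also the name another rule would use to reference it.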

Silencing

To temporarily mute a rule (e.g. during planned maintenance):

nanook ctl silence "cpu.usage > 90%" 1h     # mute for an hour
nanook ctl unsilence "cpu.usage > 90%"      # bring it back

The expr argument is the full rule expression, not a substring match. It must equal the expression in nanook.toml (whitespace is normalized). Silences accept any duration string, for example 30s, 15m, or 1h30m.
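
The matching and duration rules can be approximated as follows. This is a sketch under assumptions, not nanook's parser: it supports only the s, m, and h units seen in the examples, and it assumes whitespace normalization means collapsing runs of whitespace and trimming the ends. parse_duration, normalize_expr, and matches_rule are hypothetical names.

```python
import re
from datetime import timedelta

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '30s', '15m', or '1h30m' (s/m/h only in this sketch)."""
    if not re.fullmatch(r"(?:\d+[smh])+", text):
        raise ValueError(f"unsupported duration: {text!r}")
    total = timedelta()
    for amount, unit in re.findall(r"(\d+)([smh])", text):
        total += timedelta(**{_UNITS[unit]: int(amount)})
    return total


def normalize_expr(expr: str) -> str:
    """Collapse whitespace runs to single spaces and trim both ends (assumed rule)."""
    return " ".join(expr.split())


def matches_rule(argument: str, rule_expr: str) -> bool:
    """Whole-expression equality after normalization; never a substring match."""
    return normalize_expr(argument) == normalize_expr(rule_expr)


print(parse_duration("1h30m"))                               # 1:30:00
print(matches_rule("cpu.usage  >   90%", "cpu.usage > 90%")) # True
print(matches_rule("cpu.usage", "cpu.usage > 90%"))          # False
```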

Actions

The action field decides how the channel handler delivers the alert. Defaults to log:

Action    What                                       target
log       print through the agent's tracing layer    unused
webhook   POST a JSON payload                        URL
discord   POST a Discord webhook payload             webhook URL
slack     POST a Slack incoming webhook payload      webhook URL
exec      run a command                              binary path

The same five values are accepted on escalate.action. Channel type and rule action are independent: a log channel can carry a rule whose action = "exec" if the handler supports it.

Body templating

The body field on a rule is a nanook-template rendered against the alert event on each fire. Templates are parsed once at config load; a parse error logs a warning and disables the body (the rule still fires, just without the override). An empty body means "no override".

Render context: kind (fire, resolve, escalate), rule, channel, message, trigger.name, trigger.val, trigger.labels.<key>, trigger.source, at (RFC3339).

[[alerts]]
expr    = "cpu.usage > 90%"
channel = "ops"
body    = "{{ trigger.labels.host or \"unknown\" }} hot at {{ trigger.val }}%"

For long bodies, use the @file: include, which works on opt strings:

[channels.ops]
type = "slack"
[channels.ops.opts]
url  = "${SLACK_WEBHOOK_URL}"
body = "@file:./templates/cpu.tpl"

Each action decides what to do with the rendered body:

Action                      Where the body lands
log                         replaces the printed message
exec                        exposed as the NANOOK_ALERT_BODY env var alongside the existing NANOOK_ALERT_* set
webhook / slack / discord   becomes the JSON body (overrides the channel-level body opt, which overrides the action's default payload shape)
plugin                      becomes AlertPayload.message; plugin code reads it like any other field

Precedence for webhook-style channels: rule body > channel body opt > action default. A rule body lets one alert opt out of the channel's house format.
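
The precedence chain reads naturally as a first-non-empty lookup. The sketch below is illustrative only: resolve_body is a hypothetical function, and treating an empty channel body opt as "no override" (as is documented for rule bodies) is an assumption.

```python
from typing import Optional


def resolve_body(rule_body: Optional[str], channel_body: Optional[str],
                 action_default: str) -> str:
    """Pick the payload body: rule body > channel body opt > action default."""
    for candidate in (rule_body, channel_body):
        if candidate:  # None or "" means "no override"
            return candidate
    return action_default


print(resolve_body("rule text", "channel text", "default"))  # rule text
print(resolve_body("", "channel text", "default"))           # channel text
print(resolve_body(None, None, "default"))                   # default
```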

Patterns

Page only when sustained

[[alerts]]
expr     = "cpu.usage > 80%"
count    = 60                # 5m at 5s interval
channel  = "ops"
cooldown = "15m"
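
The count in this pattern is the sustain window divided by the collection interval. A small helper makes the arithmetic explicit; ticks_for is a hypothetical name, and rounding up to a whole tick is an assumption for windows that are not an exact multiple of the interval.

```python
import math
from datetime import timedelta


def ticks_for(window: timedelta, interval: timedelta) -> int:
    """Number of consecutive ticks needed to cover `window` at a given interval."""
    return max(1, math.ceil(window / interval))


print(ticks_for(timedelta(minutes=5), timedelta(seconds=5)))   # 60
print(ticks_for(timedelta(seconds=12), timedelta(seconds=5)))  # 3
```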

Two-tier severity

[[alerts]]
expr    = "disk.usage > 85%"
channel = "log"

[[alerts]]
expr    = "disk.usage > 95%"
channel = "oncall"

Escalate flapping endpoints

[[alerts]]
expr     = 'api::http.status is "false"'
count    = 2
channel  = "ops"
escalate = { after = "10m", action = "slack", target = "${SLACK_ONCALL_URL}" }

Cross-collector predicates

[[alerts]]
expr    = 'cpu.usage > 80% && api::http.latency > 500ms'
count   = 6
channel = "ops"

See also