Alerts
Rules, thresholds, escalation, silencing, cooldown.
A rule fires when its expression is true for count consecutive ticks. On fire, nanook dispatches an alert event to the rule's channel. If escalate.after is set and the rule keeps firing past that delay, an escalation event goes out too.
Anatomy
[[]]
= "hot_cpu" # optional id, defaults to expr; used by `firing(...)`
= "cpu.usage > 90%" # required, nanook-expr
= 3 # consecutive ticks before firing (default 1)
= "ops" # required, channel id
= "log" # log | webhook | discord | slack | exec (default log)
= "" # action-specific target (URL, command path, ...)
= "5m" # min gap between consecutive fires
= { = "10m", = "exec", = "/usr/local/bin/page" }
How a rule evaluates
- The engine indexes every selector in every rule.
- On each tick, rules whose selectors got new data are evaluated.
- If expr is true, the rule's consecutive counter increments.
- At count the rule fires. A false tick resets the counter.
count is a sensitivity dial: 1 is twitchy, 12 is patient. A missing or zero count is treated as 1.
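The counting behaviour described above can be modelled in a few lines of Python. This is a minimal sketch, not nanook's implementation: the RuleState class and its tick method are hypothetical names, and the choice to report "firing" on every tick at or past count is an assumption drawn from the cooldown description.

```python
from dataclasses import dataclass


@dataclass
class RuleState:
    """Hypothetical per-rule state; not nanook's actual type."""

    count: int = 1        # consecutive true ticks required before firing
    consecutive: int = 0  # current run of true ticks

    def tick(self, expr_is_true: bool) -> bool:
        """Feed one tick's result; return True when the run has reached count."""
        threshold = self.count if self.count > 0 else 1  # missing/zero count -> 1
        if not expr_is_true:
            self.consecutive = 0  # a false tick resets the counter
            return False
        self.consecutive += 1
        return self.consecutive >= threshold


state = RuleState(count=3)
results = [state.tick(v) for v in [True, True, False, True, True, True]]
print(results)  # [False, False, False, False, False, True]
```

The false tick in the third position resets the run, so the rule only reaches its threshold on the sixth tick.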
Cooldown
cooldown is the minimum gap between consecutive fires of the same rule. Without it, a rule whose predicate stays true fires on every tick.
[[]]
= "disk.usage > 95%"
= "30m"
= "ops"
After a fire, the rule stays muted for 30 minutes even if the predicate is still true. The counter keeps incrementing, but dispatch is suppressed.
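As an illustration of the suppression window, the sketch below gates dispatch on the time since the last fire. The function name should_dispatch is hypothetical, and treating the boundary as inclusive (a fire exactly 30 minutes later is allowed) is an assumption; the engine may handle the edge differently.

```python
from typing import Optional


def should_dispatch(now: float, last_fire: Optional[float], cooldown_s: float) -> bool:
    """True when nothing has fired yet or the cooldown gap has elapsed."""
    return last_fire is None or (now - last_fire) >= cooldown_s


COOLDOWN_S = 30 * 60  # "30m" expressed in seconds
last_fire = None
dispatched = []

# The predicate is true on every tick; ticks arrive every 10 minutes here.
for minute in range(0, 70, 10):
    now = minute * 60.0
    if should_dispatch(now, last_fire, COOLDOWN_S):
        dispatched.append(minute)
        last_fire = now

print(dispatched)  # [0, 30, 60]
```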
Escalation
If a rule stays firing for escalate.after past its first fire, the engine dispatches an escalate event on the same channel, carrying escalate.action / escalate.target for a separate handler:
[[]]
= 'api::http.status is "false"'
= 2
= "ops"
= "log"
= { = "5m", = "exec", = "/usr/local/bin/page-oncall" }
A single channel handler receives both fire and escalation. To split them onto two channels (casual + paging), declare a second rule with the same expression, a longer count, and the paging channel.
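The timing of the escalate event can be sketched as a small state machine. EscalationTracker is a hypothetical name, and two behaviours here are assumptions rather than documented facts: only one escalate event is emitted per continuous firing episode, and the episode ends as soon as the rule stops firing. Repeat fires and cooldown are left out to keep the model small.

```python
from typing import List, Optional


class EscalationTracker:
    """Hypothetical model: one escalate event per continuous firing episode."""

    def __init__(self, escalate_after_s: float):
        self.escalate_after_s = escalate_after_s
        self.first_fire: Optional[float] = None
        self.escalated = False

    def observe(self, now: float, firing: bool) -> List[str]:
        if not firing:
            # Episode over: forget the first fire so the next one starts fresh.
            self.first_fire = None
            self.escalated = False
            return []
        events = []
        if self.first_fire is None:
            self.first_fire = now
            events.append("fire")
        if not self.escalated and now - self.first_fire >= self.escalate_after_s:
            self.escalated = True
            events.append("escalate")
        return events


tracker = EscalationTracker(escalate_after_s=5 * 60)  # "5m"
timeline = [(0, True), (120, True), (300, True), (360, True), (420, False), (480, True)]
log = [(t, tracker.observe(t, firing)) for t, firing in timeline]
print(log)
```

In this run the escalate event appears at t=300, five minutes after the first fire, and a new episode begins at t=480.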
Cross-rule references
A rule expression can reference another rule's firing state via the firing("name") subquery. Combine with && for gates, || for fallbacks, or mix with metric thresholds. See nanook-expr · Subqueries.
[[]]
= "hot_cpu"
= "cpu.usage > 90%"
= 3
= "ops"
# child fires only while hot_cpu is firing AND mem is also high
[[]]
= 'firing("hot_cpu") && mem.usage > 80%'
= 1
= "ops"
References resolve by name (defaults to the rule expression, so unnamed rules can be referenced by their predicate text). Unknown names are a hard error at load: nanook check surfaces a nanook::engine::unknown_rule_ref diagnostic anchored to the offending firing(...) call. The dependency graph is derived from the parsed expression automatically.
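To make the load-time check concrete, here is a rough sketch of deriving the dependency graph and spotting unknown references. It uses a regular expression where nanook uses a real parser, so it would miss references whose names contain quotes. The functions dependencies and unknown_refs are hypothetical, and rules are modelled as (name, expr) pairs with the name already defaulted to the expression.

```python
import re
from typing import Dict, List, Set, Tuple

# Simplified stand-in for the real expression parser.
FIRING_REF = re.compile(r'firing\(\s*"([^"]*)"\s*\)')

Rule = Tuple[str, str]  # (name, expr); hypothetical shape for illustration


def dependencies(rules: List[Rule]) -> Dict[str, Set[str]]:
    """Map each rule name to the rule names its expression references."""
    return {name: set(FIRING_REF.findall(expr)) for name, expr in rules}


def unknown_refs(rules: List[Rule]) -> List[Tuple[str, str]]:
    """Return (rule, missing reference) pairs that would fail at load."""
    known = {name for name, _ in rules}
    graph = dependencies(rules)
    return sorted(
        (name, ref) for name, refs in graph.items() for ref in refs if ref not in known
    )


child = 'firing("hot_cpu") && mem.usage > 80%'
rules = [
    ("hot_cpu", "cpu.usage > 90%"),
    (child, child),                    # unnamed rule: name defaults to its expression
    ("typo", 'firing("hot_cpux")'),    # deliberately misspelled reference
]
graph = dependencies(rules)
missing = unknown_refs(rules)
print(missing)  # [('typo', 'hot_cpux')]
```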
Silencing
To temporarily mute a rule (e.g. during planned maintenance):
The expr argument is the rule expression, not a substring match. It must equal what's in nanook.toml (whitespace is normalized). Silences accept any duration string: 30s, 15m, 1h30m.
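The two matching rules above, whitespace normalization and compound duration strings, can be illustrated as follows. Both helpers are hypothetical sketches: the only units assumed are s, m, and h because those are the ones shown, and the normalization here just collapses whitespace runs, which may be narrower or broader than what nanook does.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600}
_PART = re.compile(r"(\d+)([smh])")


def parse_duration(text: str) -> int:
    """Return seconds for strings such as 30s, 15m, 1h30m."""
    if not re.fullmatch(r"(?:\d+[smh])+", text):
        raise ValueError(f"not a duration: {text!r}")
    return sum(int(n) * _UNITS[u] for n, u in _PART.findall(text))


def normalize_expr(expr: str) -> str:
    """Collapse whitespace runs so spacing differences do not break the match."""
    return " ".join(expr.split())


print(parse_duration("1h30m"))                 # 5400
print(normalize_expr("cpu.usage   >  90% "))   # cpu.usage > 90%
```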
Actions
The action field decides how the channel handler delivers the alert. Defaults to log:
| Action | What | target |
|---|---|---|
| log | print through the agent's tracing layer | unused |
| webhook | POST a JSON payload | URL |
| discord | POST a Discord webhook payload | webhook URL |
| slack | POST a Slack incoming webhook payload | webhook URL |
| exec | run a command | binary path |
The same five values are accepted on escalate.action. Channel type and rule action are independent: a log channel can carry a rule whose action = "exec" if the handler supports it.
Body templating
The body field on a rule is a nanook-template rendered against the alert event on each fire. Templates parse once at config-load; a parse error logs a warning and disables the body (the rule still fires, just without the override). An empty body means no override.
Render context: kind (fire, resolve, escalate), rule, channel, message, trigger.name, trigger.val, trigger.labels.<key>, trigger.source, at (RFC3339).
[[]]
= "cpu.usage > 90%"
= "ops"
= "{{ trigger.labels.host or \"unknown\" }} hot at {{ trigger.val }}%"
For long bodies, use the @file: include (works on opt strings):
[]
= "slack"
[]
= "${SLACK_WEBHOOK_URL}"
= "@file:./templates/cpu.tpl"
Each action decides what to do with the rendered body:
| Action | Where the body lands |
|---|---|
| log | replaces the printed message |
| exec | exposed as the NANOOK_ALERT_BODY env var alongside the existing NANOOK_ALERT_* set |
| webhook / slack / discord | becomes the JSON body (overrides the channel-level body opt, which overrides the action's default payload shape) |
| plugin | becomes AlertPayload.message; plugin code reads it like any other field |
Precedence for webhook-style channels: rule body > channel body opt > action default. A rule body lets one alert opt out of the channel's house format.
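The precedence chain reads naturally as a first-non-empty lookup. The sketch below is illustrative only: effective_body is a hypothetical helper, and the example strings are made up. It treats both a missing and an empty body as "no override", matching the rule stated for rule bodies; applying the same treatment to the channel opt is an assumption.

```python
from typing import Optional


def effective_body(
    rule_body: Optional[str], channel_body: Optional[str], action_default: str
) -> str:
    """Pick the first non-empty body: rule, then channel opt, then action default."""
    for candidate in (rule_body, channel_body):
        if candidate:  # None and "" both mean "no override"
            return candidate
    return action_default


print(effective_body("rule text", "house format", "default payload"))  # rule text
print(effective_body("", "house format", "default payload"))           # house format
print(effective_body(None, None, "default payload"))                   # default payload
```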
Patterns
Page only when sustained
[[]]
= "cpu.usage > 80%"
= 60 # 5m at 5s interval
= "ops"
= "15m"
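The count in this pattern comes from dividing the desired sustain window by the collection interval. A small helper makes the arithmetic explicit; count_for is a hypothetical name, and rounding up for windows that are not a whole multiple of the interval is a choice made here, not documented behaviour.

```python
import math


def count_for(sustain_s: float, interval_s: float) -> int:
    """Consecutive ticks needed to cover at least sustain_s at the given interval."""
    return max(1, math.ceil(sustain_s / interval_s))


print(count_for(5 * 60, 5))  # 60
```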
Two-tier severity
[[]]
= "disk.usage > 85%"
= "log"
[[]]
= "disk.usage > 95%"
= "oncall"
Escalate flapping endpoints
[[]]
= 'api::http.status is "false"'
= 2
= "ops"
= { = "10m", = "slack", = "${SLACK_ONCALL_URL}" }
Cross-collector predicates
[[]]
= 'cpu.usage > 80% && api::http.latency > 500ms'
= 6
= "ops"
See also
- nanook-expr · predicate language
- nanook-templates · body language
- Channels · delivery destinations
- ctl · silence, pause, trigger from the command line