Configure Alarming

Big Sister implements alarming on the server side. The agent is responsible for determining whether a system or service is working correctly ("green"), is critical ("yellow") or has failed ("red"). Other statuses exist but are not relevant to alarming.

The alarming module of the server takes note of this status. Depending on the configuration file bb_event_generator.cfg, the server generates alarms on status changes.

The alarming configuration mainly consists of a set of rules. Each rule consists of a pattern matched against all status changes, a definition of dependencies and a description of the action to be taken when an alarm is raised. The first two elements describe under what circumstances an alarm is to be raised, while the last one describes how the alarm is actually raised.

Using this simple approach a few things can easily be configured either for individual checks, for individual hosts or for whole groups:

The main disadvantage of the existing rule based alarming configuration is that it is very hard to find a simple way to explain how it works. Unfortunately you will just have to read the whole section and hopefully understand the configuration at the end.

Figure 3.1. Status Changes result in Alarms

Rules

An alarming rule in the bb_event_generator.cfg file always starts with a pattern, followed by a description of the actions to take if the pattern matches.

Whenever a status change is detected, the alarm generator goes through the config file and looks for matching patterns. Each variable associated with a matching pattern is then set as described. If multiple patterns match, the associated variables are set in the order in which the rules appear.

Every time a status change is noticed the alarm generator does two things:

  • go through the pending alarms and check if the status change has some effect on one of them

  • if the status change is not related to one of the pending alarms: go through the list of rules, select all the matching rules and raise an alarm depending on their descriptive part

Usually each line in the configuration file represents one rule. As in most Big Sister configuration files, empty lines and lines starting with a '#' character are treated as comments and are therefore simply ignored. A rule may span multiple lines: lines terminated with a '\' character are joined with the line that follows.
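The line handling described above can be sketched in a few lines of Python. This is only an illustration, not Big Sister's own parser: the function name read_rules is hypothetical, and stripping leading whitespace before testing for '#' as well as joining continued lines with a single space are assumptions.

```python
def read_rules(text):
    """Return the logical rule lines of a rule file given as one string."""
    rules = []
    pending = ""  # physical lines joined so far via a trailing backslash
    for raw in text.splitlines():
        line = raw.strip()
        if not pending and (not line or line.startswith("#")):
            continue  # empty lines and comment lines are ignored
        if line.endswith("\\"):
            # Assumption: continued lines are joined with one space.
            pending += line[:-1].rstrip() + " "
            continue
        rules.append(pending + line)
        pending = ""
    if pending.strip():
        rules.append(pending.strip())  # file ended on a continuation line
    return rules


sample = """# defaults
*.* mail=nobody

*.*{daytime 17:00-07:00 or weekday Sat,Sun} \\
    down=never
"""
for rule in read_rules(sample):
    print(rule)
```

The sample yields two logical rules: the comment and the empty line are dropped, and the last two physical lines are joined into one rule.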

Patterns - "when to do things"

The simplest form of a pattern is a host.check pattern.

Simple rule

 foo.cpu mail=nobody

Here foo.cpu is the pattern and mail=nobody is the description. The pattern matches only status changes for the cpu check on the host foo, and the rule tells Big Sister to send alarms for foo.cpu to the user nobody.

Using wildcard in rules

Now let's assume you do not want to list each individual system and check in the rule file. The alarm generator accepts one single wildcard, *, which matches any host or any check. This rule matches the cpu check on any host:

 *.cpu mail=nobody

This one matches any check on the host foo:

 foo.* mail=nobody

And this one matches any status change for any host:

 *.* mail=nobody

Using groups in rules

Of course you may want to address a group of hosts - haven't you spent hours setting up groups after reading section 2.2.3? These same groups are also visible to the alarm generator.

By prefixing a host name with a '@' character you tell Big Sister to match a group rather than a single host, so that a rule like

 @USA.* mail=nobody

for instance applies to any status change reported for any system that is a member of the group USA.
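The three pattern forms seen so far (plain host, wildcard and group) can be illustrated with a small Python sketch. It is not taken from Big Sister: the function name pattern_matches, the groups dictionary and the choice to split the pattern at its last dot are all assumptions made for illustration, and pre-conditions in braces are not handled.

```python
def pattern_matches(pattern, host, check, groups):
    """Tell whether a host.check pattern applies to a status change.

    groups maps a group name to the set of its member hosts.
    """
    # Assumption: the check name is whatever follows the last dot, so that
    # host names containing dots still work.
    host_part, _, check_part = pattern.rpartition(".")
    if check_part != "*" and check_part != check:
        return False
    if host_part == "*":
        return True  # wildcard: any host
    if host_part.startswith("@"):
        # '@' prefix: match any member of the named group
        return host in groups.get(host_part[1:], set())
    return host_part == host


groups = {"USA": {"foo", "bar"}}  # hypothetical group membership
print(pattern_matches("foo.cpu", "foo", "cpu", groups))
print(pattern_matches("*.cpu", "baz", "cpu", groups))
print(pattern_matches("@USA.*", "bar", "disk", groups))
print(pattern_matches("@USA.*", "baz", "disk", groups))
```

With the hypothetical group above, the first three calls report a match and the last one does not, because baz is not a member of USA.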

Using criteria in rules

So far so good. Sometimes it is very useful to make alarming behave differently depending on when a status change is detected - maybe you just refuse to be woken up by your beeper during the night, or you want to get alarms via another medium during working hours.

For this purpose the patterns can contain so-called pre-conditions. In the rule

 @USA.*{weekday Sat,Sun} mail=pikett

the part in curly braces is a pre-condition. The rule will only match status changes reported during the weekend for any system that is a member of the group USA. Another useful pre-condition is the daytime condition. This rule

 *.*{daytime 17:00-07:00} down=never

for instance will suppress (down=never) any status change reported between 5pm and 7am. Of course conditions can be combined using and and or, so

 *.*{daytime 17:00-07:00 or weekday Sat,Sun} \
     down=never

will suppress any status change reported between 5pm and 7am or during the weekend.
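A daytime range such as 17:00-07:00 wraps past midnight, which is easy to get wrong when reasoning about such rules. The following Python sketch shows one way to evaluate the two conditions; it is an illustration only, the function names are hypothetical, and treating the start of the range as inclusive and the end as exclusive is an assumption.

```python
from datetime import datetime, time

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def in_daytime(spec, moment):
    """True if moment's time of day lies in a range such as '17:00-07:00'."""
    start_text, end_text = spec.split("-")
    start = time.fromisoformat(start_text)
    end = time.fromisoformat(end_text)
    now = moment.time()
    if start <= end:
        return start <= now < end
    # The range wraps past midnight, e.g. 17:00-07:00.
    return now >= start or now < end


def on_weekday(spec, moment):
    """True if moment falls on one of the days in a list such as 'Sat,Sun'."""
    return DAY_NAMES[moment.weekday()] in spec.split(",")


def suppressed(moment):
    """Mirror of the condition {daytime 17:00-07:00 or weekday Sat,Sun}."""
    return in_daytime("17:00-07:00", moment) or on_weekday("Sat,Sun", moment)


print(suppressed(datetime(2024, 1, 3, 12, 0)))   # a Wednesday at noon
print(suppressed(datetime(2024, 1, 3, 23, 30)))  # a Wednesday night
print(suppressed(datetime(2024, 1, 6, 12, 0)))   # a Saturday at noon
```

Only the first moment falls outside both conditions, so only there would the rule not apply.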

Description - "what to do"

Associated with each pattern there is a description in the form of a set of definitions. These definitions describe what will actually be done if a status change matching the pattern occurs. The rules are processed in the order they appear in the configuration file, and if multiple patterns match, all their definitions accumulate.

Warning

Definitions appearing later in the file overwrite definitions appearing earlier.

Example 3.2.  Example default values

 *.* mail=alarm@nowhere.org delay=5
 *.cpu delay=100

If a status change for myhost.conn is reported then only the first pattern will match resulting in a description of:

 mail=alarm@nowhere.org delay=5

while if a status change for myhost.cpu is reported both patterns will match and the resulting description would look like:

 mail=alarm@nowhere.org delay=100

Thus the mail definition is taken from the first rule, while the delay definition of the second matching rule replaces the conflicting definition in the first rule.
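The way definitions accumulate and override each other can be pictured as a dictionary that is updated once per matching rule, in file order. The Python sketch below illustrates this with two hypothetical rules; the function names are invented for the example, and the matcher is deliberately reduced to plain names and the * wildcard (no groups, no pre-conditions).

```python
def parse_definitions(description):
    """Turn 'mail=a delay=5' into {'mail': 'a', 'delay': '5'}."""
    return dict(item.split("=", 1) for item in description.split())


def simple_match(pattern, host, check):
    """Match host.check with '*' as the only wildcard."""
    host_part, _, check_part = pattern.rpartition(".")
    return host_part in ("*", host) and check_part in ("*", check)


def effective_description(rules, host, check):
    """Merge the definitions of all matching rules, later rules winning."""
    merged = {}
    for pattern, description in rules:
        if simple_match(pattern, host, check):
            merged.update(parse_definitions(description))
    return merged


# Hypothetical rules: general defaults first, a more specific rule later.
rules = [
    ("*.*", "mail=alarm delay=5 down=yellow up=green prio=5"),
    ("*.disk", "delay=30"),
]
print(effective_description(rules, "myhost", "conn"))
print(effective_description(rules, "myhost", "disk"))
```

For myhost.conn only the defaults apply; for myhost.disk the delay is replaced by 30 while mail, down, up and prio are inherited from the first rule.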

Tip

It is a good idea to place more general rules near the start of the configuration file and more specific rules near the end. For example, a rule associated with the pattern *.* works like a set of default settings, since it matches every single status change.

 *.* mail=alarm delay=5 down=yellow up=green prio=5

Placed at the very start of the configuration, it initializes the settings for mail, delay, down, up and prio. Later rules may reset one of these settings while inheriting all the others.

PAGER rules: influencing alarm delivery

Version 0.98 introduced special rules that modify the way alarms are delivered. For example, the rules

PAGER{$mail eq "someaddress@somehost"} pager=myscript mail=someaddress
PAGER{$pager eq "sendmail" and $mail eq "test"} mail=addr1,addr2,addr3

will redirect alarms sent to someaddress@somehost to the address someaddress and invoke myscript to send the alarm (first rule). If the pager equals sendmail and the target address is test, the alarm is redirected to the three addresses addr1, addr2 and addr3 (second rule).

The PAGER rules are applied once per target address, so the above rules would also apply if the page originated from a rule such as

  *.* mail=someaddress@somehost,test pager=sendmail

(this would cause the event generator to apply the PAGER rules twice: once for "someaddress@somehost", once for "test")

Note

Target addresses are split into "pager" and "mail" before PAGER rules are processed. An address like

mail=sendmail:test@somewhat.strange

will appear to the PAGER rules as mail=test@somewhat.strange pager=sendmail
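The splitting step described in this note can be sketched as follows. This is an illustration in Python, not Big Sister's code: the function name split_targets is hypothetical, and using the first ':' as the separator between pager and address is an assumption based on the example above. Addresses without a prefix fall back to the rule's pager setting, which defaults to sendmail.

```python
def split_targets(mail_value, pager="sendmail"):
    """Split a mail definition into (pager, mail) pairs, one per address."""
    targets = []
    for address in mail_value.split(","):
        prefix, separator, rest = address.partition(":")
        if separator:
            targets.append((prefix, rest))    # explicit pager prefix
        else:
            targets.append((pager, address))  # no prefix: use the pager setting
    return targets


for target in split_targets("sendmail:test@somewhat.strange,myscript:someaddress,test"):
    print(target)
```

Each resulting pair corresponds to one pass through the PAGER rules, with $pager and $mail set to the two elements of the pair.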

Definitions and their meaning

Table 3.2. The alarm generator knows more definitions than just mail and pager settings

Setting      Purpose
mail         the recipient address of an alert message. Multiple addresses are separated by commas; each address may be prefixed by a pager
upmail       specifies the recipients of the message that is sent when the alarm condition is cleared ("up"-Mail). If not specified, upmail defaults to the same recipient list as specified in the mail definition
prio         priority level (0..100). This is not used by the alarm generator any more; instead you can use it in conditional rules
repeat       if set, the alarm generator sends the alert message again every specified number of minutes until the alarm condition has cleared
repeatprio   same as prio, but gives the priority level for repeated alerts
keep         the number of minutes the alarm generator keeps the alarm uncleared after the alarm condition indicates that the service is up again
norepeat     the number of minutes new alerts for the same condition are suppressed
delay        the number of minutes between the moment the alarm condition is noticed by the alarm generator and the moment an actual alert message is sent. If the alarm condition clears within this delay, no message is sent
check        a boolean expression that is checked during the delay time interval. If the check condition evaluates to false at least once during that interval, no alert message is sent
down         one of "green", "purple", "yellow", "red", "never": tells the alarm generator which status should be regarded as "down". For example, "yellow" means that if a service status goes yellow or worse (red), the corresponding service is to be considered down
up           like down: tells the alarm generator which status should be considered "up". For example, "down=yellow up=green" means that a service is considered down from the time it changes to yellow or red until the time it goes to green again (but not if it merely changes to purple)
maxmsg       a numeric value giving the maximum size of the shortened alarm message used in the subject of the alert
postpone     if set, alarms are held in the queue for an additional x minutes rather than being sent. If the alarm condition is cleared during the postpone period, the alarm is silently dropped without an alert message ever being sent. Postpone is meant to be used at night, when you do not want to be woken up by alert messages but do want an alert as soon as you get to work if the problem persists
postpone_to  basically the same as postpone, but the value is a time of day rather than a time interval (e.g. "06:00")
pager        use the specified pager program rather than the default (sendmail)
skin         use the skin specified here for alert messages
trap         if set, the alarm generator sends a trap to the specified trap destination whenever it sends alert/acknowledgement messages. The value of trap should be something like community@host
maxpermin    limits the number of alert messages sent in one minute to this value. When this limit is exceeded, alert messages are silently discarded (but the alarm is handled and appears on the alarming page). As of rev 1.01 this defaults to 20
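To make the delay definition more concrete, here is a minimal Python sketch of the decision it describes: the alert goes out once the delay has elapsed, unless the alarm condition cleared in the meantime. The function name alert_due and its parameters are hypothetical and only illustrate the wording of the table; the real alarm generator also takes check, postpone and the other definitions into account.

```python
from datetime import datetime, timedelta


def alert_due(noticed_at, cleared_at, delay_minutes, now):
    """Decide whether the alert message for a pending alarm should go out.

    cleared_at is None while the alarm condition still holds.
    """
    send_at = noticed_at + timedelta(minutes=delay_minutes)
    if cleared_at is not None and cleared_at <= send_at:
        return False  # the condition cleared within the delay: no message
    return now >= send_at


noticed = datetime(2024, 1, 3, 12, 0)
# delay=5, condition still present after 3 minutes: not yet due
print(alert_due(noticed, None, 5, datetime(2024, 1, 3, 12, 3)))
# delay=5, condition still present after 5 minutes: due
print(alert_due(noticed, None, 5, datetime(2024, 1, 3, 12, 5)))
# delay=5, condition cleared after 4 minutes: never sent
print(alert_due(noticed, datetime(2024, 1, 3, 12, 4), 5, datetime(2024, 1, 3, 12, 6)))
```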