bb_event_generator.cfg

DESCRIPTION

Big Sister implements alarming on the server side. The agent is responsible for determining whether a system or service is working correctly (green), is critical (yellow) or has failed (red). Other statuses exist but are not relevant to alarming.
The alarming module of the server takes note of this status. Depending on the configuration file adm/bb_event_generator.cfg, the server generates alarms on status changes.
The alarming configuration mainly consists of a set of rules. Each rule consists of a pattern matched against all status changes, a definition of dependencies and a description of the action to be taken when an alarm is raised. The first two elements describe under what circumstances an alarm is raised, while the last one describes how the alarm is actually raised. Using this simple approach, settings can easily be configured for individual checks, for individual hosts or for whole groups.
Associated with each pattern there is a description in the form of a set of definitions. This set of definitions describes what is actually done when a status change matching the pattern occurs. Rules are processed in the order in which they appear in the configuration file and are cumulative if multiple patterns match.
Definitions appearing later in the file will overwrite definitions appearing earlier, e.g.:
*.* mail=alarm@nowhere.org delay=5
*.cpu delay=100
If a status change for myhost.conn is reported then only the first pattern will match resulting in a description of:
mail=alarm@nowhere.org delay=5
while if a status change for myhost.cpu is reported both patterns will match and the resulting description would look like:
mail=alarm@nowhere.org delay=100
thus the mail definition is taken from the first rule, while the delay definition of the second matching rule replaces the competing definition in the first rule.
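The cumulative merging described above can be sketched in a few lines of Python. This is only an illustration, not Big Sister's actual implementation: the helper names parse_rule and effective_settings are hypothetical, patterns are assumed to behave like shell-style globs matched against host.check names, and group patterns (@GROUP) as well as {daytime ...} qualifiers are not handled.

```python
import shlex
from fnmatch import fnmatchcase


def parse_rule(line):
    """Split 'pattern key=value ...' into (pattern, {key: value}).

    shlex keeps quoted values such as check="a or b" together.
    """
    pattern, *definitions = shlex.split(line)
    return pattern, dict(d.split("=", 1) for d in definitions)


def effective_settings(rules, status_name):
    """Merge the definitions of every matching rule, in file order."""
    settings = {}
    for pattern, definitions in rules:
        # Assumption: patterns are treated as shell-style globs.
        if fnmatchcase(status_name, pattern):
            settings.update(definitions)  # later rules override earlier ones
    return settings


RULES = [parse_rule(line) for line in (
    "*.* mail=alarm@nowhere.org delay=5",
    "*.cpu delay=100",
)]

conn_settings = effective_settings(RULES, "myhost.conn")
cpu_settings = effective_settings(RULES, "myhost.cpu")
print(conn_settings)  # {'mail': 'alarm@nowhere.org', 'delay': '5'}
print(cpu_settings)   # {'mail': 'alarm@nowhere.org', 'delay': '100'}
```

Because each matching rule simply updates the accumulated settings, the order of rules in the file is what decides which definition wins.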
It is a good idea to place more general rules near the start of the configuration file and more specific rules near the end. E.g. a rule associated with the pattern *.* works like a set of default settings since it matches every single status change.
Consider
*.* mail=alarm delay=5 down=yellow up=green prio=5

Placed at the very start of the configuration, it initializes the settings for mail, delay, down, up and prio. Later rules may reset any one of these settings while inheriting all the others.

FIELDS

  • ROUTER:
  • PRIO:
    a number between 0 (completely unimportant) and 100 (extremely critical) describing the importance of the alarm. The priority settings can be used in pre-conditions and are otherwise passed through to the alarming methods.
    E.g. for alarms sent via e-mail the priority only appears in the message text and has no influence on how the alarm is treated.
  • DOWN:
    is set to a status color or "never". Status colors equal to or below this color are considered a failure, so an alarm is raised if a status changes from a color above this color to this color or one below it. E.g.
    down=yellow
    will make Big Sister raise an alarm if a status changes from green to yellow or green to red but not if a status changes from green to purple.
  • UP:
    works similarly to down but defines when a status is considered to have gone up. By default up is the next higher color above down. Sometimes it might be useful to redefine this. Consider
    down=yellow up=green
    This will raise an alarm on status change from e.g. green to red. If the status goes up to purple (aka. no information) the alarm will not be cleared. It will only be cleared as soon as we get a green (aka. everything is ok).
  • DELAY:
    is a time in minutes. Whenever an alarm is raised it first goes into a pool of alarms just about to be raised. It stays in this pool for the delay time. If during this time the alarm condition clears (service up) the alarm is silently dropped. If an alarm is still pending after the delay time the alarm is finally sent to the administrator.
  • REPEAT:
    is a time period in minutes. If an alarm stays active for some time, Big Sister sends a reminder message to the recipient of the original alarm message every repeat minutes. It is suggested to use this feature only for really important alarms, since most administrators will probably just get annoyed when continuously reminded of the same failures. Note: This is not related to norepeat in any way.
  • REPEATPRIO:
    is the priority (see above) of reminder messages.
  • NOREPEAT:
    after an alarm is cleared it goes into the pool of old (remembered) alarms and stays there for the norepeat time period. As long as an alarm is either pending, active or remembered no new alarm for the same host/status is raised. The meaning of “norepeat” therefore is: Do not send an alarm again for the same condition for this delay. The norepeat period starts when the alarm is raised. Therefore it is of course possible that the norepeat delay is already over when an alarm gets cleared and therefore the respective alarm is immediately thrown out of the pool of remembered alarms.
  • KEEP:
    is also a time in minutes. After an alarm condition clears the alarm is kept active for the keep time period. Only after this time the administrator will get an “alarm cleared” message and the alarm will go in the pool of old alarms. Do not ask me what this is useful for.
  • MAIL:
    names the recipient (mail address or pager number or whatever depending on the value of the pager variable) alarms are sent to.
  • MSGMAX:
  • PAGER:
    tells Big Sister which program it should use for sending out alarms. The default is "notify" which is a pager program included with Big Sister. It is not a bad idea to just keep this default.
  • TRAP:
    If trap is set, Big Sister will send out an SNMP trap on each alarm raise/clear. The value of trap is of the form
    trap=community@host
    You will find the Big Sister MIB (if you do not know what a MIB is you do not need one) as well as a format file for HP OpenView in the contrib directory of the source distribution.
  • POSTPONE:
    is a time period in minutes. After an alarm becomes active Big Sister waits for postpone minutes before it really sends out a message. If during this period the alarm is cleared it is silently dropped without a message. This is nearly the same as delay.
  • POSTPONE_TO:
    works exactly like postpone, except that the value is not a time period in minutes but an absolute time of day, e.g.
    postpone_to=06:00
    will postpone alarms to 6 am. Note that the time is in 24h notation so 8pm for instance is 20:00, not 08:00pm.
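The interplay of the DOWN and UP fields described above can be illustrated with a small Python sketch. It is a hedged model, not the event generator's real logic: the color ordering red < yellow < purple < green is an assumption inferred from the examples in the field descriptions (Big Sister may know further colors), and the function names are hypothetical.

```python
# Assumed ordering from worst to best, inferred from the field descriptions;
# the real ordering used by Big Sister may contain more colors.
SEVERITY = ["red", "yellow", "purple", "green"]


def rank(color):
    """Position of a color in the assumed ordering (higher is better)."""
    return SEVERITY.index(color)


def default_up(down):
    """By default, up is the next higher color above down."""
    return SEVERITY[min(rank(down) + 1, len(SEVERITY) - 1)]


def raises_alarm(old, new, down):
    """True if the status moves from above 'down' to 'down' or below."""
    if down == "never":
        return False
    return rank(old) > rank(down) and rank(new) <= rank(down)


def clears_alarm(new, up):
    """True if the new status has reached 'up' or better (assumption)."""
    return rank(new) >= rank(up)


print(raises_alarm("green", "red", "yellow"))     # True
print(raises_alarm("green", "purple", "yellow"))  # False
print(clears_alarm("purple", "green"))            # False
print(clears_alarm("green", "green"))             # True
```

Under this model, down=yellow alone would clear an alarm as soon as the status reaches purple, which is why setting up=green explicitly makes a difference.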

EXAMPLES

    Usually you will put a general rule with a pattern matching any host/check and the default variable values as your first rule, e.g.:
    # default values
    *.* mail=alarm prio=50 norepeat=20 down=yellow up=green maxmsg=60
    If you do not want to get an alarm about e.g. smtp being down when you already know that the connection to the host is down, then you could use the following rule for instance:
    *.smtp delay=5 check="$host.conn"
    (Semantics: if "conn" goes down within 5 minutes after smtp down is detected then throw away the smtp alarm, otherwise send it after 5 minutes.)
    If your very important machines are in a group called "IMPORTANT" then you may wish to do something like:
    @IMPORTANT.* prio=100 repeat=30 repeatprio=60
    (Semantics: if a service of a machine in the group IMPORTANT goes down then send an alarm with priority 100 and send a reminder with priority 60 every 30 minutes ("yell for help").)
    If the machines in a group EAST are all located in a network connected to the router "router-east" then you may get plenty of alarms when "router-east" goes down, since every machine behind it becomes unreachable. You can avoid this by e.g.:
    @EAST.conn check=router-east.conn delay=5
    router-east.conn check="1" delay=0
    (Semantics: if a host is in group EAST and the connection to it goes down, wait for five minutes, and if within these five minutes the connection to router-east is lost too then do not send an alarm for this host. If the host is the router itself, send an alarm immediately.)
    Alternatively:
    @EAST.* router=router-east
    *.conn check="($router.conn) or not $router" delay=5
    router-east.conn check="1" delay=0
    (Semantics: if a host is in group EAST, set the variable "router" to "router-east". If the connection to any host goes down then wait for five minutes and check whether either there is no router configured for this machine or the connection to the router has gone down as well. Discard the alarm if the router has gone down, except of course when the machine is the router itself.)
    NOTE: you cannot (yet) use variables in patterns, so the example above cannot be written as:
    @EAST.* router=router-east
    $router.conn check=1 delay=0
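The way delay and check work together in the router examples can be modelled roughly as follows. This is a sketch under stated assumptions rather than the event generator's code: the function decide and its parameters are hypothetical, and it assumes that the check expression is evaluated once the delay has expired and that an alarm is only sent if the expression is true at that point.

```python
def decide(minutes_pending, delay, still_down, check_ok):
    """Return 'wait', 'drop' or 'send' for a pending alarm.

    minutes_pending: minutes since the failure was detected
    delay:           value of the delay setting, in minutes
    still_down:      whether the alarm condition still holds
    check_ok:        result of the check expression (assumed to be
                     evaluated when the delay has expired)
    """
    if not still_down:
        return "drop"  # condition cleared during the delay
    if minutes_pending < delay:
        return "wait"  # still sitting in the pool of pending alarms
    return "send" if check_ok else "drop"


# Host behind router-east, router also unreachable after five minutes:
print(decide(5, 5, still_down=True, check_ok=False))  # drop
# Same host, router still reachable:
print(decide(5, 5, still_down=True, check_ok=True))   # send
# The router itself (check="1", delay=0):
print(decide(0, 0, still_down=True, check_ok=True))   # send
```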
    Postpone is used during times when system failures are less important, e.g. at night. You can postpone alarms for a time interval:
    *.*{daytime 22:00-06:00} postpone=60
    This tells the event generator to keep a newly raised alarm in the postpone queue for one hour before sending an alarm mail. If the alarm condition clears during this time, no alarm is sent at all. If you never want to be woken up by alarms, then
    *.*{daytime 22:00-06:00} postpone_to=06:00
    might be what you want. (Semantics: when an alarm is detected during the night, send it at 06:00.)
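The time arithmetic behind the two postpone examples above can be sketched in Python. The function names are hypothetical and the details are assumptions: the daytime range is treated as wrapping past midnight with an exclusive end, and postpone_to is taken to mean the next moment at which the clock shows the given time.

```python
from datetime import datetime, time, timedelta


def parse_hhmm(text):
    """Parse a 24h 'HH:MM' string into a time object."""
    hours, minutes = text.split(":")
    return time(int(hours), int(minutes))


def in_daytime_range(now, start, end):
    """True if now's time of day lies in [start, end), wrapping past midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def postpone_to(now, target):
    """Next moment at which the clock reads 'target'."""
    candidate = datetime.combine(now.date(), target)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


night = datetime(2024, 1, 1, 23, 30)
start, end = parse_hhmm("22:00"), parse_hhmm("06:00")
if in_daytime_range(night, start, end):
    print(postpone_to(night, parse_hhmm("06:00")))  # 2024-01-02 06:00:00
```

With postpone=60 the same alarm would instead become due at night + timedelta(minutes=60).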