bb_event_generator.cfg
DESCRIPTION
Big Sister implements alarming in a server-based manner. The agent is responsible
for determining if a system or service is working correctly (green), if it is critical
(yellow) or if it has failed (red). Other statuses exist but are not relevant to alarming.
The alarming module of the server takes note of this status. Depending on
the configuration file adm/bb_event_generator.cfg the server generates
alarms on status changes.
The alarming configuration mainly consists of a set of rules. Each rule consists of
a pattern matched against all status changes, a definition of dependencies and a description
of the action to be taken when an alarm is raised. The first two elements
describe under what circumstances an alarm is to be raised while the last one describes
how the alarm is actually raised. Using this simple approach a few things
can easily be configured either for individual checks, for individual hosts or for
whole groups:
- Wait for a defined time period before reporting an alarm and do not report an
alarm if the problem goes away within this period
- Regularly send reminders telling the administrator that a problem persists
until the problem goes away
- Do not repeatedly send alarms for a problem that occurs multiple times
- Behave differently depending on time of day or day of week (e.g. postpone
alarms raised during the night to the early morning)
- Suppress alarms depending on what status other systems/services are in (e.g.
do not report that a system is unreachable when Big Sister already knows
that the whole network the system is connected to is down)
Associated with each pattern there is a description in the form of a set of definitions.
This set of definitions describes what actually will be done if a status change
matching the pattern occurs. The rules will be processed in the order they appear
in the configuration file and are cumulative if multiple patterns match.
Definitions appearing later in the file will overwrite definitions appearing
earlier, e.g.:
*.* mail=alarm@nowhere.org delay=5
*.cpu delay=100
If a status change for myhost.conn is reported then only the first pattern will
match resulting in a description of:
mail=alarm@nowhere.org delay=5
while if a status change for myhost.cpu is reported both patterns will match and
the resulting description would look like:
mail=alarm@nowhere.org delay=100
Thus the mail definition is taken from the first rule, while the delay definition
of the second matching rule replaces the competing definition in the first
rule.
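The cumulative behaviour described above can be sketched in a few lines of Python. This is only an illustration of the merge order, not Big Sister's own code: the RULES list, the resolve function and the use of shell-style wildcard matching are assumptions made for this sketch, and group patterns (starting with "@") are not handled.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory form of the two example rules: (pattern, definitions).
RULES = [
    ("*.*", {"mail": "alarm@nowhere.org", "delay": "5"}),
    ("*.cpu", {"delay": "100"}),
]

def resolve(status_name, rules=RULES):
    """Merge the definitions of every matching rule, in file order."""
    description = {}
    for pattern, definitions in rules:
        # Assumption: shell-style wildcards are close enough for this sketch.
        if fnmatchcase(status_name, pattern):
            # Definitions from later rules overwrite earlier ones.
            description.update(definitions)
    return description

print(resolve("myhost.conn"))
print(resolve("myhost.cpu"))
```

For myhost.conn only the first rule contributes, while for myhost.cpu the delay value of the second rule replaces the one from the first.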
It is a good idea to place more general rules near the start of the configuration
file and more specific rules near the end. For example, a rule with the pattern
*.* works like a set of default settings since it matches every single status change.
Consider
*.* mail=alarm delay=5 down=yellow up=green prio=5
Placed at the very start of the configuration it will initialize the settings for mail,
delay, down, up and prio. Later rules may reset one of these settings while at the
same time inheriting all the other settings.
FIELDS
- COMMENT:
The comment field has no bearing on the operation of the file. This is simply an area where the admin can leave notes about the configuration changes he/she has made.
- LAST MODIFIED:
This is a read only field. This entry is automatically written to the file by the web interface each time the file is updated.
- HOST:
This column is a drop down list of all the known servers within the BS realm. The list is derived from ~bbuser/adm/bb-display.cfg and includes both host names and groups. Selecting an asterisk applies the filter to all hosts (and implicitly all groups). Toggling a hostname or groupname to a null (whitespace) entry will delete that row/rule. Groups are prepended with an "@" symbol and are listed before the hosts.
- AGENT:
This column is a drop down list of all the known agents that BS is monitoring. The list is pulled from the ~bbuser/var/agent.log file. As in the host column, the asterisk applies the filter to all agents.
- TIME:
This column is actually referred to as a "pre-condition" within BS documentation. This column creates filters for days and times. For example:
{weekday Sat,Sun} would set a condition that the alarm must occur during the days set by the keyword "weekday" (a weekend in this case).
{daytime 17:00-07:00} would set a condition that the alarm would be tripped only if the status change occurred between 5:00 pm and 7:00 am, set using the keyword "daytime".
{daytime 17:00-07:00 or weekday Sat,Sun} illustrates the ability to combine keywords and settings. The keyword "and" is also permitted.
- CHECK:
The syntax for this check is
host.check{condition}
where condition is a boolean expression, e.g.
*.*{$mail == "test"} mail="nobody"
The special functions 'daytime' and 'weekday' can be used
like this:
*.*{daytime 22:00-06:00 or weekday Sat,Sun} postpone=30
*.*{daytime 22:00-06:00} postpone_to=06:00
'check' may be either an asterisk "*" matching any check or
a check as displayed in the columns of the status display.
Whenever a status change is detected, bb_event_generator goes
through the config file and looks for matching patterns. Each
variable associated with the matching patterns is then set as
listed. If multiple patterns match, the associated variables
are set in order.
Interpreted variables are:
- mail - mail addresses where to send alarm (comma separated)
- prio - priority level (0..100)
- repeat - if set bb_event_generator will send the alarm again every
x minutes until the alarm condition has cleared
- repeatprio - the priority level for repeated alarms (see "repeat")
- keep - the duration in minutes for which the event generator
keeps the alarm active after the alarm condition
reports that everything is ok again
- norepeat - the duration in minutes no alarm can be sent for
the same condition
- delay - the duration in minutes between when the alarm
is raised and sent to the user
- check - a boolean expression that is checked during the
'delay' time and forces the alarm to be aborted
if the condition is not met once during this time
- down - (one of "green", "purple", "yellow", "red",
"never") tells the event generator which status
should be interpreted as "down". E.g. "yellow"
means that if a status of "yellow" or below
("red") is detected then the corresponding
service is down.
- up - (like down) tells the event generator which
status should be considered as "up". E.g.
down=yellow up=green means that a service
is considered as down from the time when it
changes to yellow or red to the time when it
goes to "green" again (but not if it's going
to "purple"!)
- maxmsg - a numeric value which is the maximum size of
a message sent in the subject line of the alarm
mail (e.g. if you send it through a pager gateway
...)
- postpone - if set alarms won't be sent for an additional x minutes
and stay in the queue instead. If during the postpone
time period the alarm condition is cleared the alarm
is silently thrown away. Postpone is meant to be used
e.g. during the night when you don't want to get an alarm.
- postpone_to - same as postpone but the value is expected to be a
daytime rather than an interval (e.g. "06:00").
- pager - use alternative pager program (instead of the
default 'log_mail', e.g. 'notify' is a good
choice!)
- trap - if set bb_event_generator will raise an SNMP event
for any alarm/acknowledgement. The contents of trap
is a trap destination composed of a community and
a host of the form community@host. If the community
is missing "public" is assumed.
ROUTER:
PRIO:
a number between 0 (completely unimportant) and 100 (extremely critical)
describing the importance of the alarm. The priority settings can be used in
pre-conditions and are otherwise passed through to the alarming
methods.
E.g. for alarms sent via e-mail the priority will only appear in the
message text and does not have any influence on how the alarm is treated.
DOWN:
is set to a status color or "never". Status colors equal to or below this color are considered a
failure, thus an alarm is raised if a status change occurs from a color above
this color. E.g.
down=yellow
will make Big Sister raise an alarm if a status changes from green to yellow
or green to red but not if a status changes from green to purple.
UP:
works similarly to down but defines when a status is considered to have gone up.
By default up is the next higher color above down. Sometimes it
might be useful to redefine this. Consider
down=yellow up=green
This will raise an alarm on a status change from e.g. green to red. If the status
goes up to purple (i.e. no information) the alarm will not be cleared. It will
only be cleared as soon as we get a green (i.e. everything is ok).
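The interplay of down and up can be sketched in Python as follows. This is a rough illustration only, not Big Sister's implementation. The color order (green above purple above yellow above red) is inferred from the descriptions above; the function names and the handling of "never" and of down=green in default_up are assumptions.

```python
# Colors from best to worst, in the order the text implies
# (purple sits between green and yellow).
ORDER = ["green", "purple", "yellow", "red"]

def is_down(status, down):
    """True if `status` is equal to or below the `down` color."""
    if down == "never":
        return False
    return ORDER.index(status) >= ORDER.index(down)

def default_up(down):
    """Next higher color of `down`; None for "never" (assumption)."""
    if down == "never":
        return None
    # Assumption: for down=green there is no higher color, so stay at green.
    return ORDER[max(ORDER.index(down) - 1, 0)]

def is_up(status, up):
    """True if `status` is equal to or above the `up` color."""
    return ORDER.index(status) <= ORDER.index(up)
```

With down=yellow the default for up would be purple, so a change from red to purple would clear the alarm; with up=green only a green status clears it.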
DELAY:
is a time in minutes. Whenever an alarm is raised it first goes into a pool
of alarms just about to be raised. It stays in this pool for the delay time. If
during this time the alarm condition clears (service up) the alarm is silently
dropped. If an alarm is still pending after the delay time the alarm is finally
sent to the administrator.
REPEAT:
is a time period in minutes. If an alarm stays active for some time every
repeat minutes Big Sister will send a reminder message to the recipient of
the original alarm message. It is suggested to use this feature only for really
important alarms since most administrators will probably just get annoyed
when continuously reminded of the same failures. Note: This is not related
to norepeat in any way.
REPEATPRIO:
is the priority (see above) of reminder messages.
NOREPEAT:
after an alarm is cleared it goes into the pool of old (remembered) alarms
and stays there for the norepeat time period. As long as an alarm is either
pending, active or remembered no new alarm for the same host/status is
raised. The meaning of "norepeat" therefore is: do not send an alarm again
for the same condition for this period. The norepeat period starts when the
alarm is raised. It is therefore possible that the norepeat period
is already over when an alarm gets cleared, in which case the alarm
is immediately thrown out of the pool of remembered alarms.
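Because the norepeat period is counted from the moment the alarm is raised, the time an alarm spends in the pool of remembered alarms depends on how long the problem lasted. The following Python sketch shows the arithmetic only; the function name and the representation of times as plain minute counts are assumptions made for illustration.

```python
def remembered_minutes(raised_at, cleared_at, norepeat):
    """Minutes an alarm stays in the pool of remembered alarms after clearing.

    All arguments are minutes on a common clock (hypothetical representation).
    The norepeat period is counted from the moment the alarm was raised.
    """
    return max(0, raised_at + norepeat - cleared_at)
```

For example, with norepeat=20 an alarm raised at minute 0 and cleared at minute 5 is remembered for another 15 minutes, while one cleared at minute 30 is dropped from the pool immediately.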
KEEP:
is also a time in minutes. After an alarm condition clears the alarm is kept
active for the keep time period. Only after this time will the administrator
get an "alarm cleared" message and the alarm will go into the pool of old
alarms. Do not ask me what this is useful for.
MAIL:
names the recipient (mail address or pager number or whatever depending on
the value of the pager variable) alarms are sent to.
MSGMAX:
PAGER:
tells Big Sister which program it should use for sending out alarms. The
default is "notify" which is a pager program included with Big Sister. It is
not a bad idea to just keep this default.
TRAP:
If trap is set Big Sister will send out an SNMP trap on each alarm raise/clear.
The value of trap is of the form
trap=community@host
You will find the Big Sister MIB (if you do not know what a MIB is you
do not need one) as well as a format file for HP OpenView in the contrib
directory of the source distribution.
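How a trap destination splits into community and host, with "public" as the fallback community, can be sketched in Python like this. The function name and the host names used below are hypothetical, and treating an empty community before the "@" as missing is an assumption.

```python
def parse_trap(value):
    """Split a trap destination of the form community@host.

    Returns (community, host). If no community is given, "public" is used.
    """
    # rpartition yields an empty community when the value contains no "@".
    community, _, host = value.rpartition("@")
    return (community or "public", host)
```

So parse_trap("private@nms1") gives ("private", "nms1"), and parse_trap("nms1") gives ("public", "nms1").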
POSTPONE:
is a time period in minutes. After an alarm becomes active Big Sister
waits for postpone minutes before it really sends out a message. If during
this period the alarm is cleared it is silently dropped without a message. This
is nearly the same as delay.
POSTPONE_TO:
works exactly like postpone, but the value is not a time
period in minutes. Instead it is an absolute time of day, e.g.
postpone_to=06:00
will postpone alarms to 6 am. Note that the time is in 24h notation so 8pm
for instance is 20:00, not 08:00pm.
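Turning a postpone_to value into a concrete point in time can be sketched in Python as follows. This is an illustration under assumptions, not Big Sister's code: the function name is made up, and an alarm arriving exactly at the given time of day is treated here as postponed to the next day.

```python
from datetime import datetime, timedelta

def postpone_until(now, postpone_to):
    """Return the next datetime at which the clock reads `postpone_to` (HH:MM)."""
    hour, minute = (int(part) for part in postpone_to.split(":"))
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # The time of day has already passed today, so use tomorrow.
        target += timedelta(days=1)
    return target
```

An alarm raised at 23:30 with postpone_to=06:00 would thus be held until 06:00 the next morning, and one raised at 02:00 until 06:00 the same day.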
EXAMPLES
Usually you will put a general rule with a pattern matching any
host/check and the default variable values as your first rule, e.g.:
# default values
*.* mail=alarm prio=50 norepeat=20 down=yellow up=green maxmsg=60
If you do not want to get an alarm about e.g. smtp being down when you
already know that the connection to the host is down then you could
use the following rule for instance:
*.smtp delay=5 check="$host.conn"
(semantics: if the "conn" goes down within 5 minutes after smtp down is
detected then throw away the smtp alarm, otherwise send it
after 5 minutes)
If your very important machines are in a group called "IMPORTANT" then
you may wish to do something like:
@IMPORTANT.* prio=100 repeat=30 repeatprio=60
(semantics: if a service of a machine in the group IMPORTANT goes down
then send an alarm with priority 100 and send a reminder
with priority 60 every 30 minutes ("yell for help"))
If the machines in a group EAST are all located in a network connected
to router "router-east" then you may get plenty of alarms when
"router-east" goes down since any machine behind it is unreachable. You can
avoid this by e.g.:
@EAST.conn check=router-east.conn delay=5
router-east.conn check="1" delay=0
(semantics: if a host is in group EAST and the connection to it goes
down wait for five minutes and if within these five minutes
the connection to router-east is lost too then do not
send an alarm for this host. If the host is the router
itself send an alarm immediately)
or
@EAST.* router=router-east
*.conn check="($router.conn) or not $router" delay=5
router-east.conn check="1" delay=0
(semantics: if a host is in group EAST set the variable "router" to
"router-east". If the connection to any host is going
down then wait for five minutes and check if either there
is no router configured for this machine or the connection
to the router goes down as well. Discard the alarm if
the router goes down, unless the machine
is the router itself)
NOTE: you cannot use variables in patterns, so e.g. the example above
cannot (yet) be written as:
@EAST.* router=router-east
$router.conn check=1 delay=0
Postpone is used during times when system failures are less important,
e.g. at night. You can postpone alarms for a time interval:
*.*{daytime 22:00-06:00} postpone=60
This tells the event generator to keep a newly raised alarm in the postpone
queue for 1h before sending an alarm mail. If during this time the alarm
condition clears no alarm is sent at all. If you never want to be woken
up by alarms, then
*.*{daytime 22:00-06:00} postpone_to=06:00
might be what you want (semantics: when an alarm is detected during the night,
send it at 06:00)
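The daytime windows used in these pre-conditions wrap around midnight (22:00-06:00 starts in the evening and ends the next morning). The following Python sketch shows one way such a window and a weekday list could be evaluated. It is not taken from Big Sister: the function names are hypothetical, treating the end time as exclusive is an assumption, and so are the day abbreviations other than Sat and Sun.

```python
from datetime import date, time

# Assumed three-letter day names, Monday first (matches date.weekday()).
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def _parse(hhmm):
    """Turn "HH:MM" into a datetime.time."""
    hour, minute = (int(part) for part in hhmm.split(":"))
    return time(hour, minute)

def in_daytime(window, t):
    """True if time `t` lies inside a window such as "22:00-06:00"."""
    start, end = (_parse(part) for part in window.split("-"))
    if start <= end:
        # Ordinary window within a single day.
        return start <= t < end
    # Window wraps around midnight: late evening or early morning.
    return t >= start or t < end

def on_weekday(names, d):
    """True if date `d` falls on one of the days in e.g. "Sat,Sun"."""
    return DAYS[d.weekday()] in names.split(",")
```

A pre-condition such as {daytime 22:00-06:00 or weekday Sat,Sun} would then correspond to in_daytime("22:00-06:00", t) or on_weekday("Sat,Sun", d).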