Use and configure SLA/Availability management

The following instructions show you how you can use the Big Sister log files to work out how long a certain network device has been up or down during a certain time range. The result (usually a value like m.n%) can be used to evaluate whether a certain device or service meets your expectations. For example, it is quite common to negotiate a guaranteed availability (in %) of a leased line. This is called an SLA (Service Level Agreement). Usually carriers give discounts or even pay penalties if they could not provide the negotiated availability.
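To give a feeling for the numbers: a guaranteed availability of 99.9% over a 30-day month allows for at most 0.001 x 30 x 24 x 60 = 43.2 minutes of accumulated downtime.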

There are two approaches to running the reporting module. The first is based on two fundamental commands called report_read and report_consolidate; the second, the report_day wrapper, is described further below.

report_read reads in a specified file with status information or dependency rules for one specific day and stores the results in var/reportdb/day-yyyy-mm-dd.statuslog. Multiple runs of report_read – most probably you will at least read in display.history and some dependency definitions – may incrementally update the same file in var/reportdb. One of the functions of report_read is the "Cumulator", which builds var/reportdb/*cumu* files out of the statuslog files by applying service hours / holidays definitions to the status information. Usually you will have multiple cumu files, since you are interested in statistics for multiple service levels. The files generated by report_read are actually of minor interest to you – they just serve as a kind of cache in order to reduce the time spent in daily statistics operations. Another effect of this caching is that you do not need to store status information (display history) for the whole time period you are interested in.

report_consolidate then goes through all the cumu files in a specified time period, sums up the time spent in specific statuses and creates report files

var/reportdb/day-yyyy-mm-dd.statistics.cumuclass.name

for each cumulated file (generated by report_read) and each defined time period. These files are what you actually want to get.
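To summarize the data flow, var/reportdb will end up containing files along the following lines. The statuslog and statistics names follow the patterns above; the exact cumu file name shown here is an assumption – the only guarantee is that it matches *cumu*:

day-2001-06-28.statuslog                     written by report_read
day-2001-06-28.cumu.level1                   written by the Cumulator (matches *cumu*)
day-2001-06-28.statistics.level1.oneweek     written by report_consolidate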

Setting up Availability management

The theory sounds rather complex, doesn't it? Let's go ahead to the real world then. In order to simplify the use of the reporting module an additional command, report_day, has been added. Usually you will just forget about report_read and report_consolidate and use report_day, which runs a suitable mixture of report_read and report_consolidate itself. As its name implies, report_day is meant to be run on a daily basis. It will build statistics based on the following files:

Table 3.3. Files used for Availability management

File                      Purpose
display.history.*         Big Sister status history
reporting/servicehours    defines when systems are expected to be working
reporting/holidays        defines holidays (i.e. time periods when systems are not expected to be in service)
reporting/cumulators      defines which service levels should be reported (linking servicehours and holidays to status information)
reporting/override        stores manually maintained status information overriding display.history
reporting/dependencies    tells which services should be watched and how they depend on status information in display.history
reporting/statistics      defines which time periods should be consolidated and what exactly we would like to see in the resulting report

It is best to start with the simple things: servicehours and holidays. These files are rather self-explanatory – in servicehours you can define "in service" hours for each day of the week, while in holidays you can exclude whole days from service time. Note that the first column in each of these files contains a "class" specifier. This allows you to define multiple levels of service – e.g. some systems might be expected to run 24 hours a day, 7 days a week, while others are only expected to run from 8:00 till 17:00, Monday through Friday. Define classes for all these service levels.
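A sketch of what such definitions could look like – apart from the leading class column, the exact syntax shown here is an assumption, so consult the sample files shipped with Big Sister for the real format:

servicehours (classes allday and office):
allday  Mon-Sun 00:00-24:00
office  Mon-Fri 08:00-17:00

holidays (a shared class common):
common  25.12.01
common  26.12.01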

Servicehours and holidays will not be effective on their own. The rules actually linking service levels with status information are listed in the cumulators file. In this file you actually define your service levels based on servicehours and holidays.

Each rule looks like

levelname = service:class > holidays:class

Do not worry if you do not yet understand exactly what ">" means. "levelname" is a symbolic name you can choose freely – at the very end you will get report files carrying "levelname" in their names.
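For example, the service level level1 that shows up in the statistics file names later in this section could be defined like this, reusing the hypothetical allday, office and common classes from the sketch above:

level1     = service:allday > holidays:common
officetime = service:office > holidays:common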

It is time now for the more complex things: dependencies. In the real world you are usually not interested in seeing statistics for simple things like myserver.conn or myserver.smtp – in the SMTP case, for instance, you probably run multiple redundant mail servers to increase the overall mail service availability. So the mail service is available if any of your mail servers is up and running. In the dependencies file you tell the reporting module which dependencies apply to your systems. The above rule would maybe look like:

mailservice = history:myserver1.smtp | history:myserver2.smtp

telling the reporting module that mailservice is up if at least one of myserver1.smtp and myserver2.smtp is up. Note the leading "history:". Every information holder's name in the reporting tool is preceded by a prefix, e.g.:

history: – display.history information

service: – servicehours information

comp: – dependencies

holidays: – holidays information

override: – override information

Let's assume that the mail servers above depend on DNS to work correctly. So you define a DNS rule (let's assume you run two redundant DNS servers dns1 and dns2):

dns = history:dns1.dns | history:dns2.dns

Now you can use the result of the dns rule to make your mailservice rule more realistic:

mailservice = comp:dns & \
(history:myserver1.smtp | history:myserver2.smtp)

Do not be fooled by the fact that the rule is now put on two lines – this is just for readability (the "\" at the end of line one tells the reporting module that the rule will be continued on the next line). Note that "dns" is in fact "comp:dns" – dependencies get automatically prefixed with "comp:".

Note that the resulting statistics will only contain services defined via the dependencies mechanism. Also, dependencies with names starting with "_" do not appear in the final output file – you can use such names as "internal variables".
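For instance, the DNS rule above could be turned into an internal variable, so that it feeds into mailservice without producing a report line of its own (a sketch reusing the hosts from the examples above):

_dns = history:dns1.dns | history:dns2.dns
mailservice = comp:_dns & \
(history:myserver1.smtp | history:myserver2.smtp)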

Of course, sometimes your monitor will fail and report nonsense (e.g. because your agent or server got cut off). In this case you need a means of manually correcting such mistakes. This is done via the "override" mechanism. E.g. add the following line to the override file:

mailservice,28.6.01 12:00,29.6.01 09:00,green,I know it worked

saying that mailservice was completely ok from June 28 12:00 till June 29 09:00. This line on its own does not yet change the statistics result. You have to set up your dependencies accordingly. The full mailservice rule should then look something like

mailservice = comp:dns & \
(history:myserver1.smtp | history:myserver2.smtp) \
> override:mailservice

" (override)">
The ">" (override) operator tells the reporting module to prefer the expression on the right side of ">" to the expression on the left side for the whole time period(s) the expression on the right side is defined. In other words: whenever override:mailservice is defined, the whole expression based on history status is ignored and overridden by override:mailservice – which is probably what you guessed long before.

There is only one thing left to set up before we can get the first statistics: the statistics file. In this file you specify what "things" should be reported and which time periods the statistics should cover. For now we can go along with the default file. It reports:

Table 3.4. Output of the default statistics file

Column           Meaning
Down             time (secs) a service was red
N/A              time (secs) a service was purple
Up               time (secs) a service was yellow or green
Planned Outage   time (secs) a service was white
Service Hours    time (secs) a service should have been in service (according to the cumulators file)
Down%            Down / Service Hours
Up%              Up / Service Hours
Availability     the service's availability
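A quick worked example (numbers invented for illustration): on a 24x7 service level a single day contributes 86400 seconds of service hours. If a service was red for 3600 of those seconds and green otherwise, then Down% = 3600/86400 = ca. 4.2% and Up% = 82800/86400 = ca. 95.8%.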

Note again that only dependencies (names starting with comp:) will appear in the final output. So as long as you do not define dependencies you will get no results. A dependency can of course be as simple as

comp:myrouter = history:myrouter.conn

Running the reporting tool

The easiest way to run the reporting tool is via the report_day wrapper, e.g.:

 bin/report_day

report_day is meant to be run every day and will by default read in status history / servicehours / holidays for the last two days and dependencies / cumulators / overrides for the last 10 days.
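A typical way to run it daily is via cron; a minimal sketch, assuming Big Sister is installed under /usr/local/bigsister (adjust the path to your installation):

# crontab entry: run the daily reporting step shortly after midnight
5 0 * * * /usr/local/bigsister/bin/report_day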

When running bin/report_day for the first time you might want to compute statistics for a longer time back; do it with e.g.:

bin/report_day -h 30 -c 30

-h: read history / servicehours / holidays 30 days back; -c: read the rest of the files 30 days back as well

You will see that this will take some time because report_day will compute each individual day separately. This is very inefficient for long time periods since the whole reporting system is optimized for daily (incremental) use.

The results

You will get a bunch of files in var/reportdb. The most interesting files are day*.statistics.class.period. E.g. a file

day-2001-06-28.statistics.level1.oneweek

contains the statistics for the time from June 22 to June 28, 2001 and the service level level1.

Note that although you defined oneweek to consolidate status over 7 days, the reporting tool will create a statistics file for every day, each containing the last 7 days. This is intentional. You are probably only interested in one of these files per week, though.
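For example, to pick out just the weekly report ending today, a small shell sketch (run from the Big Sister base directory; level1 and oneweek are the example names used above):

cat var/reportdb/day-$(date +%Y-%m-%d).statistics.level1.oneweek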

Using savelogs/archivelogs and availability management

The reporting tool is compatible with savelogs/archivelogs (see Using the bsadmin tool and The reference section of the savelogs command). Actually, the display.history reader will not only read in display.history but also display.history.* files. If you archive your logs outside of var/display.history.* this will even drastically improve the efficiency of the daily incremental report generation steps, since it is actually sufficient to have display history files available for only 3 days back!

Some of the file formats look rather odd. They are actually CSV (well, more or less – do not try to use commas in cells!), which is why they look unfriendly to a vi user. The file formats were defined with the idea in mind that someone might use a spreadsheet or database application for editing or importing/exporting the files. CSV is one of the formats most applications understand.