
Effective Solaris System Monitoring

In order to maintain a reliable IT environment, every enterprise needs to set up an effective monitoring regime.
A common mistake by new monitoring administrators is to alert on everything. This is an ineffective strategy for several reasons. For starters, it may result in higher telecom charges for passing large numbers of alerts. A flood of irrelevant alerts also wears down team morale. And, no matter how dedicated your team is, you are guaranteed to reach a state where alerts start being ignored because "they're all garbage anyway."
For example, it is common for non-technical managers to want to send alerts to t
he systems team when system CPU hits 100%. But, from a technical perspective, th
is is absurd:
You are paying for a certain system capacity. Some applications (especially ones
with extensive calculations) will use the full capacity of the system. This is a
GOOD thing, since it means the calculations will be done sooner.
What is it you are asking the alert recipient to do? Restart the system? Kill the processes that are keeping the system busy? If there is nothing for the systems staff to do in the immediate term, the issue should be reported in a summary report, not alerted.
If there is an indication (beyond a busy CPU) that there is a runaway process of
some sort, the alert needs to go to the team that would make that determination
and take necessary action.
In order to be effective, a monitoring strategy needs to be thought out. You may
end up monitoring a lot of things just to establish baselines or to view growth
over time. Some things you monitor will need to be checked out right away. It i
s important to know which is which.
Historical information should be logged and retained for examination on an as-ne
eded basis. It is wise to set up automated regular reports (distributed via emai
l or web) to keep an eye on historical system trends, but there is no reason to
send alerts on this sort of information.
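As a sketch of this kind of automated trend report, a cron-driven script along the following lines could mail yesterday's sar summaries to the team. The recipient address, report contents, and data file path are assumptions to adapt locally:

```shell
# Hypothetical daily trend report for Solaris: mail yesterday's sar CPU
# and disk summaries to the systems team. Address and paths are
# assumptions; adjust for your site.
daily_sar_report() {
  day=$(TZ=GMT+24 date +%d)          # day-of-month of the previous day
  sarlog="/var/adm/sa/sa${day}"      # default sar data file location
  {
    echo "CPU utilization summary:"
    sar -u -f "$sarlog"
    echo
    echo "Disk activity summary:"
    sar -d -f "$sarlog"
  } | mailx -s "Daily sar trend report: $(hostname)" sysadmins@example.com
}
# Scheduled from cron, e.g.:  0 6 * * * /usr/local/bin/daily_sar_report.sh
```

Run daily from cron, this keeps trend data in front of the team without generating a single alert.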
Availability information should be characterized and handled in an appropriate w
ay, probably through a tiered system of notifications. Depending on the urgency,
it may show up on a monitoring console, be rolled up in a daily summary report,
or paged out to the on-call person. Some common types of information in this ca
tegory include:
"Unusual" log messages. Defining what counts as "unusual" usually takes some time spent tuning whatever reporting system is being used. Some common tools include logwatch, swatch, and logcheck. Even though it takes time, your team will need to customize this list for their own systems.
Hardware faults. Depending on the hardware and software involved, the vendor will
have provided monitoring hooks to allow you to identify when hardware is failin
g.
Availability failures. This includes things like ping monitoring or other types o
f connection monitoring that give a warning when a needed resource is no longer
available.
Danger signs. Typically, this will include anything that your team has identified
that indicates that the system is entering a danger zone. This may mean certain
types of performance characteristics, or it may mean certain types of system be
havior.
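To make the "unusual log messages" idea concrete, here is a minimal filter in the spirit of logcheck: everything not matched by a site-specific ignore list gets reported. The ignore patterns below are illustrative assumptions; a real list has to be built and tuned per system over time.

```shell
# Minimal "unusual message" filter: suppress known-benign log lines and
# report whatever remains. The ignore patterns are examples only.
unusual_messages() {
  egrep -v 'sshd.*session (opened|closed)|sendmail.*stat=Sent|CRON'
}
# Usage (Solaris system log):
#   unusual_messages < /var/adm/messages
```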
Alerting Strategy
Alerts can come in different shapes, depending on the requirements of the enviro
nment. It is very common for alerts to be configured to be sent to a paging queu
e, which may include escalations beyond a single on-call person.
(If possible, configure escalations into your alerting system, so that you are n
ot dependent on a single person's cell phone for the availability of your entire
enterprise. A typical escalation procedure would be for an unacknowledged alert
to be sent up a defined chain of escalation. For example, if the on-call person does not respond in 15 minutes, an alert may go to the entire group. If the alert
is not acknowledged 15 minutes after that, the alert may go to the manager.)
In some environments, alerts are handled by a round-the-clock team that is somet
imes called the Network Operations Center (NOC). The NOC will coordinate respons
e to the issue, including an evaluation of the alert and any necessary escalatio
ns.
Before an alert is configured, the monitoring group should first make sure that
the alert meets three important criteria. The alert should be:
1. Important. If the issue being reported does not have an immediate impact, it should be included in a summary report, not alerted. Prioritize monitoring, alerting, and response by the level of risk to the organization.
2. Urgent. If the issue does not need to have action taken right away, report it as part of a summary report.
3. Actionable. If no action can be taken by the person who receives the alert, the alert should be defined to go to the right person instead. (Or perhaps the issue should be reported in a summary report rather than sent through the alerting system.)
Solaris Monitoring Suggestions
Here are some monitoring guidelines I've implemented in some places where I have
worked. You don't have to alert on a ton of different things in order to have a
robust monitoring solution. Just these few items may be enough:
Ping up/down monitoring. You really can't beat it for a quick reassurance that a
given IP address is responding.
Uptime monitoring. What happens if the system rebooted in between monitoring inte
rvals? If you make sure that the uptime command is reporting a time larger than
the interval between monitoring sweeps, you can keep an eye on sudden, unexpecte
d reboots.
Scan rate > 0 for 3 consecutive monitoring intervals. This is the best measure of
memory exhaustion on a Solaris box.
Run queue > 2x the number of processors for 3 consecutive monitoring intervals. T
his is a good measure of CPU exhaustion.
Service time (avserv in Solaris, svctm in Linux) > 20 ms for disk devices with more than 100 (r+w)/s, including NFS disk devices. This is a measure of I/O channel exhaustion. 20 ms is a very long time, so you will also want to keep an eye on trends in regular summary reports of sar -d data.
System CPU utilization > user CPU utilization where idle < 40%, for systems that are not serving NFS. This is a good indication of system thrashing behavior.
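As a rough sketch, the vmstat-based thresholds above (scan rate, run queue, and system-versus-user CPU) can be checked with a small script. The field positions assume the default Solaris vmstat layout (run queue r in field 1, scan rate sr in field 12, us/sy/id as the last three fields); the CPU count would normally come from psrinfo.

```shell
# Hedged sketch: apply the scan-rate, run-queue, and CPU thresholds to
# Solaris vmstat output. Field positions assume the default layout;
# adjust field 12 (sr) if your disk column count differs.
check_vmstat() {
  ncpu=${1:-4}   # processor count, e.g. from: psrinfo | wc -l
  awk -v ncpu="$ncpu" '
    NR > 2 {                                  # skip the two header lines
      if ($12 > 0)        scan++; else scan = 0
      if ($1 > 2 * ncpu)  runq++; else runq = 0
      usr = $(NF-2); sys = $(NF-1); idle = $NF
      if (scan >= 3) print "ALERT: scan rate > 0 for 3 intervals"
      if (runq >= 3) print "ALERT: run queue > 2x CPUs for 3 intervals"
      if (sys > usr && idle < 40)
        print "WARN: system CPU > user CPU with idle < 40%"
    }'
}
# Usage: vmstat 30 10 | check_vmstat "$(psrinfo | wc -l)"
```

The same pattern extends to the sar -d service-time check, with the thresholds adjusted for your device classes.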
