In order to maintain a reliable IT environment, every enterprise needs to set up an effective monitoring regime.

A common mistake by new monitoring administrators is to alert on everything. This is an ineffective strategy for several reasons. For starters, it may result in higher telecom charges for passing large numbers of alerts. Passing tons of irrelevant alerts will impact team morale. And, no matter how dedicated your team is, you are guaranteed to reach a state where alerts will start being ignored because "they're all garbage anyway."

For example, it is common for non-technical managers to want to send alerts to the systems team when system CPU hits 100%. But, from a technical perspective, this is absurd:

You are paying for a certain system capacity. Some applications (especially ones with extensive calculations) will use the full capacity of the system. This is a GOOD thing, since it means the calculations will be done sooner.

What is it you are asking the alert recipient to do? Restart the system? Kill the processes that are keeping the system busy? If there is nothing for the systems staff to do in the immediate term, it should be reported in a summary report, not alerted.

If there is an indication (beyond a busy CPU) that there is a runaway process of some sort, the alert needs to go to the team that would make that determination and take necessary action.

In order to be effective, a monitoring strategy needs to be thought out. You may end up monitoring a lot of things just to establish baselines or to view growth over time. Some things you monitor will need to be checked out right away. It is important to know which is which.

Historical information should be logged and retained for examination on an as-needed basis. It is wise to set up automated regular reports (distributed via email or web) to keep an eye on historical system trends, but there is no reason to send alerts on this sort of information.

Availability information should be characterized and handled in an appropriate way, probably through a tiered system of notifications. Depending on the urgency, it may show up on a monitoring console, be rolled up in a daily summary report, or be paged out to the on-call person. Some common types of information in this category include:

"Unusual" log messages. Defining what is "unusual" usually takes some time spent tuning whatever reporting system is being used. Some common tools include logwatch, swatch, and logcheck. Even though it takes time, your team will need to customize this list on their own systems.

Hardware faults. Depending on the hardware and software involved, the vendor will have provided monitoring hooks to allow you to identify when hardware is failing.

Availability failures. This includes things like ping monitoring or other types of connection monitoring that give a warning when a needed resource is no longer available.

Danger signs. Typically, this will include anything that your team has identified that indicates that the system is entering a danger zone. This may mean certain types of performance characteristics, or it may mean certain types of system behavior.

Alerting Strategy

Alerts can come in different shapes, depending on the requirements of the environment. It is very common for alerts to be configured to be sent to a paging queue, which may include escalations beyond a single on-call person.

(If possible, configure escalations into your alerting system, so that you are not dependent on a single person's cell phone for the availability of your entire enterprise. A typical escalation procedure would be for an unacknowledged alert to be sent up a defined chain of escalation. For example, if the on-call person does not respond in 15 minutes, an alert may go to the entire group. If the alert is not acknowledged 15 minutes after that, the alert may go to the manager.)
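
The escalation example above can be sketched as a small function. This is only an illustration, not a feature of any particular paging product: the tier names, the escalation_targets function, and the fixed 15-minute step are assumptions taken from the example in the text.

```python
from datetime import timedelta

# Hypothetical escalation chain, in order. The 15-minute step mirrors the
# example in the text; adjust both to match your own on-call policy.
ESCALATION_CHAIN = ["on-call", "group", "manager"]
ESCALATION_STEP = timedelta(minutes=15)


def escalation_targets(unacknowledged_for,
                       chain=ESCALATION_CHAIN,
                       step=ESCALATION_STEP):
    """Return every tier that should have been notified by now.

    unacknowledged_for is how long the alert has gone without an
    acknowledgement. The first tier is notified immediately, and one more
    tier is added each time a full step elapses.
    """
    if unacknowledged_for < timedelta(0):
        raise ValueError("unacknowledged_for must not be negative")
    # timedelta // timedelta yields the whole number of steps elapsed.
    tiers_reached = 1 + unacknowledged_for // step
    return chain[:min(tiers_reached, len(chain))]
```

Called with timedelta(minutes=20), the function returns the on-call person and the group. Once the chain is exhausted it keeps returning the full list, so the manager remains the last stop.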

In some environments, alerts are handled by a round-the-clock team that is sometimes called the Network Operations Center (NOC). The NOC will coordinate response to the issue, including an evaluation of the alert and any necessary escalations.

Before an alert is configured, the monitoring group should first make sure that the alert meets three important criteria. The alert should be:

1. Important. If the issue being reported does not have an immediate impact, it should be included in a summary report, not alerted. Prioritize monitoring, alerting, and response by the level of risk to the organization.

2. Urgent. If the issue does not need to have action taken right away, report it as part of a summary report.

3. Actionable. If no action can be taken by the person who receives the alert, the alert should be redefined so that it goes to the right person. (Or perhaps the issue should be reported in a summary report rather than sent through the alerting system.)

Solaris Monitoring Suggestions

Here are some monitoring guidelines I've implemented in some places where I have worked. You don't have to alert on a ton of different things in order to have a robust monitoring solution. Just these few items may be enough:

Ping up/down monitoring. You really can't beat it for a quick reassurance that a given IP address is responding.

Uptime monitoring. What happens if the system rebooted in between monitoring intervals? If you make sure that the uptime command is reporting a time larger than the interval between monitoring sweeps, you can keep an eye on sudden, unexpected reboots.

Scan rate > 0 for 3 consecutive monitoring intervals. This is the best measure of memory exhaustion on a Solaris box.

Run queue > 2x the number of processors for 3 consecutive monitoring intervals. This is a good measure of CPU exhaustion.

Service time (avserv in Solaris, svctm in Linux) > 20 ms for disk devices with more than 100 (r+w)/s, including NFS disk devices. This is a measure of I/O channel exhaustion. 20 ms is a very long time, so you will also want to keep an eye on trends on regular summary reports of sar -d data.

System CPU utilization > user CPU utilization where idle < 40% for systems that are not serving NFS. This is a good indication of system thrashing behavior.
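
The threshold rules above could be evaluated along the following lines. This is a rough sketch rather than a finished monitor: the Sample fields and all function names are hypothetical, and the readings are assumed to come from whatever collector you already use (for example, values parsed from vmstat or sar output).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Sample:
    """One monitoring interval's readings (field names are hypothetical)."""
    scan_rate: float   # pages scanned per second
    run_queue: float   # runnable processes waiting for a CPU
    cpu_user: float    # percent of CPU time in user mode
    cpu_system: float  # percent of CPU time in system mode
    cpu_idle: float    # percent of CPU time idle


def sustained(samples: Sequence[Sample],
              predicate: Callable[[Sample], bool],
              intervals: int = 3) -> bool:
    """True if predicate holds for each of the last `intervals` samples."""
    recent = samples[-intervals:]
    return len(recent) == intervals and all(predicate(s) for s in recent)


def memory_exhausted(samples: Sequence[Sample]) -> bool:
    # Scan rate > 0 for 3 consecutive intervals.
    return sustained(samples, lambda s: s.scan_rate > 0)


def cpu_exhausted(samples: Sequence[Sample], processors: int) -> bool:
    # Run queue > 2x the number of processors for 3 consecutive intervals.
    return sustained(samples, lambda s: s.run_queue > 2 * processors)


def disk_overloaded(service_time_ms: float,
                    reads_per_s: float,
                    writes_per_s: float) -> bool:
    # Service time > 20 ms, counted only for devices doing > 100 (r+w)/s.
    return reads_per_s + writes_per_s > 100 and service_time_ms > 20


def thrashing(sample: Sample, serves_nfs: bool) -> bool:
    # System CPU > user CPU while idle < 40%, skipped on NFS servers.
    return (not serves_nfs
            and sample.cpu_idle < 40
            and sample.cpu_system > sample.cpu_user)


def rebooted_since_last_sweep(uptime_seconds: float,
                              sweep_interval_seconds: float) -> bool:
    # An uptime shorter than the sweep interval means a reboot went unseen.
    return uptime_seconds < sweep_interval_seconds
```

Keeping the "3 consecutive intervals" logic in one helper means a single busy sample never raises an alert on its own, which is the point of the rule.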