Problem Management

Bad things happen. It can't be helped. Pipes can leak, and computer hardware and software can fail. Problem management is all about responding rapidly to fix these problems. In some cases, you might be able to fix a problem before anyone else knows.

Detecting Problems

You can't fix a problem you don't know about, and the faster you know about it the sooner you can fix it. You will hear about problems from users, but by the time this happens things can be very bad. Users provide information about the symptoms of a problem, not the problem's root cause. Monitoring software is the key to early problem detection. It provides the information you need to get at the problem's cause.

Sources of Problem Information

How can a monitoring program detect a problem? There are three main ways:

Systems can generate notifications about problems, or potential problems. Systems often do this using a standards-based protocol, such as SNMP. A monitoring program can "listen" for notifications from such systems and inform you about a problem in several different ways.
Problems can also be detected by monitoring the values of performance counters that are provided by the operating system or other software. In this case, you set a threshold for each counter, and the monitoring program informs you when the value crosses it. As an example, you might want to know when processor utilization is greater than 90%.
A monitoring program can also detect problems by sending transactions to a system to make sure that it is responding. The response time can also be measured and reported if it is excessive. A similar approach can be used to test the network, by sending ping requests to servers.
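The second and third approaches can be sketched in a few lines of Python. This is only an illustration, not how any particular monitoring product works: the function names exceeds_threshold and timed_probe, the probe callable, and the injectable clock parameter are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


def exceeds_threshold(value: float, limit: float) -> bool:
    """Return True when a counter reading is above its threshold."""
    return value > limit


def timed_probe(
    probe: Callable[[], bool],
    max_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Run one test transaction; return a problem description, or None if healthy.

    `probe` is a hypothetical callable that sends a transaction and returns
    True on success. `clock` is injectable so the timing logic can be tested.
    """
    start = clock()
    try:
        ok = probe()
    except Exception as exc:  # the system did not respond at all
        return f"probe failed: {exc}"
    elapsed = clock() - start
    if not ok:
        return "probe returned an error"
    if elapsed > max_seconds:
        return f"slow response: {elapsed:.2f}s (limit {max_seconds:.2f}s)"
    return None
```

For example, exceeds_threshold(95.0, 90.0) flags the 90% processor utilization case, and timed_probe reports a problem both when the transaction fails and when it succeeds but takes longer than the limit.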

Problems With Problem Management

Monitoring for problems is simple in principle, but more difficult in practice. One big problem is too much data. In a large network, a single failure can generate hundreds of notifications, which makes it difficult to determine the real cause.
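One common way to cut through a flood of notifications is to use knowledge of how devices depend on each other. The sketch below assumes a hypothetical upstream map (each device mapped to the device it is reached through) and keeps only the failing devices that have no failing ancestor. The function name, the map, and the device names are invented for illustration.

```python
from typing import Dict, Iterable, List, Optional


def likely_root_causes(
    failing: Iterable[str],
    upstream: Dict[str, Optional[str]],
) -> List[str]:
    """Return failing devices that have no failing device upstream of them."""
    failing_set = set(failing)
    roots = []
    for device in sorted(failing_set):
        parent = upstream.get(device)
        seen = set()  # guards against a cycle in the map
        is_symptom = False
        while parent is not None and parent not in seen:
            if parent in failing_set:
                is_symptom = True  # an ancestor is down, so this is a side effect
                break
            seen.add(parent)
            parent = upstream.get(parent)
        if not is_symptom:
            roots.append(device)
    return roots


# Hypothetical topology: two web servers behind a switch behind a router.
UPSTREAM = {
    "router1": None,
    "switch1": "router1",
    "web1": "switch1",
    "web2": "switch1",
    "db1": "router1",
}
```

With this map, notifications from web1, web2, and switch1 reduce to the single candidate switch1, which is where you would start looking.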

Setting values for thresholds presents a similar problem. If you set the threshold too low, you will get many notifications when there really isn't a problem (a false positive). Set it too high, and you are likely to get no notification when there is a problem (a false negative).
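The trade-off can be made concrete by counting both kinds of error for a candidate threshold. The readings below are invented for illustration (each pairs a processor utilization value with whether a real problem existed at the time), and count_errors is a hypothetical helper, not part of any monitoring tool.

```python
from typing import List, Tuple


def count_errors(
    samples: List[Tuple[float, bool]],
    threshold: float,
) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) for one threshold.

    Each sample is (counter value, whether there was a real problem).
    """
    false_pos = sum(1 for value, real in samples if value > threshold and not real)
    false_neg = sum(1 for value, real in samples if value <= threshold and real)
    return false_pos, false_neg


# Made-up readings: (utilization %, real problem?)
SAMPLES = [(55.0, False), (72.0, False), (88.0, False), (91.0, True), (97.0, True)]

for limit in (50.0, 90.0, 95.0):
    print(limit, count_errors(SAMPLES, limit))
```

On this sample data, a threshold of 50 produces three false positives, a threshold of 95 misses one real problem, and 90 produces neither. Real data is rarely this clean, which is why tuning takes time.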

Monitoring software can help by providing ways to deal with these problems. But configuring the monitoring software correctly is likely to be a difficult and time-consuming activity.

Other Aspects of Performance Management

Service Level Management

Capacity Management

Performance Engineering
