Statistical Errors

When making a decision using statistics, two types of errors can arise. A Type I error (also called a false positive) occurs when you reject a null hypothesis that is actually true. A Type II error (a false negative) occurs when you do not reject a null hypothesis that is actually false.

In the context of performance monitoring, if we take a sample and decide there is a problem when there isn't, we have a false positive (Type I error). If we decide there isn't a problem when there actually is one, we have a false negative (Type II error). Too many errors of either type can make a monitoring system useless for detecting real problems.
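As a rough illustration of the two error types, the sketch below compares a monitor's decisions against what was really happening. The function name and the sample data are hypothetical and are not part of ArteMon.

```python
def count_errors(alerts, problems):
    """Count Type I and Type II errors for paired alert / ground-truth flags."""
    # Type I (false positive): an alert was raised but there was no problem.
    false_positives = sum(1 for a, p in zip(alerts, problems) if a and not p)
    # Type II (false negative): there was a problem but no alert was raised.
    false_negatives = sum(1 for a, p in zip(alerts, problems) if p and not a)
    return false_positives, false_negatives


# Hypothetical data: what the monitor decided vs. what was actually true.
alerts = [True, False, True, False, False]
problems = [True, False, False, True, False]

false_positives, false_negatives = count_errors(alerts, problems)
print(false_positives, false_negatives)
```

In this made-up run, the third sample is a false positive and the fourth is a false negative.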

For more information, see the Wikipedia article on Type I and Type II errors.

Moving Averages

Computer systems often exhibit "bursty" behavior. Choosing a threshold that represents a problem or potential problem for such data is difficult, and sometimes impossible. If you set it too low, you will get many false positives. Set it too high, and you will get many false negatives. Processor utilization is a good example. High CPU utilization for a short period of time isn't much of a problem, but sustained high utilization is.

Moving averages are useful for dealing with bursty behavior. If you have ever looked at stock charts that include moving averages, you have seen how they work. They smooth out the jagged patterns in the data to make the underlying trend easier to see.
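A minimal sketch of a simple moving average, assuming a fixed window of the most recent samples. The function name, the window size, and the CPU figures are illustrative only.

```python
from collections import deque


def simple_moving_average(values, window):
    """Yield the mean of the most recent `window` values after each sample."""
    recent = deque(maxlen=window)  # old samples drop off automatically
    for value in values:
        recent.append(value)
        # Until the window fills, average over the samples seen so far.
        yield sum(recent) / len(recent)


# Hypothetical CPU utilization samples with short bursts.
cpu = [10, 95, 12, 8, 90, 11]
smoothed = list(simple_moving_average(cpu, window=3))
print(smoothed)
```

The bursts to 90 and 95 are pulled well down in the smoothed series, so a threshold applied to it reacts to sustained load rather than to single spikes.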

For more information, see the Wikipedia article on Moving average.

Rule Based Notifications

Some types of problems and potential problems are easy to detect from the values of sampled data. Consider, for example, available disk space. When the percentage of available disk space is very low, there is a good chance you may run out. It would be great to know about these problems early so that you can fix them before they become serious.

ArteMon's rule based notifications let you do exactly that. After each sample is taken, ArteMon determines the highest severity level threshold, if any, crossed. If a notification hasn't been sent for that level, then one is sent. As things change, the severity level of the incident changes based on the current sampled value. If the value falls low enough, the incident is cleared.
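The behavior described above might be sketched roughly as follows. This is not ArteMon's implementation: the severity names, the class, and the choice of "percentage of disk used" (so that higher values are worse) are all assumptions made for illustration.

```python
# Hypothetical severity levels, ordered from least to most severe.
SEVERITIES = ["warning", "minor", "major", "critical"]


def highest_severity(value, thresholds):
    """Return the most severe level whose threshold `value` reaches, or None."""
    crossed = None
    for level in SEVERITIES:
        limit = thresholds.get(level)
        if limit is not None and value >= limit:
            crossed = level
    return crossed


class Incident:
    """Tracks which severity levels have already produced a notification."""

    def __init__(self, thresholds):
        self.thresholds = thresholds
        self.notified = set()
        self.level = None

    def sample(self, value):
        """Process one sample; return the level to notify about, or None."""
        self.level = highest_severity(value, self.thresholds)
        if self.level is None:
            self.notified.clear()  # value fell below every threshold: cleared
            return None
        if self.level not in self.notified:
            self.notified.add(self.level)
            return self.level
        return None  # already notified for this level


incident = Incident({"warning": 80, "major": 90, "critical": 95})
for used_pct in [70, 85, 86, 96, 85, 60, 85]:
    print(used_pct, incident.sample(used_pct))
```

In this sketch a level is notified at most once per incident, and clearing the incident resets that memory so a later crossing notifies again.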

Rule based notifications contain all of the standard properties of an event based notification. And like an event based notification, you can provide JavaScript to execute before the notification is sent.

A simple numeric threshold often works fine for triggering a notification. For some performance measures, however, a simple numeric threshold causes problems. This is particularly true for performance measures that fluctuate rapidly. A good example is processor utilization, which often has short bursts of high values. It is hard to set a simple numeric threshold that detects a problem when it occurs, but doesn't send many notifications when there isn't a real problem. In other cases, the data item may not even be numeric. It could be a string value. To deal with these situations, ArteMon includes five different types of rules, which are explained below.

Simple Numeric Thresholds

The simple numeric threshold rule lets you assign a threshold for each of ArteMon's severity levels for a monitored data item, or an arithmetic expression based on the data items in the same monitored object.

N of M Thresholds

The N of M threshold rule is similar to the numeric threshold rule. But instead of looking at the last sampled value, it looks at the previous M values. The threshold is considered crossed if the threshold value was exceeded in N or more of the previous M samples.
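One way to picture the N of M rule is the sketch below, which keeps the last M comparisons in a fixed-size buffer. The class name and the sample values are hypothetical.

```python
from collections import deque


class NOfMRule:
    """Crossed when the threshold was exceeded in at least n of the last m samples."""

    def __init__(self, threshold, n, m):
        if not 0 < n <= m:
            raise ValueError("n must be between 1 and m")
        self.threshold = threshold
        self.n = n
        self.recent = deque(maxlen=m)  # True/False for each of the last m samples

    def sample(self, value):
        """Record one sample and report whether the rule is currently crossed."""
        self.recent.append(value > self.threshold)
        return sum(self.recent) >= self.n


# Hypothetical CPU samples: crossed only once 3 of the last 5 exceed 90.
rule = NOfMRule(threshold=90, n=3, m=5)
crossed = [rule.sample(v) for v in [95, 20, 96, 30, 97, 10, 15]]
print(crossed)
```

A single spike never trips the rule here. It fires on the fifth sample, once three of the last five values are above 90, and releases as the older spikes leave the window.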

Smoothed Numeric Thresholds

The smoothed numeric threshold rule doesn't use the sampled value for testing if the threshold is crossed. Instead, it uses a "smoothed" value. You can smooth the values using a simple moving average, a weighted moving average, or exponential smoothing.
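The two smoothing methods other than the simple moving average can be sketched as below. The function names, the weights, and the smoothing factor `alpha` are assumptions for illustration. How ArteMon parameterizes them is not stated here.

```python
def weighted_moving_average(values, weights):
    """Weighted mean of the last len(weights) values; weights[-1] is for the newest."""
    if not values:
        raise ValueError("at least one value is required")
    window = values[-len(weights):]
    used = weights[-len(window):]  # fewer values than weights: use the newest weights
    return sum(v * w for v, w in zip(window, used)) / sum(used)


def exponential_smoothing(values, alpha):
    """Return the smoothed series s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(float(value))  # seed with the first observation
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


samples = [10, 20, 30]
wma = weighted_moving_average(samples, weights=[1, 2, 3])
ses = exponential_smoothing(samples, alpha=0.5)
print(wma, ses)
```

Both methods give recent samples more influence than older ones. A larger `alpha`, or weights that rise more steeply, makes the smoothed value follow the raw data more closely.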

Boolean Expression Thresholds

Sometimes you can't use a numeric threshold because the monitored data item isn't a number but a string. In this case you need to use a string comparison, or a Boolean expression containing string comparisons. You may also want to use Boolean expressions based on numeric data or combinations of numeric and string data. The Boolean expression rule lets you do that. You supply a Boolean expression for one or more of the severity levels.
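The idea can be illustrated with the sketch below, which attaches one Boolean expression to each of two severity levels. ArteMon's actual expression syntax is not shown in this text, so the expressions are written as plain Python callables. The data item names and severity names are hypothetical.

```python
# Hypothetical sampled data items for one monitored object.
sample = {"status": "DEGRADED", "queue_depth": 120}

# One Boolean expression per severity level, mixing string and numeric tests.
rules = {
    "warning": lambda s: s["status"] != "OK",
    "critical": lambda s: s["status"] == "FAILED"
    or (s["status"] == "DEGRADED" and s["queue_depth"] > 100),
}


def crossed_levels(sample, rules):
    """Return the severity levels whose Boolean expression is true for the sample."""
    return [level for level, expression in rules.items() if expression(sample)]


print(crossed_levels(sample, rules))
```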

Quality Control Thresholds

The quality control threshold rule uses concepts from statistics to look for abnormal conditions. The rule monitors the data to determine an estimated mean and standard deviation. Thresholds are then specified as a number of standard deviations above the mean, below it, or both. When an observed value falls outside the specified range, a notification is sent.
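A minimal sketch of the idea, assuming the mean and standard deviation are estimated from a list of earlier samples. How ArteMon actually maintains its estimates is not described here, and the function names, the `side` parameter, and the data are hypothetical.

```python
from statistics import mean, stdev


def control_limits(history, k):
    """Return (lower, upper) limits at k sample standard deviations from the mean."""
    center = mean(history)
    spread = stdev(history)  # sample standard deviation; needs at least 2 values
    return center - k * spread, center + k * spread


def out_of_range(value, limits, side="both"):
    """Test a value against the limits: side is 'above', 'below', or 'both'."""
    lower, upper = limits
    if side == "above":
        return value > upper
    if side == "below":
        return value < lower
    return value < lower or value > upper


# Hypothetical response times (ms) observed under normal conditions.
history = [48, 52, 50, 49, 51, 50]
limits = control_limits(history, k=3)
print(limits, out_of_range(56, limits), out_of_range(53, limits))
```

With this history the mean is 50 and the sample standard deviation is about 1.41, so three-sigma limits fall near 45.8 and 54.2. A value of 56 is flagged, while 53 is not.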