
cloudwatch 3.3 alarm rules


Overview

Amazon Web Services (AWS) provides an alarm function in its CloudWatch service that monitors the state of metrics and can perform certain actions if a metric value exceeds a threshold for an extended period of time, such as sending an SNS notification or executing an Auto Scaling policy action. Actions may also be triggered on transitions to other states. In implementing an AWS-compatible CloudWatch service, the documentation was sufficient to implement the standard use case (threshold exceeded for a specified number of periods), but certain other behaviors, most notably how exactly INSUFFICIENT_DATA is determined, required a bit of experimentation and some assumptions to complete the implementation. These assumptions are based on open questions and experimentation. The document below discusses the open questions and shows the final rule set used to evaluate alarm states in the Eucalyptus CloudWatch implementation.

Definitions

An alarm is a named entity that watches the value of a metric over time. An alarm consists of a name, a CloudWatch metric (and all of its associated fields), an evaluation rule (which will be expanded on below), a number of periods to evaluate over, and a period length.

The evaluation rule is a rule of the form:

The STATISTIC of the metric is COMPARISON_OPERATOR THRESHOLD_VALUE.
Valid statistics are Average, Minimum, Maximum, SampleCount, and Sum.

Valid comparison operators are GreaterThan, LessThan, GreaterThanOrEqualTo and LessThanOrEqualTo.

A concrete evaluation rule is:

The Average of the metric is GreaterThan 30.
These rules are applied over a single period.
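To make the rule concrete, here is a minimal sketch of evaluating one period's data against such a rule. This is our own illustration (the names STATISTICS, COMPARISONS, and evaluate_rule are not part of any API):

```python
# Map statistic and comparison-operator names to functions. These dictionaries
# and evaluate_rule() are illustrative names of our own, not a CloudWatch API.
STATISTICS = {
    "Average": lambda v: sum(v) / len(v),
    "Minimum": min,
    "Maximum": max,
    "SampleCount": len,
    "Sum": sum,
}

COMPARISONS = {
    "GreaterThan": lambda s, t: s > t,
    "LessThan": lambda s, t: s < t,
    "GreaterThanOrEqualTo": lambda s, t: s >= t,
    "LessThanOrEqualTo": lambda s, t: s <= t,
}

def evaluate_rule(values, statistic, comparison, threshold):
    """Return True if one period's statistic satisfies the comparison."""
    return COMPARISONS[comparison](STATISTICS[statistic](values), threshold)

# "The Average of the metric is GreaterThan 30." -> average 35.0, so True.
print(evaluate_rule([20.0, 45.0, 40.0], "Average", "GreaterThan", 30.0))
```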

AWS defines three state values for an alarm. These values are (per the AWS spec):

  • OK - The value of the statistic does not exceed the threshold.
  • ALARM - The value of the statistic does exceed the threshold.
  • INSUFFICIENT_DATA - There is no data for the statistic. This may be because the alarm has just started or the metric does not exist.
One must distinguish between the evaluation rule result for a single period and the global state of the alarm. An alarm is set to the ALARM state if, for a certain number of consecutive periods, the evaluation rule shows that the threshold value has been exceeded. Thus, an alarm may be in the OK state even if the previous period's value exceeded the threshold, as shown in the graph below. (The number of evaluation periods is 5, and the threshold is 30.)

File:alarm-rules-graph-1.png
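As a baseline, here is a minimal sketch of the naive consecutive-period rule just described; the experiments below show that the actual AWS behavior is more subtle (names are our own):

```python
# Naive rule: ALARM only if the threshold was exceeded in each of the last n
# periods; otherwise OK. (INSUFFICIENT_DATA is ignored in this baseline.)
def naive_alarm_state(period_exceeded, n):
    """period_exceeded: per-period rule results, oldest first."""
    if len(period_exceeded) >= n and all(period_exceeded[-n:]):
        return "ALARM"
    return "OK"

# The graph's scenario: only the latest period exceeds the threshold, n = 5.
print(naive_alarm_state([False, False, False, False, True], 5))  # OK
```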

Discussion Questions

What the above graph shows is the normal ALARM case. An alarm should be in the ALARM state if the threshold has been exceeded for the specified number of consecutive periods. But this is only part of the story. Here are some questions that need to be answered to completely determine what the current alarm state "should" be.

  1. What are the period boundaries for evaluation? (i.e. at time t, what is the start and end time of the periods to be evaluated?)
  2. When exactly should we change the state of the alarm to OK?
  3. When exactly should we change the state of the alarm to INSUFFICIENT_DATA?
  4. When exactly should we change the state of the alarm to ALARM? We know the case where we have exceeded the threshold for a certain number of periods, but is that the only case?
  5. How often do we evaluate the state of the alarm?
  6. What happens if the alarm is currently in the ALARM state, and we receive some new data points with timestamps in the past, which if re-evaluated would cause previous periods to no longer exceed the threshold?
  7. When an alarm state is evaluated, does the "old" state play a part? This could be an issue, as there is an API method called "setAlarmState()" that allows temporarily setting the alarm state.
We decided to see what AWS does for the above questions with a series of experiments.

Experiment 1 (OK DATA)

The first experiment sent data for a metric that was "below" the threshold for a certain period of time. Our suspicion was that this would cause the alarm to go into the OK state. We then stopped putting in any data, expecting the alarm eventually to return to the INSUFFICIENT_DATA state. Here is our experiment:

  • Time 00:00 create an alarm (3 evaluation periods, period 300 seconds, threshold 30.0)
  • Time 03:00 put metric data in, value of 20 (keep doing this every 30 seconds)
  • Time 17:30 put the last metric data value of 20 in.
The observed alarm state transitions were:
  • Time 00:00 alarm state was INSUFFICIENT_DATA
  • Time 03:35 alarm state changed to OK
  • Time 43:35 alarm state changed to INSUFFICIENT_DATA
Here is a graph of the experiment. File:alarm-rules-graph-2.png

The green brackets indicate the length of a single period (5 minutes, or 300 seconds) and the length of the "evaluation window", the total length of time the threshold must be exceeded for the ALARM state to be triggered. There are a few things to note from this graph. First, the alarm state changed to OK within 1 minute of data being entered. This was consistently true across other experiments as well. This suggests that alarm evaluation is done not once per period, but once per minute. This is our first observation.

Observation 1 -- Alarm evaluation is done once per minute.
Another point of interest on the graph is the time at which the final INSUFFICIENT_DATA state is reached. If having one period of insufficient data within the evaluation window were sufficient to trigger the INSUFFICIENT_DATA state, it would have been triggered much earlier, near the 23 minute mark. What is more surprising is that even having the entire evaluation window contain no data is not enough to trigger the INSUFFICIENT_DATA state; it takes longer than that (approximately 10 more minutes). Doing this same experiment with different period lengths shows that it takes on average 2 periods longer than the evaluation window, or 5 minutes, whichever is greater, before the INSUFFICIENT_DATA state is triggered. This leads us to our second observation.
Observation 2 -- Transition to the INSUFFICIENT_DATA state takes 2 periods or 5 minutes longer (whichever is greater) than the evaluation window from the point that data is no longer received.
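As a back-of-the-envelope check of Observation 2, here is a small sketch that predicts the transition time (times in seconds; the function name is our own, and the observed transition lands on the first once-per-minute evaluation at or after this time, per Observation 1):

```python
# Predict when an alarm should flip to INSUFFICIENT_DATA after data stops.
def insufficient_data_time(last_data_time, period, n_periods):
    window = period * n_periods
    extra = max(2 * period, 300)  # 2 periods or 5 minutes, whichever is greater
    return last_data_time + window + extra

# Experiment 1: last data at 17:30, period 300 s, 3 evaluation periods.
# Predicted 42:30; the observed transition was the 43:35 evaluation tick.
print(divmod(insufficient_data_time(17 * 60 + 30, 300, 3), 60))  # (42, 30)
```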
So we have discovered a couple of things about the INSUFFICIENT_DATA state. Let's run another experiment with an alarm.

Experiment 2 (ALARM DATA)

The second experiment sent data for a metric that was "above" the threshold for a certain period of time. Our suspicion was that this would cause the alarm to go into the ALARM state. We then stopped putting in any data, expecting the alarm eventually to return to the INSUFFICIENT_DATA state. Here is our experiment:

  • Time 00:00 create an alarm (3 evaluation periods, period 300 seconds, threshold 30.0)
  • Time 03:00 put metric data in, value of 40 (keep doing this every 30 seconds)
  • Time 18:30 put the last metric data value of 40 in.
The observed alarm state transitions were:
  • Time 00:00 alarm state was INSUFFICIENT_DATA
  • Time 09:09 alarm state changed to ALARM
  • Time 44:09 alarm state changed to INSUFFICIENT_DATA
Here is a graph of the experiment. File:alarm-rules-graph-3.png

The INSUFFICIENT_DATA case is almost identical to the previous experiment, taking 10 minutes longer than the evaluation window for all the alarm data to be "cleared out" before the alarm returns to the INSUFFICIENT_DATA state. One thing to note here is that for a large portion of this experiment the alarm was in the ALARM state, even though it was not strictly true that the previous 3 intervals all contained data exceeding the threshold. In addition, the first ALARM state was triggered just over 6 minutes after the alarm data started going in. This seems counter-intuitive, as we don't yet have 3 periods of data exceeding the threshold. But if we look closely, we can construct such a scenario. Let's divide the time into 3 intervals.

  • From one minute before alarm creation to 04:00. There is a data point at 03:00, so this first interval has exceeded the threshold.
  • From 04:00 to 09:00 there are several data points that exceed the threshold, and the average does as well.
  • From 09:00 to 14:00 we already have one data point (09:00). That is the only data point, but the average exceeds the threshold.
That is three intervals. We have learned the following.
Observation 3 -- Transition to the INSUFFICIENT_DATA state still takes 2 periods or 5 minutes longer (whichever is greater) than the evaluation window from the point that data is no longer received, even if the previous data was ALARM data.
Observation 4 -- It is not strictly necessary for there to be ALARM data in the previous 3 consecutive periods (or the number of evaluation periods in general) for the ALARM state to "stay" on.
Observation 5 -- One data point in an interval is sufficient for that whole interval to be considered to have "exceeded" or "not exceeded" the threshold.
Observation 6 -- Evaluation periods may include data in the future. This observation will not hold true for Eucalyptus.
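Before moving on, here is a minimal sketch of how a single period would be classified under Observation 5, using the Average-greater-than-threshold rule from these experiments (names are our own):

```python
# Classify one period per Observation 5: any data point at all makes the
# period count; no data points at all means INSUFFICIENT_DATA.
def period_state(values, threshold):
    if not values:
        return "INSUFFICIENT_DATA"
    average = sum(values) / len(values)
    return "ALARM" if average > threshold else "OK"

print(period_state([40.0], 30.0))  # ALARM: one data point is enough
print(period_state([], 30.0))      # INSUFFICIENT_DATA
```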
It was a bit of a stretch to get 3 periods out of the above to trigger the alarm, so to see if it is general policy to trigger the alarm early, we tried another experiment.

Experiment 3 (ALARM DATA (LONGER))

The third experiment sent data for a metric that was "above" the threshold for a certain period of time, checking to see how long it took to go into the ALARM state. Here is our experiment:

  • Time 00:00 create an alarm (5 evaluation periods, period 300 seconds, threshold 30.0)
  • Time 03:00 put metric data in, value of 40 (keep doing this every 30 seconds)
  • Time 28:30 put the last metric data value of 40 in.
The observed alarm state transitions were:
  • Time 00:00 alarm state was INSUFFICIENT_DATA
  • Time 18:36 alarm state changed to ALARM
  • Time 64:36 alarm state changed to INSUFFICIENT_DATA
Here is a graph of the experiment. File:alarm-rules-graph-4.png

In this case it took a little over 15 minutes 30 seconds to trigger the alarm, but this span can also be cut into 5 intervals in the same way as before. This, combined with the last experiment, shows us the following.

Observation 7 -- If the previous set of consecutive periods contains both ALARM and INSUFFICIENT_DATA periods, there are situations where the alarm state is INSUFFICIENT_DATA (at the beginning) and situations where it is ALARM (at the end). Having the alarm be in the OK state appears to require an OK interval in the previous interval evaluation.
We have seen that a combination of INSUFFICIENT_DATA and ALARM data in intervals will make the alarm state either ALARM or INSUFFICIENT_DATA. Here are some of the configurations (assuming 3 evaluation periods):
  • (INSUFFICIENT_DATA, INSUFFICIENT_DATA, INSUFFICIENT_DATA) -> INSUFFICIENT_DATA (assuming previous data is also INSUFFICIENT_DATA)
  • (INSUFFICIENT_DATA, INSUFFICIENT_DATA, ALARM) -> INSUFFICIENT_DATA (assuming previous data is also INSUFFICIENT_DATA)
  • (ALARM, ALARM, ALARM) -> ALARM (this is by definition)
  • (ALARM, ALARM, INSUFFICIENT_DATA) -> ALARM (if the previous data is also ALARM)
This suggests that INSUFFICIENT_DATA can be a placeholder for an "ALARM" state during the evaluation, assuming we have seen an ALARM period in the past.
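Here is a minimal sketch of that placeholder idea: an INSUFFICIENT_DATA period inherits the most recent non-INSUFFICIENT_DATA state before it, if one exists. The names are our own; the idea is formalized in the CONCLUSION below.

```python
# Replace each INSUFFICIENT_DATA period with the closest earlier period state
# that is not INSUFFICIENT_DATA, when such a state exists.
def smooth(states):
    """states: per-period states, oldest first."""
    out, last = [], "INSUFFICIENT_DATA"
    for s in states:
        if s != "INSUFFICIENT_DATA":
            last = s
        out.append(last)
    return out

print(smooth(["ALARM", "ALARM", "INSUFFICIENT_DATA"]))
# ['ALARM', 'ALARM', 'ALARM'] -> all ALARM, so the alarm state stays ALARM
```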

We have not yet run a trial with OK, INSUFFICIENT_DATA, and ALARM periods all in the data set. We will try this next.

Experiment 4 (MIXED DATA)

The fourth experiment sent data for a metric that was "above" the threshold, data that was "below" the threshold, and periods of no data at all. We might see all alarm states, but we did not have any predictions.

Here is our experiment:

  • Time 00:00 create an alarm (3 evaluation periods, period 300 seconds, threshold 30.0)
  • Time 03:00 put metric data in, value of 20 (keep doing this every 30 seconds)
  • Time 07:30 put the last metric data value of 20 in.
  • Time 08:00 put metric data in, value of 40 (keep doing this every 30 seconds)
  • Time 12:30 put the last metric data value of 40 in.
  • Time 13:00 put metric data in, value of 20 (keep doing this every 30 seconds)
  • Time 17:30 put the last metric data value of 20 in.
  • Time 18:00 put no data in for a while
  • Time 32:30 end the no-data gap (data starts up again in 30 seconds)
  • Time 33:00 put metric data in, value of 40 (keep doing this every 30 seconds)
  • Time 37:30 put the last metric data value of 40 in.
  • Time 38:00 put no data in for a while
  • Time 42:30 end the no-data gap (data starts up again in 30 seconds)
  • Time 43:00 put metric data in, value of 40 (keep doing this every 30 seconds)
  • Time 47:30 put the last metric data value of 40 in.
  • Time 48:00 put no data in for a while
  • Time 52:30 end the no-data gap (data starts up again in 30 seconds)
  • Time 53:00 put metric data in, value of 20 (keep doing this every 30 seconds)
  • Time 57:30 put the last metric data value of 20 in.
  • Time 58:00 stop putting any data in
  • Time 73:00 delete the alarm
The observed alarm state transitions were:
  • Time 00:00 alarm state was INSUFFICIENT_DATA
  • Time 03:57 alarm state changed to OK
  • Time 42:57 alarm state changed to ALARM
  • Time 53:57 alarm state changed to OK
  • Time 73:57 alarm deleted
Here is a graph of the experiment. File:alarm-rules-graph-5.png

A couple of things to note here. First, the final INSUFFICIENT_DATA state doesn't occur; it would probably have occurred somewhere around 83:00 or 84:00 (evaluation window of 15 minutes + 2 more periods totaling 10 minutes = 25 minutes after 58:00, the last data point) had the alarm not been deleted first. This is consistent with what we have seen so far. Second, adding OK data causes the alarm state to change to OK within a minute. Finally, the 42:57 ALARM state change happens with an evaluation window consisting of ALARM and INSUFFICIENT_DATA periods. This leads to the following two observations.

Observation 8 -- When OK, INSUFFICIENT_DATA, and ALARM periods all exist within an evaluation window, the alarm state will be OK. Specifically, if there is even one OK period in the evaluation window, the state value will be OK.
Observation 9 -- Combinations of INSUFFICIENT_DATA and ALARM periods can trigger an ALARM state. We can consider an INSUFFICIENT_DATA period to be equivalent to an ALARM period if it is preceded by an ALARM period.
This gives us just about enough data to evaluate our rules. Our final experiment deals with retroactive data.

Experiment 5 (RETROACTIVE DATA)

Our fifth experiment was to see what happened at AWS when we triggered an alarm and added old data afterwards.

We created an alarm against metric "Experiment-1", threshold: 30, number of evaluation periods: 5, period: 120 (seconds), statistic: Average. This means that the alarm state should be set to ALARM after 5 periods in which the threshold was exceeded.

Here is our experiment:

  • Time 00:00 create an alarm (5 evaluation periods, period 120 seconds, threshold 30.0)
  • Time 00:00 put metric data in, value of 40 (keep doing this every 30 seconds)
  • Time 14:00 put a bunch of back-dated metric data in, value of 10.0 (timestamped from 00:00-10:00, spaced every 30 seconds; this makes the average for those periods 25.0)
  • Time 14:00 continue putting metric data in, value of 40, at the present time (every 30 seconds)
  • Time 18:00 put the last data point, value of 40.0, in
What should happen here is that if old data is used in the evaluation, the state of the alarm should be set to ALARM around 10:00; but at 14:00 or so, once the back-dated data arrives, those old intervals should no longer exceed the threshold, so the state of the alarm should no longer be ALARM.

The observed alarm state transitions were:

  • Time 00:00 Alarm Created and set to INSUFFICIENT_DATA state
  • Time 07:43 Alarm set to ALARM state
  • Time 14:43 Alarm set to OK state
  • Time 29:43 Alarm set to ALARM state
  • Time 35:43 Alarm set to INSUFFICIENT_DATA state
Here is a graph of the experiment. File:alarm-rules-graph-6.png

The 07:43 initial ALARM state is as before: we can divvy up the intervals in such a way as to have 5 intervals in the ALARM state. In addition, within a minute of the retroactive data being added, the alarm state was re-evaluated and set to the OK state. Notice also that the second ALARM state is set very late: the OK data had to fall off the evaluation window first. The INSUFFICIENT_DATA evaluation is the same as before. We have only one real observation to make here.

Observation 10 -- Alarm state is evaluated every minute, and data is not cached. New retroactive values are taken into account.

CONCLUSION

With the data described above, these are the rules that Eucalyptus determined most closely match the AWS behavior and are internally consistent.

  1. Alarm state is evaluated once per minute.
  2. The global alarm state is completely determined by the alarm state of the previous several periods, and not by the alarm's previous global state.
  3. At any point t, given a period length of p and a number of evaluation periods n, the evaluation window is defined to be from t' - np to t', where t' is t rounded down to the nearest minute. The period start times are then t' - np, t' - (n-1)p, t' - (n-2)p, ..., t' - p. As an example, suppose the current time is 05:05:03, our period length is 120 seconds, and our number of evaluation periods is 5. In this case n is 5, p is 120, and t' is 05:05:00. Our window is from 04:55:00 to 05:05:00, and our period start times are 04:55:00, 04:57:00, 04:59:00, 05:01:00, and 05:03:00. In the next minute all of our period start times will shift forward one minute. (See the sketch after this list.)
  4. In addition to our n evaluation periods, we add a buffer: a number of periods prior to our n periods (typically 2 periods, or 5 minutes' worth, whichever is greater, but always a whole number of periods). Specifically, if p is 60 seconds, we add 5 periods; if p is 120 seconds, we add 3 periods; if p is 180 seconds or higher, we add 2 periods. This buffer is used for "smoothing over", which is described next.
  5. Starting at the beginning of the buffer, we "smooth over" any periods that are considered INSUFFICIENT_DATA. If a given period evaluates to INSUFFICIENT_DATA, its smoothed-over value is the value of the closest non-INSUFFICIENT_DATA period before it in time within the evaluation window and the buffer. For example, if the oldest period in the buffer is evaluated as OK and all of the following periods are INSUFFICIENT_DATA, all of the following periods will have a smoothed-over value of OK. A period retains a smoothed-over state of INSUFFICIENT_DATA only if all previous periods have an actual state of INSUFFICIENT_DATA.
  6. With the smoothed-over values taken into account: if there is even a single OK period in the evaluation window (no longer including the buffer, but having used the buffer to compute the smoothed-over values), the global state is OK. If every period in the evaluation window has a smoothed-over value of ALARM, the global state is ALARM. Otherwise, the global state is INSUFFICIENT_DATA.
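Here is a small sketch of the window and buffer computations from rules 3 and 4, assuming Unix-style timestamps in seconds (helper names are our own, not the Eucalyptus API):

```python
import math

# Rule 3: compute the evaluation window and period start times at time t.
def evaluation_window(t, p, n):
    """Return (window_start, period_start_times, window_end)."""
    t_prime = t - (t % 60)  # round t down to the whole minute
    starts = [t_prime - (n - i) * p for i in range(n)]
    return t_prime - n * p, starts, t_prime

# Rule 4: whole number of buffer periods covering >= 2 periods or 5 minutes.
def buffer_periods(p):
    return max(2, math.ceil(300 / p))

# Worked example from rule 3: current time 05:05:03, p = 120 s, n = 5.
t = 5 * 3600 + 5 * 60 + 3
start, starts, end = evaluation_window(t, 120, 5)
print(start, end)  # 17700 18300, i.e. 04:55:00 to 05:05:00
print(starts)      # period starts at 04:55, 04:57, 04:59, 05:01, 05:03
print(buffer_periods(60), buffer_periods(120), buffer_periods(300))  # 5 3 2
```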

RULES (a bit more code-like)

  1. Let n be the number of evaluation periods. Let m be the number of "buffer" periods (typically 2).
  2. Create a state array a[0]..a[n+m-1], a[0] being the alarm state of the oldest period in the evaluation window plus the buffer.
  3. Create a second array b[0]..b[n+m-1], and calculate the value of b[i] as follows: b[i] = a[i] if a[i] != INSUFFICIENT_DATA. If a[i] == INSUFFICIENT_DATA, then b[i] = a[k], where k is the largest index < i such that a[k] != INSUFFICIENT_DATA, if such a k exists; otherwise b[i] = INSUFFICIENT_DATA.
  4. The global alarm state = OK if at least one of b[m]..b[n+m-1] == OK; ALARM if all of b[m]..b[n+m-1] == ALARM; INSUFFICIENT_DATA otherwise. (A runnable sketch follows this list.)
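A runnable sketch of these four rules, with period states as plain strings (function and variable names are our own):

```python
# Implement rules 1-4: smooth over INSUFFICIENT_DATA using the buffer, then
# judge only the n evaluation periods.
def global_alarm_state(a, n, m):
    """a: per-period states, a[0] oldest; len(a) == n + m (buffer first)."""
    b, last = [], "INSUFFICIENT_DATA"
    for s in a:                      # rule 3: build the smoothed-over array b
        if s != "INSUFFICIENT_DATA":
            last = s
        b.append(last)
    window = b[m:]                   # rule 4: drop the m buffer periods
    if any(s == "OK" for s in window):
        return "OK"
    if all(s == "ALARM" for s in window):
        return "ALARM"
    return "INSUFFICIENT_DATA"

# n = 3 evaluation periods, m = 2 buffer periods; matches the configuration
# (ALARM, ALARM, INSUFFICIENT_DATA) -> ALARM when the prior data was ALARM.
print(global_alarm_state(
    ["ALARM", "ALARM", "ALARM", "ALARM", "INSUFFICIENT_DATA"], 3, 2))  # ALARM
```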

PREVIOUS ALARM STATE CONSIDERATIONS

The global state of the alarm from the previous alarm evaluation is generally not used for the current alarm evaluation, except in the following situation. Actions are triggered on state change, so if the previous state is different from the newly evaluated state, the appropriate actions will be taken. In addition, some actions need to take place once per period (for example, Auto Scaling actions). If a state change has not occurred, the last time an action was taken in the current state is noted, and if more than a period has elapsed, the actions that need to run once per period (Auto Scaling actions) will be executed again.
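Here is a minimal sketch of this dispatch behavior, assuming hypothetical now() and fire_actions() helpers; it is not the actual Eucalyptus implementation.

```python
import time

def now():
    return time.time()

def fire_actions(alarm, state, once_per_period_only=False):
    # Stub: a real implementation would send SNS notifications or execute
    # Auto Scaling policies registered for this state.
    kind = "once-per-period" if once_per_period_only else "all"
    print("firing", kind, "actions for state", state)

def dispatch(alarm, new_state, period):
    """Fire actions on a state change; repeat per-period actions otherwise."""
    if new_state != alarm["state"]:
        alarm["state"] = new_state
        alarm["last_action_time"] = now()
        fire_actions(alarm, new_state)
    elif now() - alarm["last_action_time"] >= period:
        alarm["last_action_time"] = now()
        fire_actions(alarm, new_state, once_per_period_only=True)

alarm = {"state": "OK", "last_action_time": now()}
dispatch(alarm, "ALARM", 300)  # state change -> all ALARM actions fire
```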



