How to stop AWS CloudWatch UnHealthyHostCount false alarms?

2018-06-28 16:06:51

We get this message (via email) several times a day:

ALARM: "elb-production-UnHealthHostCount" in US - N. Virginia

You are receiving this email because your Amazon CloudWatch Alarm "elb-production-UnHealthHostCount" in the US - N. Virginia region has entered the ALARM state, because "Threshold Crossed: 1 datapoint (0.2) was greater than the threshold (0.0)." at "Thursday 21 January, 2016 17:39:39 UTC".

View this alarm in the AWS Management Console: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#s=Alarms&alarm=elb-production-UnHealthHostCount

Alarm Details:
- Name: elb-production-UnHealthHostCount
- Description:
- State Change: OK -> ALARM
- Reason for State Change: Threshold Crossed: 1 datapoint (0.2) was greater than the threshold (0.0).
- Timestamp: Thursday 21 January, 2016 17:39:39 UTC
- AWS Account: 1234567890

Threshold:
- The alarm is in the ALARM state when the metric is GreaterThanThreshold 0.0 for 60 seconds.

Monitored Metric:
- MetricNamespace: AWS/ELB
- MetricName: UnHealthyHostCount
- Dimensions: [LoadBalancerName = production]
- Period: 60 seconds
- Statistic: Average
- Unit: not specified

State Change Actions:
- OK:
- ALARM: [arn:aws:sns:us-east-1:1234567890:DevOps]
- INSUFFICIENT_DATA:

However, our nginx log files show that AWS was able to reach each of our servers around the time the alarm was "set off". In other words, our EC2 instances returned 200 on every request to /healthcheck around Thursday 21 January, 2016 17:39:39 UTC.

AWS seems to check each of our instances every 30 seconds or so.
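For anyone who wants to repeat that log check, here is a rough Python sketch that counts the status codes returned for /healthcheck within a couple of minutes of the alarm timestamp. It assumes nginx's default "combined" access log format. The sample log lines and the two-minute window are made up for illustration and are not taken from our servers.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumes nginx's default "combined" access log format.
LINE_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)


def healthcheck_statuses(lines, alarm_time, window=timedelta(minutes=2),
                         path="/healthcheck"):
    """Count status codes for health check requests close to alarm_time."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match.group("path") != path:
            continue  # not a parsable line, or not a health check request
        when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        if abs(when - alarm_time) <= window:
            counts[match.group("status")] += 1
    return counts


# Hypothetical log lines, for illustration only.
sample = [
    '10.0.0.5 - - [21/Jan/2016:17:39:10 +0000] "GET /healthcheck HTTP/1.1" 200 2 "-" "ELB-HealthChecker/1.0"',
    '10.0.0.5 - - [21/Jan/2016:17:39:40 +0000] "GET /healthcheck HTTP/1.1" 200 2 "-" "ELB-HealthChecker/1.0"',
    '10.0.0.9 - - [21/Jan/2016:17:39:41 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "curl/7.43.0"',
]
alarm_time = datetime(2016, 1, 21, 17, 39, 39, tzinfo=timezone.utc)
counts = healthcheck_statuses(sample, alarm_time)
print(dict(counts))
```

Any status other than 200 in the result, or a gap longer than the health check interval, would point at the instance rather than at the alarm settings.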

Has anyone experienced this issue? If so, what have you done about it?

I've updated a few settings from ...

Whenever: UnHealthyHostCount > 0

Statistic: Average

... to ...

Whenever: UnHealthyHostCount >= 1

Statistic: Maximum

I will update this answer if my problem continues to occur.

UPDATE:

The problem continued to occur :/
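A possible reason the first change made no difference, sketched in Python: if the 0.2 datapoint from the email was the average of five samples in one 60-second period, with a single sample reporting one unhealthy host, then both the old and the new condition fire on that period. The five sample values are an assumption chosen to reproduce the 0.2. I have not confirmed how many samples the ELB actually reported.

```python
from statistics import mean


def breaches_average_gt_zero(samples):
    """Old setting: Statistic = Average, UnHealthyHostCount > 0."""
    return mean(samples) > 0


def breaches_maximum_ge_one(samples):
    """New setting: Statistic = Maximum, UnHealthyHostCount >= 1."""
    return max(samples) >= 1


# Hypothetical samples for one period: one of five reports a single unhealthy host.
samples = [0, 0, 0, 0, 1]

print(mean(samples))                      # 0.2, matching the datapoint in the email
print(breaches_average_gt_zero(samples))  # True
print(breaches_maximum_ge_one(samples))   # True
```

Under that assumption, switching the statistic only changes how the datapoint is reported (1 instead of 0.2), not whether the period counts as breaching.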

I've updated one more setting on my current UnHealthyHostCount alarm ...

for 1 consecutive period(s)

... to ...

for 2 consecutive period(s)

... and I've created a new alarm to track if multiple servers are down for a single period ...

[image: enter image description here]

I will update this answer if my problem continues to occur.
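For reference, the updated UnHealthyHostCount alarm could also be set from a script instead of the console. This is a minimal sketch using boto3's `put_metric_alarm`, with the values taken from the email and the settings above. The helper names `unhealthy_host_alarm_params` and `apply_alarm` are my own, and I made these changes in the console, so treat this as an untested equivalent. The second alarm is left out because its settings are only in the screenshot.

```python
def unhealthy_host_alarm_params(load_balancer, sns_topic_arn):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    return {
        # The alarm name keeps the spelling used in the notification email.
        "AlarmName": f"elb-{load_balancer}-UnHealthHostCount",
        "Namespace": "AWS/ELB",
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [{"Name": "LoadBalancerName", "Value": load_balancer}],
        "Statistic": "Maximum",
        "Period": 60,                # seconds
        "EvaluationPeriods": 2,      # "for 2 consecutive period(s)"
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def apply_alarm(params, region="us-east-1"):
    """Create or update the alarm; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the parameters can be inspected offline

    boto3.client("cloudwatch", region_name=region).put_metric_alarm(**params)


params = unhealthy_host_alarm_params(
    "production", "arn:aws:sns:us-east-1:1234567890:DevOps"
)
print(params["Statistic"], params["EvaluationPeriods"])
# apply_alarm(params)  # uncomment to send the request to AWS
```

Calling `put_metric_alarm` with the name of an existing alarm updates that alarm rather than creating a second one, so it is worth checking the name before running it.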
