How to stop AWS CloudWatch UnHealthyHostCount false alarms?
We get this message (via email) several times a day:
ALARM: "elb-production-UnHealthHostCount" in US - N. Virginia
You are receiving this email because your Amazon CloudWatch Alarm "elb-production-UnHealthHostCount" in the US - N. Virginia region has entered the ALARM state, because "Threshold Crossed: 1 datapoint (0.2) was greater than the threshold (0.0)." at "Thursday 21 January, 2016 17:39:39 UTC".
View this alarm in the AWS Management Console: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#s=Alarms&alarm=elb-production-UnHealthHostCount
Alarm Details:
- Name: elb-production-UnHealthHostCount
- Description:
- State Change: OK -> ALARM
- Reason for State Change: Threshold Crossed: 1 datapoint (0.2) was greater than the threshold (0.0).
- Timestamp: Thursday 21 January, 2016 17:39:39 UTC
- AWS Account: 1234567890
Threshold:
- The alarm is in the ALARM state when the metric is GreaterThanThreshold 0.0 for 60 seconds.
Monitored Metric:
- MetricNamespace: AWS/ELB
- MetricName: UnHealthyHostCount
- Dimensions: [LoadBalancerName = production]
- Period: 60 seconds
- Statistic: Average
- Unit: not specified
State Change Actions:
- OK:
- ALARM: [arn:aws:sns:us-east-1:1234567890:DevOps]
- INSUFFICIENT_DATA:
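A host count is a whole number, so the fractional "0.2" suggests the 60-second datapoint is an Average over several raw UnHealthyHostCount samples rather than a single count. The sketch below assumes (the email does not say this) that five samples arrived in that minute and only one of them saw an unhealthy host; that alone is enough to cross a GreaterThanThreshold 0.0 threshold. The sample values are hypothetical.

```python
# Hypothetical raw UnHealthyHostCount samples reported within one 60-second period.
# Assumption: one sample saw 1 unhealthy host, the other four saw none.
samples = [0, 0, 1, 0, 0]

THRESHOLD = 0.0  # from the alarm: GreaterThanThreshold 0.0


def period_average(values):
    """Collapse the raw samples into one datapoint, as the Average statistic does."""
    return sum(values) / len(values)


def crosses_threshold(datapoint, threshold=THRESHOLD):
    """The GreaterThanThreshold comparison the alarm applies to each datapoint."""
    return datapoint > threshold


datapoint = period_average(samples)
print(datapoint, crosses_threshold(datapoint))  # 0.2 True
```

Under this reading, even a single brief "unhealthy" observation anywhere in the minute puts the alarm into ALARM when the threshold is 0.0 and only one period is evaluated.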
However, our nginx log files suggest that AWS was able to reach each of our servers around the time the alarm went off. In other words, our EC2 instances returned 200 on every request to /healthcheck around Thursday 21 January, 2016 17:39:39 UTC.
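Checking the nginx logs by hand can be scripted. This is a minimal sketch, assuming nginx's default "combined" log format and a hypothetical five-minute window around the alarm timestamp; it lists every /healthcheck request in that window whose status is not 200. The sample log lines are invented for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches nginx's default "combined" format (an assumption about this setup):
# $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent ...
LINE_RE = re.compile(
    r'^(?P<addr>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def parse_line(line):
    """Return (timestamp, path, status), or None if the line does not match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return ts, m.group("path"), int(m.group("status"))


def failed_healthchecks(lines, alarm_time, window=timedelta(minutes=5), path="/healthcheck"):
    """List (timestamp, status) for health-check requests near alarm_time that did not return 200."""
    failures = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines in other formats
        ts, req_path, status = parsed
        if req_path == path and abs(ts - alarm_time) <= window and status != 200:
            failures.append((ts, status))
    return failures


if __name__ == "__main__":
    alarm = datetime(2016, 1, 21, 17, 39, 39, tzinfo=timezone.utc)
    sample = [  # invented example lines
        '10.0.0.5 - - [21/Jan/2016:17:39:10 +0000] "GET /healthcheck HTTP/1.1" 200 2 "-" "-"',
        '10.0.0.6 - - [21/Jan/2016:17:39:40 +0000] "GET /healthcheck HTTP/1.1" 200 2 "-" "-"',
    ]
    print(failed_healthchecks(sample, alarm))  # [] : every check in the window returned 200
```

One limitation: a 200 in the log only shows that nginx eventually answered. If a response took longer than the health check's timeout, the load balancer would likely still count it as a failure, so adding nginx's `$request_time` to the log format could make this check more conclusive.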
AWS seems to check each of our instances every 30 seconds or so.
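An ELB health check does not flip an instance to unhealthy on a single check: it counts consecutive failures against an unhealthy threshold, and consecutive successes against a healthy threshold before restoring it. The sketch below is a simplified model of that counter logic; the threshold values are assumptions, not this load balancer's actual settings.

```python
from dataclasses import dataclass


@dataclass
class HealthState:
    """Simplified model of how health-check thresholds move one instance between states."""
    unhealthy_threshold: int = 2   # assumed: consecutive failures before "unhealthy"
    healthy_threshold: int = 10    # assumed: consecutive successes before "healthy" again
    healthy: bool = True
    fails: int = 0
    passes: int = 0

    def record(self, passed: bool) -> bool:
        """Record one check result and return whether the instance is now considered healthy."""
        if passed:
            self.passes += 1
            self.fails = 0
            if not self.healthy and self.passes >= self.healthy_threshold:
                self.healthy = True
        else:
            self.fails += 1
            self.passes = 0
            if self.healthy and self.fails >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy


state = HealthState(unhealthy_threshold=2, healthy_threshold=3)
print([state.record(r) for r in [False, False, True, True, True]])
# [True, False, False, False, True]
```

With checks roughly every 30 seconds, this model implies an instance has to fail more than one check in a row before it counts toward UnHealthyHostCount, which is worth comparing against the timestamps in the nginx logs.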
Has anyone experienced this issue? If so, what have you done about it?
I've updated a few settings from ...
... to ...
I will update this answer if my problem continues to occur.
UPDATE:
The problem continued to occur :/
I've updated one more setting on my current UnHealthyHostCount alarm ...
for 1 consecutive period(s)
... to ...
for 2 consecutive period(s)
... and I've created a new alarm to track whether multiple servers are down within a single period ...
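Moving from 1 to 2 consecutive periods means a single minute whose average is above 0 no longer triggers the alarm; every datapoint in the evaluation window has to breach. A minimal sketch of that rule, using hypothetical per-minute datapoints and ignoring CloudWatch's handling of missing data (the first minutes, before a full window exists, are simply treated as OK here):

```python
def evaluate(datapoints, threshold=0.0, evaluation_periods=1):
    """Alarm state after each datapoint: ALARM only if the last N datapoints all exceed threshold."""
    states = []
    for i in range(len(datapoints)):
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        breaching = len(window) == evaluation_periods and all(d > threshold for d in window)
        states.append("ALARM" if breaching else "OK")
    return states


blip = [0.0, 0.2, 0.0, 0.0]    # hypothetical: one transient minute, like the 0.2 in the email
outage = [0.0, 1.0, 1.0, 1.0]  # hypothetical: an instance that stays unhealthy

print(evaluate(blip, evaluation_periods=1))    # ['OK', 'ALARM', 'OK', 'OK']
print(evaluate(blip, evaluation_periods=2))    # ['OK', 'OK', 'OK', 'OK']
print(evaluate(outage, evaluation_periods=2))  # ['OK', 'OK', 'ALARM', 'ALARM']
```

The trade-off is latency: a real, sustained failure is reported one period (about a minute) later.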
I will update this answer if my problem continues to occur.
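Both alarm changes can be kept in code. The sketch below only builds keyword-argument dicts in the shape accepted by the CloudWatch `put_metric_alarm` call in boto3. The namespace, metric, dimension, period, statistic and SNS topic come from the alarm email; the second alarm's name, its threshold of 1.0 and its single evaluation period are assumptions, since the post does not give them.

```python
def unhealthy_host_alarm(name, threshold, evaluation_periods,
                         load_balancer="production",
                         sns_topic="arn:aws:sns:us-east-1:1234567890:DevOps"):
    """Keyword arguments for put_metric_alarm on the ELB's UnHealthyHostCount metric."""
    return {
        "AlarmName": name,
        "Namespace": "AWS/ELB",
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [{"Name": "LoadBalancerName", "Value": load_balancer}],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic],
    }


alarms = [
    # Existing alarm, now requiring 2 consecutive breaching periods.
    unhealthy_host_alarm("elb-production-UnHealthHostCount", 0.0, 2),
    # Hypothetical second alarm: on average more than one unhealthy host within one period.
    unhealthy_host_alarm("elb-production-MultipleUnhealthyHosts", 1.0, 1),
]
```

With boto3 installed and credentials configured, each dict could be passed as `boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)`; that call creates the alarm or updates an existing one with the same name.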