Necronomicon
Jan 18, 2004

I've got a question about monitoring my ECS services and I'm hoping y'all can help. Apologies if this is overly simple; I just don't have a huge amount of experience with alarms/monitoring.

I have three ECS services (a queue service, an admin portal, and a customer-facing portal) and a CloudWatch alarm that compares the number of running tasks to the desired number of tasks for each service. If the number of running tasks dips below the desired number for a certain amount of time, it notifies an SNS topic, and invokes a Lambda function that sends an alarm to a Slack channel. The alarms work fine for the two portals, but the queue service is giving me trouble. Whenever we deploy, the service will shut down its one task (since there can only be one at a time), and then redeploy and pick up queue items that piled up in the meantime. It normally takes about 5-10 minutes. So while my task will report that it stopped, it didn't technically "fail". I'm having trouble distinguishing between a "stopped" task (which is expected during the deploy) and a "failed" task.

There are probably some underlying architectural issues here but I'm being told they're not able to be changed and I have to just make this work. I'm using Container Insights and the RunningTaskCount metric, but I think I'm just looking at this from the wrong angle. Does anybody have any advice?

Edit: From research, I think I probably need to create an EventBridge rule, something like...
code:
{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "clusterArn": "arn:aws:ecs:us-east-1:xxxxxxx:cluster/$cluster",
    "containers": [{
      "containerArn": "arn:aws:ecs:us-east-1:xxxxxxx:container/$container",
      "lastStatus": "RUNNING",
      "name": "test",
      "taskArn": "arn:aws:ecs:us-east-1:xxxxxxx:task/$task"
    }],
    "eventType": ["WARN", "ERROR"]
  }
}
...pointed at my SNS topic, where this particular cluster contains basically just the one service that I need the specific alarm for. I'm still poking around with this and will need to trigger some deliberate task failures to test.

Necronomicon fucked around with this message at 21:39 on Oct 10, 2023

Necronomicon
Jan 18, 2004

Answering my own question here in case it helps anyone. This is the EventBridge Rule I created to catch the event I was looking for:

code:
{
  "detail": {
    "group": ["service:$serviceName"],
    "lastStatus": ["STOPPED"],
    "stoppedReason": [{
      "anything-but": {
        "prefix": "Scaling activity initiated by (deployment"
      }
    }]
  },
  "detail-type": ["ECS Task State Change"],
  "source": ["aws.ecs"]
}
Shamelessly stolen from here.
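
For anyone who wants to sanity-check the rule before triggering real failures, here is a rough local sketch in Python. It is not EventBridge's actual matcher, just a simplified imitation that handles exact values and the "anything-but" prefix case used above. The service name "queue-service", the trimmed-down events, and the deployment ID are all made up for illustration, and treating a missing field as a non-match is my assumption.

```python
# Simplified imitation of EventBridge pattern matching (assumption: only
# exact-value matching and {"anything-but": {"prefix": ...}} are needed).
RULE = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {
        "group": ["service:queue-service"],  # hypothetical service name
        "lastStatus": ["STOPPED"],
        "stoppedReason": [
            {"anything-but": {"prefix": "Scaling activity initiated by (deployment"}}
        ],
    },
}


def value_matches(value, candidates):
    """Return True if `value` satisfies any entry in a pattern's value list."""
    for cand in candidates:
        if isinstance(cand, dict) and "anything-but" in cand:
            prefix = cand["anything-but"]["prefix"]
            if isinstance(value, str) and not value.startswith(prefix):
                return True
        elif cand == value:
            return True
    return False


def matches(pattern, event):
    """Every key in the pattern must exist in the event and match."""
    for key, expected in pattern.items():
        if key not in event:
            return False  # assumption: a missing field never matches
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif not value_matches(event[key], expected):
            return False
    return True


def stopped_event(reason):
    """Build a trimmed-down, made-up ECS Task State Change event."""
    return {
        "source": "aws.ecs",
        "detail-type": "ECS Task State Change",
        "detail": {
            "group": "service:queue-service",
            "lastStatus": "STOPPED",
            "stoppedReason": reason,
        },
    }


# A stop caused by a deploy vs. a stop caused by the container dying.
deploy_stop = stopped_event(
    "Scaling activity initiated by (deployment ecs-svc/0000000000000000000)"
)
crash_stop = stopped_event("Essential container in task exited")

print(matches(RULE, deploy_stop))  # False: deploy stops are filtered out
print(matches(RULE, crash_stop))   # True: this one would reach the SNS topic
```

The point is just that the deploy-driven stop gets filtered out by the prefix while any other stop reason goes through. For a check against the real matcher, the EventBridge API has a TestEventPattern operation that takes the pattern and a full sample event.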
