Amazon Web Services - Cloud Giant Hits Hard - The Something Awful Forums

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Amazon Web Services - Cloud Giant Hits Hard

Necronomicon: Jan 18, 2004

I've got a question about monitoring my ECS services and I'm hoping y'all can help, apologies if this is overly simple, I just don't have a huge amount of experience with alarms/monitoring.

I have three ECS services (a queue service, an admin portal, and a customer-facing portal) and a CloudWatch alarm that compares the number of running tasks to the desired number of tasks for each service. If the number of running tasks dips below the desired number for a certain amount of time, it notifies an SNS topic, and invokes a Lambda function that sends an alarm to a Slack channel. The alarms work fine for the two portals, but the queue service is giving me trouble. Whenever we deploy, the service will shut down its one task (since there can only be one at a time), and then redeploy and pick up queue items that piled up in the meantime. It normally takes about 5-10 minutes. So while my task will report that it stopped, it didn't technically "fail". I'm having trouble distinguishing between a "stopped" task (which is expected during the deploy) and a "failed" task.

There are probably some underlying architectural issues here but I'm being told they're not able to be changed and I have to just make this work. I'm using Container Insights and the RunningTaskCount metric, but I think I'm just looking at this from the wrong angle. Does anybody have any advice?

Edit: From research, I think I probably need to create an EventBridge rule, something like...

code:

{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "clusterArn": "arn:aws:ecs:us-east-1:xxxxxxx:cluster/$cluster
    "containers": [{
      "containerArn": "arn:aws:ecs:us-east-1:xxxxxxx:container/$container",
      "lastStatus": "RUNNING",
      "name": "test",
      "taskArn": "arn:aws:ecs:us-east-1:xxxxxxx:task/$task"
    }],
    "eventType": ["WARN", "ERROR"]
  }
}

...pointed at my SNS topic, where this particular cluster contains basically just the one service that I need the specific alarm for. I'm still poking around with this and will need to trigger some deliberate task failures to test.

Necronomicon fucked around with this message at 21:39 on Oct 10, 2023

# ¿ Oct 10, 2023 19:51

Adbot: ADBOT LOVES YOU

# ¿ May 22, 2024 13:46

Necronomicon: Jan 18, 2004

Answering my own question here in case it helps anyone. This is the EventBridge Rule I created to catch the event I was looking for:

code:

{
  "detail": {
    "group": ["service:$serviceName"],
    "lastStatus": ["STOPPED"],
    "stoppedReason": [{
      "anything-but": {
        "prefix": "Scaling activity initiated by (deployment"
      }
    }]
  },
  "detail-type": ["ECS Task State Change"],
  "source": ["aws.ecs"]
}

Shamelessly stolen from here.

# ¿ Oct 11, 2023 21:22

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Amazon Web Services - Cloud Giant Hits Hard