I've got a question about monitoring my ECS services and I'm hoping y'all can help. Apologies if this is overly simple, I just don't have a huge amount of experience with alarms/monitoring. I have three ECS services (a queue service, an admin portal, and a customer-facing portal) and a CloudWatch alarm for each that compares the number of running tasks to the desired number of tasks. If the running count dips below the desired count for a certain amount of time, the alarm notifies an SNS topic, which invokes a Lambda function that posts an alert to a Slack channel.

The alarms work fine for the two portals, but the queue service is giving me trouble. Whenever we deploy, the service shuts down its one task (since there can only be one at a time), then redeploys and picks up the queue items that piled up in the meantime. That normally takes about 5-10 minutes. So while the task reports that it stopped, it didn't technically "fail". I'm having trouble distinguishing between a "stopped" task (which is expected during a deploy) and a "failed" task. There are probably some underlying architectural issues here, but I'm told they can't be changed and I have to make this work as-is.

I'm using Container Insights and the RunningTaskCount metric, but I think I'm just looking at this from the wrong angle. Does anybody have any advice?

Edit: From research, I think I probably need to create an EventBridge rule, something like... code:
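Roughly this kind of event pattern, I think (cluster ARN here is just a placeholder, not my real one); it matches any ECS task in the cluster that transitions to STOPPED:

```json
{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "clusterArn": ["arn:aws:ecs:us-east-1:111122223333:cluster/my-cluster"],
    "lastStatus": ["STOPPED"]
  }
}
```

The obvious problem is that as written this fires on every stopped task, including the expected stop during a deploy, so it doesn't solve the "stopped vs. failed" question by itself.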
Necronomicon fucked around with this message at 21:39 on Oct 10, 2023

# Oct 10, 2023 19:51
Answering my own question here in case it helps anyone. This is the EventBridge rule I created to catch the event I was looking for: code:
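Something along these lines (the cluster ARN is a placeholder for your own). The key is the stopCode field that ECS puts on the Task State Change event: you match only the codes that indicate a genuine failure and ignore everything else, since a deploy-time stop comes through with stopCode "ServiceSchedulerInitiated" and so never matches the rule:

```json
{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "clusterArn": ["arn:aws:ecs:us-east-1:111122223333:cluster/my-cluster"],
    "lastStatus": ["STOPPED"],
    "stopCode": ["TaskFailedToStart", "EssentialContainerExited"]
  }
}
```

With the rule targeting the same SNS topic/Lambda as before, the Slack alerts only go out for real failures instead of every stopped task.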
# Oct 11, 2023 21:22