|
I have an architecture question. I have a Django REST api running on ECS. I feel there is something off with the configuration because the instances start crashing when 1000 concurrent POST requests to the api occur. I do have autoscaling enabled but I’m not sure if the instances scale fast enough or there is something else going on. We have a separate service using step functions that make the POST requests. These generally happen in bursts. I’m wondering if directly hitting the api is the right pattern or if there is a way to buffer these into some sort of queue that will throttle the POSTs to a more manageable rate.
|
# ? Sep 5, 2022 03:47 |
|
lazerwolf posted:I have an architecture question. I have a Django REST api running on ECS. I feel there is something off with the configuration because the instances start crashing when 1000 concurrent POST requests to the api occur. I do have autoscaling enabled but I’m not sure if the instances scale fast enough or there is something else going on.

I think you're on the right track. If the work can be changed to a model where requests are posted to a queue and consumers pull items off to process, that is going to be infinitely more scalable than trying to handle them synchronously. SQS and SNS are your friends here, if you're fine being AWS native. If that is not possible, we are going to need more specifics about the failures: error messages, what resource is being exhausted.

Docjowles fucked around with this message at 05:07 on Sep 5, 2022
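A rough sketch of that producer/consumer shape with boto3. The queue name, message fields, and handler are all made up for illustration; the point is that the step function enqueues instead of POSTing, and workers drain at their own pace:

```python
import json


def build_message(task_id, payload):
    """Wrap one would-be POST body as an SQS message body (hypothetical fields)."""
    return json.dumps({"task_id": task_id, "payload": payload})


def enqueue(sqs, queue_url, task_id, payload):
    """Producer side: the step function sends here instead of hitting the API."""
    sqs.send_message(QueueUrl=queue_url, MessageBody=build_message(task_id, payload))


def drain(sqs, queue_url, handle):
    """Consumer side: pull up to 10 messages with long polling, process, delete."""
    resp = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        handle(json.loads(msg["Body"]))  # the work the POST handler used to do
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```

With `sqs = boto3.client("sqs")`, the consumer loop becomes the throttle: however many workers you run is however fast the backend gets hit, and SQS holds the burst in the meantime.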
# ? Sep 5, 2022 05:04 |
|
Docjowles posted:I think you're on the right track. If the work can be changed to a model where requests are posted to a queue and consumers pull items off to process, that is going to be infinitely more scalable than trying to handle them synchronously. SQS and SNS are your friends here, if you're fine being AWS native.

Thanks for the response. We have control over our stack. I’m wondering if this is a premature optimization. I feel our API should handle 1000 requests without breaking, but who knows what the true number might be. Long term this feels like a good approach for this service.
|
# ? Sep 6, 2022 22:08 |
|
if you can pin it to 1000 requests exactly, that feels like a file descriptor limit or one of the other famous historic docker footguns. you're right that you shouldn't need to use a queue to serve this, but it's hard to offer much advice without knowing more about what "crashed" means. this was on ecs, right? the containers presumably went unhealthy and were reaped by the ecs scheduler; the main reasons this can happen are pid 1 exiting or a load balancer health check failing.
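One quick thing to check from inside the container (assumes a Unix container, since Python's `resource` module is Unix-only): the open-file limit, since each in-flight request holds at least one socket and "falls over at exactly N" is the classic symptom.

```python
import resource

# Each concurrent connection consumes at least one file descriptor, so a
# soft limit of 1024 can fall over right around ~1000 concurrent requests.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft} hard={hard}")
```

On ECS the equivalent knob is the `ulimits` setting in the task definition's container config.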
|
# ? Sep 6, 2022 22:17 |
|
My apologies, a number of requests (~700 or so) POSTed to the API fine. Then we started seeing failed Step Function executions. The main error was that the API started sending 502 responses back. I have a feeling autoscaling was not configured correctly, because it looks like the CPU usage of the instances went to 100%. We always have 2 instances running and it should have been able to scale up to at least 5.
|
# ? Sep 6, 2022 22:39 |
|
So the containers aren't actually crashing/exiting, and it's a load issue? I'd check your container CPU and memory sizing, then the autoscaling thresholds, and potentially make them more aggressive. If your step function is regularly scheduled you could pre-scale ECS. A hack would be to have the step function itself scale ECS. But from a reliability stance, what I'd really like to see is a backoff-and-retry mechanism in the client making the API requests. It should be able to handle transient failures like a 502 without falling over.
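A minimal sketch of that backoff-and-retry client, using full-jitter exponential backoff (the strategy AWS's own SDKs use); `TransientError` is a stand-in for however your client signals a retryable response like a 502:

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable failure, e.g. an HTTP 502 from the API."""


def post_with_retry(do_post, attempts=5, base=0.5, cap=30.0):
    """Call do_post(); on transient failure, sleep with full-jitter
    exponential backoff and try again, up to `attempts` tries."""
    for attempt in range(attempts):
        try:
            return do_post()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries, surface the failure to the caller
            # full jitter: uniform between 0 and the capped exponential delay
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

In a step function you get this mostly for free via the `Retry` block on a task state (`IntervalSeconds`, `BackoffRate`, `MaxAttempts`), which may be simpler than rolling it in code.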
|
# ? Sep 7, 2022 03:35 |
|
Four days ago: Please migrate all your stuff over to M6I instances because they're cheaper and faster for our workloads than the M5 instances we're currently running.

Today: Please stop migrating to M6I because AWS ran out of capacity.

I guess someone forgot to talk to their sales rep.
|
# ? Sep 9, 2022 20:20 |
|
For the life of me I can't get past this error: "An error occured (403) when calling the HeadObject operation: Forbidden". The workflow is basically s3 bucket event notifications -> SQS -> Lambda -> Elastic. The Lambda role has s3:GetObject, GetObjectAttributes, and ListBucket rights. The s3 bucket allows my Lambda to call GetObject and GetObjectAttributes on "arn:aws:s3:::aws-bucket/*" and ListBucket on "arn:aws:s3:::aws-bucket". I can confirm that the objects my lambda is trying to get are actually there; they exist and are in the bucket. I have no clue at this point and am about to pull my little remaining hair out.
|
# ? Sep 20, 2022 20:04 |
|
Is the lambda in a different AWS account than whatever writes the objects to the bucket? You might need to mess with object ownership settings, such as turning on “bucket owner enforced”.
|
# ? Sep 20, 2022 20:23 |
|
BaseballPCHiker posted:For the life of me I cant get past this error. "An error occured (403) when calling the HeadObject operation: Forbidden"

If you feel pretty solid on the policy side, maybe your lambda isn't using the right creds. Log out a call to sts get_caller_identity to confirm.
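For example, a tiny helper for that log line (the field names are from the real `get_caller_identity` response shape; how you wire it into the handler is up to you):

```python
def describe_identity(identity):
    """Render an sts get_caller_identity response dict as one log line."""
    return f"acct={identity['Account']} arn={identity['Arn']}"
```

Then at the top of the Lambda handler, something like `print(describe_identity(boto3.client("sts").get_caller_identity()))` tells you exactly which role (and which account) S3 is actually seeing.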
|
# ? Sep 20, 2022 20:25 |
|
Docjowles posted:Is the lambda in a different AWS account than whatever writes the objects to the bucket? You might need to mess with object ownership settings, such as turning on “bucket owner enforced”.

The Lambda is in a different account. That gives me an avenue to go down, thank you!

Nukelear v.2 posted:If you feel pretty solid on the policy side, maybe your lambda isn't using the right creds. Log out a call to sts get_caller_identity to confirm.

I feel like I've triple checked the policy side. Will give that a shot as well, thank you!
|
# ? Sep 20, 2022 20:30 |
|
BaseballPCHiker posted:The Lambda is in a different account. That gives me an avenue to go down, thank you!

Since the IAM service is account-scoped, "Cross Account" is a huge, huge, huge piece of added complexity. Definitely include it with all of your search terms, forum posts, etc. From experience I would guess at one of two things being wrong. The first is, like Docjowles was getting at, you uploaded your object with the BUCKET_OWNER_FULL_CONTROL (nobody else has access) canned ACL, which is interfering with your policy in some way. The other would be missing account ids in your various permissions policies. The Lambda function needs its own account-local role to assume, and that role needs a policy that allows access to s3 ARNs that have the bucket account's account id in them. Similarly, the bucket account's bucket policy needs to trust the Lambda account's role; it's probably also a good idea to include the account id in the role ARN as well. Other than that, it seems like you did everything correctly, so an AWS support ticket might be in order. They're usually pretty good at debugging cross account permissions gotchas, and they have slightly more access to your stuff than we do.
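A sketch of what the bucket-account side of that trust looks like, built as a Python dict you'd `json.dumps` into `put_bucket_policy`. The bucket name and role ARN are hypothetical; it mirrors the actions and resources from the post above:

```python
def cross_account_read_policy(bucket, reader_role_arn):
    """Bucket policy (attached in the bucket's account) trusting a role
    from the Lambda's account for read access. Names are hypothetical."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # object-level actions need the /* resource
                "Effect": "Allow",
                "Principal": {"AWS": reader_role_arn},
                "Action": ["s3:GetObject", "s3:GetObjectAttributes"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                # ListBucket applies to the bucket ARN itself, no /*
                "Effect": "Allow",
                "Principal": {"AWS": reader_role_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }
```

The Lambda's own role in the other account still needs a matching identity policy allowing the same actions on the same ARNs; cross-account S3 requires both halves to agree.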
|
# ? Sep 20, 2022 21:13 |
|
Is the bucket using a custom KMS key for encryption? If it is then that key also needs to have a resource policy that also grants access to the other account's principal. You also won't get a KMS-specific error, just the regular forbidden error.
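The key-policy statement for that looks something like the following sketch (principal ARN hypothetical); `kms:Decrypt` is the one S3 needs on the reader's behalf for GetObject against an SSE-KMS object:

```python
def kms_cross_account_statement(principal_arn):
    """Key-policy statement granting another account's principal the
    KMS permissions needed to read SSE-KMS objects. ARN is hypothetical."""
    return {
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": ["kms:Decrypt", "kms:DescribeKey"],
        # In a key policy the Resource is always the key itself
        "Resource": "*",
    }
```

This statement gets appended to the key's existing policy; the reading role also needs the same KMS actions in its own identity policy.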
|
# ? Sep 20, 2022 23:14 |
|
S3 buckets have resource-based policies attached to the bucket. It needs to also allow the same actions that you are adding in your Lambda's role. You can also look into the Policy Simulator, to see if that helps you identify the missing piece of your policy: https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_testing-policies.html Otherwise, everything else people have said is also correct. You need appropriate permissions on both accounts for KMS as well if you're using that. You may also be running into S3 ACL issues which are a total nightmare to deal with cross-account. This might provide some more information as well: https://aws.amazon.com/premiumsupport/knowledge-center/cross-account-access-s3/
|
# ? Sep 21, 2022 00:12 |
|
Just an update on my issue for people at the edge of their seats. It ended up being a KMS issue. The bucket I was trying to read from was using AWS managed SSE, with the default KMS policy that entails.

JehovahsWetness posted:Is the bucket using a custom KMS key for encryption? If it is then that key also needs to have a resource policy that also grants access to the other account's principal. You also won't get a KMS-specific error, just the regular forbidden error.

Basically what JehovahsWetness said. I really wish there was a more KMS-specific error message there.
|
# ? Sep 27, 2022 18:04 |
|
BaseballPCHiker posted:Basically what JehovahsWetness said. I really wish their was a more KMS specific error message there.

Nope, just a pitfall you have to run into once. Just like needing Decrypt permission on a KMS key to upload files to an encrypted bucket, when you wouldn't expect to need it because you're only encrypting a file.
|
# ? Sep 27, 2022 20:19 |
|
BaseballPCHiker posted:For the life of me I cant get past this error. "An error occured (403) when calling the HeadObject operation: Forbidden"

Cross-account access is an exercise in "why is my toddler crying" debugging, especially with KMS-encrypted resources. Could it be the IAM permissions on the calling role not permitting access to the bucket? Could it be the bucket policy not permitting access from that account/role? Could it be the IAM permissions on the role not permitting access to the KMS key? Could it be the KMS key policy not permitting access from that account/role? Could it just need a loving nap?
|
# ? Sep 27, 2022 20:56 |
|
I spent a few hours debugging an issue with an instance that wouldn’t start and the problem ended up being that the AMI was shared across accounts and KMS encrypted and the permissions for the key had changed. The error messages were zero help on that one.
deedee megadoodoo fucked around with this message at 16:15 on Sep 28, 2022 |
# ? Sep 28, 2022 16:12 |
|
Anyone else's day being ruined by the us-west-2 API gateway outage? 3 hours and counting now.
|
# ? Sep 28, 2022 21:12 |
|
ledge posted:Anyone else's day being ruined by the us-west-2 API gateway outage? 3 hours and counting now.

I've been troubleshooting an issue with a new connection to an API Gateway that resides in us-west-2. It's returning 504 responses sporadically for seemingly no reason. I wonder if it's related.

Edit: Hasn't happened now for almost 3 hours, so it would seem to be related. Really annoying that I didn't know that outage was going on. Since it was a brand new setup, I thought it was caused by something I'd set up incorrectly.

Scrapez fucked around with this message at 00:40 on Sep 29, 2022
# ? Sep 29, 2022 00:32 |
|
I have a probably dumb question relating to Data Migration Service. I have an Aurora Postgresql database as both my source and target. Both databases have the same schema but have different data in them. If I set up DMS to migrate and then replicate the source to the target, will it overwrite the current data in the target database, or will it only add records that are unique, leaving the target db data intact with just the additions from the source db?
|
# ? Sep 29, 2022 00:34 |
|
ledge posted:Anyone else's day being ruined by the us-west-2 API gateway outage? 3 hours and counting now.

Nope, because all my stuff is in us-east-1. Do not ask me how many other days have been ruined by that fact.
|
# ? Sep 29, 2022 00:40 |
|
luminalflux posted:Cross-account access is an exercise in "why is my toddler crying" debugging. Especially with KMS-encrypted resources.

Add config rules and SCPs to that list as well, fun times!!! I learned a lot at least, not having done much cross-account dev work in the past.
|
# ? Sep 30, 2022 16:06 |
|
BaseballPCHiker posted:Add config rules and SCPs to that list as well, fun times!!!

jfc i forgot about SCPs too. We got sent down a multi-account path by a manager who had never had to deal with the consequences of it and I curse his name each time we have to deal with it.
|
# ? Sep 30, 2022 21:23 |
|
I'm trying to figure out amazon pricing using their calculators: https://calculator.aws/#/addService/AuroraMySQL

All I want to do is:
1. Make a Lambda call that may take ~4-5 minutes to complete.
2. Shove those results into Aurora/MySQL Serverless.
3. Do this maybe 2-4 times a day.

I'm looking at the calculator for Aurora Serverless and it's kinda freaking me out, since it says it'll cost ~$44 a month using 0.5 ACUs (aurora capacity units) per hour, which is the smallest you can choose. Does this calculation mean how much it'll cost me if I constantly make calls to their service, or does it mean that keeping it hot will cost me that much? Seems absurdly expensive if it's the latter. I don't want to set this up, forget about it, and get billed next month for doing nothing.

Strong Sauce fucked around with this message at 04:11 on Oct 3, 2022
# ? Oct 3, 2022 04:06 |
|
Aurora is expensive. All RDS is, in AWS. Honestly dynamodb, or something fun with sqlite and litestream/s3, might make more sense. Hell, s3 alone might make the most sense for really cheap.

Edit: someone combined the two: https://twitter.com/__steele/status/1361917626050514944?s=46&t=C4AGR4wLWXZI7AkcvRB25Q

freeasinbeer fucked around with this message at 04:29 on Oct 3, 2022
# ? Oct 3, 2022 04:26 |
|
+1 if you don't actually need a traditional database, don't use one. DynamoDB or S3 + Athena could end up costing pennies compared to Aurora.
|
# ? Oct 3, 2022 05:17 |
|
Docjowles posted:+1 if you don't actually need a traditional database, don't use one. DynamoDB or S3 + Athena could end up costing pennies compared to Aurora.

At work we have a large no-sql database in mongodb, with no cross session writes. I'm going to ask for a research project next year to do a proof of concept of replacing the whole thing with EFS.
|
# ? Oct 3, 2022 05:33 |
|
Strong Sauce posted:All I want to do is

If you use Aurora Serverless v1 and configure it to pause on inactivity, you'll only pay for the storage it uses while it's paused, not the hourly compute costs. I doubt the calculator can factor that in; also, you're likely looking at serverless v2 (since you mention 0.5 ACU), and v2 can't do that. The downside is that when you need it to come up again, it takes around 30 seconds.
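The auto-pause knob lives in the cluster's `ScalingConfiguration`. A sketch of that config as you'd pass it to boto3's `rds.create_db_cluster` (the 1/2 ACU bounds and 5-minute idle window here are illustrative defaults, not recommendations):

```python
def serverless_v1_scaling(min_acu=1, max_acu=2, pause_after_s=300):
    """ScalingConfiguration for an Aurora Serverless v1 cluster that
    pauses itself after `pause_after_s` seconds of inactivity."""
    return {
        "MinCapacity": min_acu,
        "MaxCapacity": max_acu,
        "AutoPause": True,
        "SecondsUntilAutoPause": pause_after_s,
        "TimeoutAction": "ForceApplyCapacityChange",
    }
```

You'd pass this along with `EngineMode="serverless"` when creating the cluster; while paused you pay for storage only, and the first connection after a pause eats the ~30-second resume.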
|
# ? Oct 3, 2022 16:08 |
|
Anyone else having issues with ssm sessions over ipv6?
|
# ? Oct 3, 2022 16:48 |
|
Heard from a former SDE3 at Amazon that the way they decide whether to use an EC2 or a Lambda internally is: "if latency > 100ms and TPS < 30k then use Lambda; else, use EC2"
|
# ? Oct 3, 2022 20:04 |
|
It’s a bit more nuanced than that, but pretty much, yeah. There are also more steps between EC2 and Lambda (ECS on Fargate, ECS on EC2).
|
# ? Oct 3, 2022 20:26 |
|
At the end of the day its all ec2 anyway just various abstractions, warm pools, etc between services.
|
# ? Oct 3, 2022 21:29 |
|
Is it okay to use the default VPC or should I always make a fresh VPC in my terraform or whatever? I need to put a few lambdas into a VPC because they need to talk to a VPC Endpoint to talk to a thing running in another account and welp it's just a lot of boilerplate isn't it.
|
# ? Oct 4, 2022 02:15 |
|
Every account should have exactly one VPC, of a suitable size for your business model and infrastructure complexity, per region. /16 is a pretty good default, but reserve an ipv4 cidr block from your corporate ipv4 address allocation matrix so you don't create routing conflicts down the line. This strategy will scale with you up into the $100mm/month spend range in AWS without any significant problems. If permanently reserving an ipv4 cidr block and spinning up a VPC and subnets stack is too heavy of a lift for your thing, that's a really good heuristic for "this thing should live in a different account".
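The carving itself is mechanical once the block is reserved; a quick sketch with the stdlib `ipaddress` module (the 10.42.0.0/16 allocation is made up):

```python
import ipaddress

# One reserved /16 per account+region, carved into /20 subnets,
# e.g. one per availability zone per tier (public/private/data).
vpc = ipaddress.ip_network("10.42.0.0/16")  # hypothetical allocation
subnets = list(vpc.subnets(new_prefix=20))  # 16 x /20, 4094 usable hosts each
```

Doing the carve up front like this is what keeps later subnets from overlapping when a new AZ or tier gets added.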
|
# ? Oct 4, 2022 02:23 |
|
Eh, they're trying to cut down on handing out routable CIDR ranges. I'm pretty sure asking for a /16 would raise eyebrows; for an actual "gonna run a bunch of computey things" account I think they just gave me a /22 or /21 or something. So far this account is mostly metrics and IAM roles and that sort of thing, plus these couple lambdas that are periodically doing things to CloudWatch alarms, based on data from this service I'm wanting to call via privatelink. Right now it calls the legacy version of the thing via API Gateway, but the shiny new version isn't going to have an API Gateway. Having a tiny VPC just for these lambdas and nothing else seems least likely to cause conflicts down the line, and in the worst case I can spin up this stuff somewhere else and delete the VPC again. If they get a lot of scope creep the lambdas should probably move to a better platform than this pile of terraform, too, but I shouldn't invest a lot of time in that right now.
|
# ? Oct 4, 2022 02:53 |
|
Vanadium posted:Eh, they're trying to cut down on handing out routable CIDR ranges. I'm pretty sure asking for a /16 would raise eyebrows, for an actual "gonna run a bunch of computey things" account I think they just gave me a /22 or /21 or something.

Having multiple VPCs is basically always a bad idea because of cross-vpc billing. Just make one mega VPC, and make it big enough forever.
|
# ? Oct 4, 2022 03:01 |
|
VPC peering doesn't have a cost anymore, I think: https://aws.amazon.com/about-aws/whats-new/2021/05/amazon-vpc-announces-pricing-change-for-vpc-peering/ The only use case I've had for multiple VPCs in the same account is associating a Route 53 Private Zone for `.` as a tool for allowing internal VPC DNS to work while suppressing public DNS resolution in an "air gapped" VPC. Neat trick, but the association is VPC-wide, so to have some things WITH public DNS in there, you have to peer two VPCs together.
|
# ? Oct 4, 2022 04:12 |
|
crazypenguin posted:VPC peering doesn't have a cost anymore I think https://aws.amazon.com/about-aws/whats-new/2021/05/amazon-vpc-announces-pricing-change-for-vpc-peering/

huh. I somehow missed this change, because we used to pay through the nose for some questionable multiple-VPC architecture.
|
# ? Oct 4, 2022 04:32 |
|
|
Brain dump:

- EKS is greedy with IPs, so sometimes a /22 or /23 doesn’t cut it.
- What is _not_ free is transit gateways, so a bunch of accounts with a bunch of VPCs is a great way to spend a ton of money.
- VPC sharing works, except like nothing works with it: you can’t run EKS, RDS, or anything that peers from amazon.
- VPCs can only route at max 40-50k IPs.
|
# ? Oct 9, 2022 05:09 |