The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Hughmoris posted:

I am now a certified AWS Solutions Architect - Associate! :yayclod:

Now, to figure out my next steps. My current position has me loosely related to data and security work. So, maybe committing to better learning the AWS Databases, Data Analytics, or Security domains?

The end goal being a position where I get to solve interesting problems and make lots of money.

Databases and security are probably the most relevant in my experience, but you can obviously make anything work.

you really do need to understand databases eventually, and security’s important for obvious reasons. I’ll also take a moment here to recommend “Designing Data-Intensive Applications”, which I’ve still only made it a third of the way through because it’s dense as hell, but it really does teach you the foundational principles underlying your data storage and retrieval options. Not relevant to AWS certs, but deeply relevant to understanding the data needs of any given service.

The Iron Rose fucked around with this message at 21:38 on May 24, 2022


The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
for the love of god do not use AWS’ managed NAT gateway service. It’s insanely expensive and you can do the same thing for a fraction of the price by running your own NAT instances.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

luminalflux posted:

The issue I have with "you can just..." is that yeah, sure, I can go create my own NAT instances. I'd also have to keep them up to date and patched. I'd also have to deal with failovers or potential outages if the instance suddenly goes away. All this I don't have to deal with by using NAT gateways

Corey Quinn keeps beating this drum and I'm convinced he's never had to run NAT instances in production.

spot instances and ASGs pulling from latest with ILBs solve these problems nicely. Just because you’re running your own hardware doesn’t mean it can’t be ephemeral also

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

luminalflux posted:

When you say "ILB", do you mean running an NLB in front of the NAT instances?

Yes, but I’ve done more digging and apparently this isn’t possible, so it’s back to reassigning a pool of ENIs annoyingly. Which I would personally do with eventbridge listening to startup events for the relevant instances and using a lambda to assign the ENI with a static IP. Insanely quick and you’ll attach within < 10s of startup.
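To make that eventbridge → lambda flow concrete, here’s a minimal sketch of the handler, assuming a pool of pre-created ENIs identified by a made-up tag (`Role=nat-pool`); the tag name and device index are arbitrary choices, not anything AWS prescribes:

```python
# Hypothetical Lambda handler: attach a pre-created ENI (carrying a static
# private IP) to a NAT instance as soon as EventBridge reports it running.

def instance_id_from_event(event):
    """Pull the instance ID out of an EC2 state-change EventBridge event."""
    return event["detail"]["instance-id"]

def handler(event, context):
    import boto3  # imported lazily so the pure helper above stays testable
    ec2 = boto3.client("ec2")
    instance_id = instance_id_from_event(event)
    # Find an unattached ENI from the pool.
    enis = ec2.describe_network_interfaces(
        Filters=[
            {"Name": "tag:Role", "Values": ["nat-pool"]},
            {"Name": "status", "Values": ["available"]},
        ]
    )["NetworkInterfaces"]
    if not enis:
        raise RuntimeError("no free ENI in the nat-pool")
    ec2.attach_network_interface(
        NetworkInterfaceId=enis[0]["NetworkInterfaceId"],
        InstanceId=instance_id,
        DeviceIndex=1,  # secondary interface; eth0 stays the instance's own
    )
```

Trigger it from an EventBridge rule matching `EC2 Instance State-change Notification` with `state: running` for your NAT instance ASG.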

The Iron Rose fucked around with this message at 19:28 on Jun 3, 2022

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

luminalflux posted:

That sounds like a huge rube goldberg machine to build for saving $0.045 per GB, and reading that makes me feel like something will break and leave me with no egress traffic for too long. Sure, it's not nothing but instead it's easier to go down the route of private pricing / EDP and negotiate this down.

Eventbridge uses CloudWatch Events under the hood and has 99.99% availability vs NAT gateway’s 99.9%. It’s definitely more work, and a more complex architecture, but saving on the 4.5c/hour and the $0.045/GB egress adds up very quickly at scale.

Or just manage the EC2 instance like you would any other service that needs a full VM (which we all have unfortunately).

At the end of the day if it’s not your budget, it’s not your budget. But it can make a big difference if you’re serving a lot of content. There are other ways to alleviate that (CDNs!), but it’s an easy win that can save lots of money with about 20 lines of Python.
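As a rough illustration of that math, at the list rates mentioned above ($0.045/hour plus $0.045/GB processed) — illustrative only, real pricing varies by region:

```python
def nat_gateway_monthly_cost(egress_gb, hourly=0.045, per_gb=0.045, hours=730):
    """Rough monthly NAT gateway list price: hourly charge plus data processing."""
    return hourly * hours + per_gb * egress_gb

# ~$33/month just to exist, and the data processing dominates at scale:
# nat_gateway_monthly_cost(10_000) -> 32.85 + 450.00 = 482.85
```

At 10 TB/month of processed traffic you’re paying more for the gateway than for a fleet of small NAT instances, which is the whole argument.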

The Iron Rose fucked around with this message at 20:37 on Jun 3, 2022

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
This is perhaps a silly question, but is this not something you can do with a reverse proxy, ingress controller, or service-service authentication via an API gateway?

I’m instinctively leery of running a service oriented architecture on a fleet of full VMs, though obviously it’s perfectly possible. The need for reserved IPs (for IP based allowlisting) is also a bit of a red flag.

If you really need to stick with VMs, eventbridge is your friend if you can detect the relevant CRUD event on a given resource. From there you can trigger lambdas to allocate/release/apply your ENI/SG changes, add queueing with SQS, have fun.

If you can swing it though I’d really try to get away from running full VMs, even if they’re all ephemeral spot instances.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
You’re gonna learn a lot but this is probably not the right choice for your employer

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Docjowles posted:

I am falling out of love with terraform more and more all the time. It was a revelation when it came out and I was a die hard Terraform fan for quite a while. But even after years of improvement and passing 1.0, MAN is HCL still kind of awkward and lovely to work with. Any kind of non trivial loops or conditionals are just the most tortured crap. You can “fix” it by wrapping it with something else but at some point, I would just rather be using something else entirely than papering over shortcomings.

I haven’t played with it much yet but I know a number of CDK true believers. Haven’t met any Pulumi users in the wild.

God yes. Terraform was revolutionary but it’s very quickly becoming too cumbersome to use. New and better abstractions are badly needed.

The main benefit of it is that it’s not really coding, so it’s not intimidating for devops peeps who don’t know how to code, and it’s incredibly accessible as a result. That’s to its credit, but targeting that audience naturally comes with problems for those who do know how to code, especially because, unlike ansible, it’s not as easily extensible.

I haven’t used Pulumi in a production environment but I’d dearly like to.

E: vvvvv lmao I thought we were in that thread, death to state

The Iron Rose fucked around with this message at 14:49 on Aug 19, 2022

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
AWS identity centre is great, though it’s kinda garbage that they have permission sets instead of straight up IAM roles. Comes built in with tooling to support CLI access too which is stellar.

It’s important to be aware that this is only for human access to AWS services, whether you access via the API, CLI, or via the console. Machine to machine auth will continue to use IAM roles like normal. Applications within AWS that implement SSO with Cognito or custom oauth/oidc/SAML configs will be unaffected assuming that they delegate trust to your non-AWS identity provider.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

12 rats tied together posted:

IMHO It's much better UX to just use regular IAM roles:

- Create everyone an IAM user (trivial) and configure the user with an MFA device (<10 min per user, async slack message)
- Give them a policy that allows sts:AssumeRole on an arbitrarily complex mapping of AccountName + RoleName
- Add a condition that only allows sts:AssumeRole to succeed with a valid MFA code

That's it. If they want to use the CLI, assume a role. If they want to use the web UI, log in like normal -> assume a role. If you want to swap accounts in the web UI, hit the "Switch Role" drop down in the interface, which has a handy "recent roles" feature. If you want to swap accounts in the CLI, use a role profile, or juggle some vars.

From a management perspective this lets you implement "The product management role has RestartInstances permissions in the UAT accounts" by, very simply, adding a policy element that grants this action, to the product management role, in the UAT accounts.

To do this in AWS SSO you would need to either create a new permission set -> which results in a new option for the user to select, called like "ProductRoleUATAccount", that has different permissions from ProductRoleProductionAccount, or you would need to overload the policy to check for aws:PrincipalAccount -> ForAnyValue -> StringEquals -> each, UAT, account, ID.

If you do ForAnyValue + StringEquals you also have to explain to your auditor why your in-scope permissions policies have "Allow Product" statement in them, but it's totally safe because of this nested JSON, and we ensure that changes to the nested JSON are safe because of [...]. It's bad.

This means you gotta write your own sync of your IDP to IAM users though, which is a large ask. I don’t think the AWS SSO approach of having different roles for different accounts is all that bad, assuming in your example UAT account and Production Account are different AWS accounts.

I can definitely see that breaking down at scale if you have hundreds or thousands of AWS accounts, but for most orgs with a double-digit or smaller number of AWS accounts, using AWS SSO is easy if you use one permission set/role per AWS account.

I don’t really believe in multiple roles per user group per AWS Account because if you can sign into role A you can sign into role B, so I question whether or not you gain much meaningful security. “Click the role that corresponds to the AWS account” is not, I feel, a big ask from a UX perspective.


E: AWS def hosed up by tying identity so closely to an AWS account imo. GCP’s projects and Azure’s Resource Groups are much much cleaner implementations of namespacing from a user perspective.
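For reference, the MFA-gated AssumeRole policy described in the quoted post can be sketched as a Python dict suitable for boto3’s IAM calls; the account IDs and role name below are placeholders, and the real `Resource` list would be generated from your account inventory:

```python
import json

# Placeholder ARNs — substitute your real account IDs and role names.
TARGET_ROLES = [
    "arn:aws:iam::111111111111:role/ProductRole",  # UAT
    "arn:aws:iam::222222222222:role/ProductRole",  # production
]

assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": TARGET_ROLES,
            # sts:AssumeRole only succeeds when the caller signed in with MFA.
            "Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}},
        }
    ],
}

policy_json = json.dumps(assume_role_policy)
```

You’d attach this (via `iam.put_user_policy` or a managed policy) to each IAM user, which is exactly the per-user sync work being objected to above.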

The Iron Rose fucked around with this message at 23:02 on Nov 28, 2022

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

kalel posted:

can someone recommend a decent tutorial project for terraform and/or ansible that has a bit more complexity than "here's an ec2 that prints hello world"

setting up a pihole in the cloud and making it HA with shared configs and blocklists using ASGs, spot instances, and EFS is usually my go-to for the people I mentor. Add an ALB or NLB, monitoring with CloudWatch, alerting and logging, and so on. Make it run in a container and use certbot and HTTPS for your internal domain. Restrict access to only your public IP of course so AWS doesn’t yell at you for running an open resolver, or set up OpenVPN along with it with profiles for iPhone/Android, computers, and so on. Configure DNS over HTTPS. Deploy your terraform and ansible with CI/CD using GitHub Actions.

Lose the load balancer and you can do this all in the free tier.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
Your use case screams “terraform CDK/Pulumi”. Both have rich support for most popular languages, including Java, JavaScript/TypeScript, and your standard Python, Go, etc. It’s designed to do what you’re trying to do.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Pile Of Garbage posted:

Have to admit I just did that recently for a new cloud-native environment. Put everything in Azure except for the public DNS zones which I put in AWS R53 because it was just easier.

This is one of the most cursed takes I’ve seen in this thread

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Hed posted:

We have an Application Load Balancer that is ingress for a k8s application. Most of the time it works really well but occasionally it just times out. Running curl -v https://app.com doesn’t even negotiate TLS and then times out. Once it works it seems to be sticky. Intermittent so it’s hard to debug.
Looking at the health checks for the app it seems fine.

What should I be looking at to debug this? Shouldn’t the ALB negotiate TLS with a client first or is it “smart” and making sure the app is in a good state.


First things first, double, triple, and then quadruple check your route tables, subnets, and firewall rules, and open an AWS support case.

Once you’ve done that, you need to start collecting data, beginning with “what % of requests fail vs succeed, and what is the customer/business impact of this failure rate?” If it’s negligible, it may be worth tossing it into the backlog graveyard. If it causes issues that cannot be auto-recovered, or that cause SLO breaches, then you need to invest the time required to collect more information.

if this is exhibited both with curl and within your client applications, you’re unlikely to find the cause solely on the client side, but that doesn’t mean request properties don’t influence the overall result. You need to instrument both your calling service (whether that’s an API client, a mobile application, or a JS frontend) and your upstream k8s receiving service with telemetry events that tell you what combination of properties is associated with your failing requests.

For example: are your failing requests evenly distributed among geographic origins? Source subnets? Destination handler routes? What about user agents, unique IDs, variable values? Which combinations of feature flags are enabled? If the failure is sticky across the duration of a TCP session, what happens when you establish a new session within a process lifetime, and do you measure that in your client? Do you see a greater rate of failures when your ALB is distributing requests to backend pods in subnet A vs subnet B? What about when your requests come from subnet A vs B?

Which HTTP methods are most commonly used, and do the proportions differ between failed and successful calls? If you find that it’s mostly POST/PUT requests, what’s the size of your request body in bytes? Have you instrumented your nginx ingress controller with the available open source tracing plugins? How many HTTP connections did your service have open during the trailing one-second period prior to the failed request?

Ultimately you need to identify interesting dimensions and find differences in those dimensions between successful and failed requests. If you are not able to see something obvious from the network or config layers, you need to proceed from first principles:

1. Think about what you are trying to understand
2. visualize (and generate) telemetry data to find relevant performance anomalies
3. search for common dimensions within your anomalous areas by grouping and filtering event attributes
4. Have you isolated likely dimensions that indicate possible sources of your anomalous behaviour? If not, repeat.

Now would be a good time to look into opentelemetry if you haven’t yet. You can get a lot of the above information from automatic instrumentation libraries and open source visualization backends like jaeger.

Even if it is a network routing issue, or an ALB implementation issue, collecting this telemetry information is going to give you the essential data you need to work with your cloud provider’s support services, and will pay extraordinary dividends in the future.
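Step 3 above — grouping and filtering event attributes — can be sketched as a pure function over request events, assuming each event is a flat dict of attributes plus a success flag (the field names here are invented for illustration):

```python
from collections import defaultdict

def failure_rate_by_dimension(events, dimension):
    """Failure rate per value of one attribute, e.g. 'subnet' or 'method'.

    events: iterable of dicts like {"subnet": "a", "method": "POST", "ok": True}
    """
    totals, failures = defaultdict(int), defaultdict(int)
    for e in events:
        key = e.get(dimension, "unknown")
        totals[key] += 1
        if not e.get("ok", False):
            failures[key] += 1
    return {k: failures[k] / totals[k] for k in totals}

# A dimension whose values show wildly different rates is worth drilling into:
events = [
    {"subnet": "a", "ok": True},
    {"subnet": "a", "ok": True},
    {"subnet": "b", "ok": False},
    {"subnet": "b", "ok": True},
]
# failure_rate_by_dimension(events, "subnet") -> {"a": 0.0, "b": 0.5}
```

In practice a tracing backend does this grouping for you, but the principle — compare the distribution of each attribute between failed and successful requests — is the same.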

The Iron Rose fucked around with this message at 06:18 on Jun 29, 2023

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

jaegerx posted:

You're running 67 pods and running out of IPs? Are you using IPAM? What's your CNI? How many nodes? I think the default for EKS is 150 pods per node, but you're nowhere near hitting that.

It depends on how many network interfaces the underlying instance has. If you have small workloads and small nodes, you can’t run all that many pods on them at once. At least when using the default CNI. A t3.small supports a grand total of 11 pods, including system daemonsets.
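That ceiling comes from the formula the default VPC CNI uses: each ENI contributes all of its IPv4 addresses except its primary one, plus two extra slots for the host-networking system pods. A quick sketch:

```python
def max_pods(enis, ips_per_eni):
    """Default AWS VPC CNI pod ceiling: each ENI donates all but its primary
    IPv4 address, plus 2 for host-networking pods (aws-node, kube-proxy)."""
    return enis * (ips_per_eni - 1) + 2

# t3.small: 3 ENIs x 4 IPv4 addresses each -> 11 pods
# m5.large: 3 ENIs x 10 IPv4 addresses each -> 29 pods
```

The per-instance-type ENI and IP limits are published by AWS; prefix delegation or a different CNI changes the math entirely.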

The Iron Rose fucked around with this message at 01:42 on Sep 13, 2023

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
Unlike both Azure and GCP, AWS does not have a clean solution for zero-trust public access to RDS instances! You can sorta approximate it with SSM port forwarding to a bastion host, which sucks, and you also have to handle timeouts. There’s really not a great out-of-the-box service, especially compared to Azure Cosmos DB’s inherent identity proxy and the GCP Cloud SQL Auth Proxy.

AWS RDS proxy is a replacement for pgbouncer and other similar connection pooling services. It doesn’t do anything with regards to networking.

Sticking public IPs in an allowlist is neither particularly scalable nor especially secure. I really wouldn’t want to do this without an authentication proxy in front of the service.

In general AWS sucks on this front. They have a competitor service to GCP’s IAP/Azure AD App Proxy, but it costs a ridiculously huge amount of money. Something up 24/7 will cost you thousands of dollars a month, minimum. GCP/Azure’s offerings here are free!

The Iron Rose fucked around with this message at 07:02 on Sep 20, 2023

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Hed posted:

Is there a way to get more pods on the same hardware? Currently we are just buying the "most economical" from a cost per interfaces standpoint, but it's really crummy as our instances sit idle almost all of the time. Should we switch to nitro?

It sounds purpose fit to solve your problems, yes!

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
For both learning and professionally I would recommend that you use ECS rather than Beanstalk, because fundamentally Elastic Beanstalk is just opinionated ECS and those opinions may not match up to your desires or needs.

That being said, since your objective is to learn, I implore you to set up both options (with CI/CD and your infra as code tool of choice of course), see which you prefer and why, and then you have both in your portfolio. ECS is also much more commonly used at enterprise scale.

The Iron Rose fucked around with this message at 03:28 on Jan 9, 2024

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
Seconding the SAA cert, but try to get your new job to pay for any video lectures.

In the meantime, literally read every page of existing documentation on IAM, EC2, S3, and VPC. If your new job is heavy in EKS or ECS then do the same with, again, literally every word of documentation they have on the subject.

AWS’ docs are lightyears better organized than Azure’s and you’d need to do this for your exam anyways

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

FISHMANPET posted:

Identity Center is probably in the future, though how far in the future is certainly up for debate. The joys of a tech company still somewhat in "startup" mode that only recently hired an "IT guy" to get Okta going, for example. Who is not me, I'm the latest DevOps person. At least all of our IAM access is controlled via Terraform, so that feels better than nothing.

There is not an easy win here though, I think I'm just gonna pretend to forget about it for a little while and work on something else.

Implement identity center the second you get okta up. It’s not the future, it’s the present, and it is the recommended way to handle authentication by humans to AWS roles. It’s very easy to set up and extraordinarily useful for segmentation of privileges and ease of assuming the relevant roles. Roles are the primary mechanism you should typically be using for identities in AWS - whether human or service based. Death to IAM users. You can often literally get away with 0 of them.


The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

BaseballPCHiker posted:

What would be the best way to add a lifecycle rule to existing buckets in an account that don't already have one? I'm basically looking to add a rule to delete aborted multipart uploads in buckets.

My first thought was a lambda that would fire and add the lifecycle rule to buckets, but I don't necessarily know what would trigger that and how I'd put in logic to check for existing lifecycle policies. This is where being lovely with python really backfires for me.

Org-wide this isn't really a huge issue for us but somehow it's caught the attention of my bosses. Nevermind the thousands we waste in orphaned EBS volumes....

Lambda to enable across the board once (check for existing rules). Then use an eventbridge rule to trigger a lambda that will enable the lifecycle rule on new buckets that don’t already have one.

Or just run a cron and check for a lifecycle rule before applying your own, but that’s a bit inefficient compared to the eventbridge route.
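A minimal sketch of that lambda; the rule ID and the 7-day window are arbitrary choices here, and the existence check just looks for any rule that already handles incomplete multipart uploads:

```python
# Sketch: ensure a bucket has a rule aborting incomplete multipart uploads.

def abort_mpu_rule(days=7):
    """Lifecycle rule that cleans up incomplete multipart uploads."""
    return {
        "ID": "abort-incomplete-mpu",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # apply to the whole bucket
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
    }

def ensure_rule(bucket, s3=None):
    import boto3  # lazy import keeps the rule builder above testable on its own
    from botocore.exceptions import ClientError
    s3 = s3 or boto3.client("s3")
    try:
        rules = s3.get_bucket_lifecycle_configuration(Bucket=bucket)["Rules"]
    except ClientError as err:
        if err.response["Error"]["Code"] != "NoSuchLifecycleConfiguration":
            raise
        rules = []  # bucket has no lifecycle configuration at all
    # Leave buckets alone if any rule already covers incomplete uploads.
    if any("AbortIncompleteMultipartUpload" in r for r in rules):
        return
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules + [abort_mpu_rule()]},
    )
```

Run it once over `list_buckets()` for the backfill, then hook the same function to an EventBridge rule on `CreateBucket` (via CloudTrail) for new buckets.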
