12 rats tied together
Sep 7, 2006

Marvin K. Mooney posted:

Hey I'm fuckin dumb and have no idea what I messed up, hopefully this is the right place to ask questions.
I'm making a test site using CloudFront/S3 and I can't get them to cooperate. I have my simple site data in an S3 bucket, I made sure to enable static website hosting, and I made sure to enable read permissions
code:
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "AddPerm",
			"Effect": "Allow",
			"Principal": "*",
			"Action": "s3:GetObject",
			"Resource": "arn:aws:s3:::fart-bucket-content/*"
		}
	]
}
On the CloudFront side, I have a distribution with the origin
code:
fart-bucket-content.s3.amazonaws.com
, origin path blank, bucket access not restricted, default behavior is just "redirect HTTP to HTTPS" "GET, HEAD" everything else is default except I set compress objects to yes.
The issue I'm having is it seems to load up the main page fine, but when it accesses any directories in the fart bucket it comes back with Access Denied, even if there's nothing in the directory but an index.html file. What is going on? Is it a path problem? I tried appending /* to the origin path but then it wouldn't even load the main page. Sorry if this is a dumb question, I'm teaching myself as I go and it's a lot of trial and error.

This is kind of funky because directories aren't really a thing with S3. You could try changing s3:GetObject to s3:Get*, and then try copy pasting another statement section where the Resource is just the bucket ARN (without the /*)?

e: actually, I'm sorry, that fix is for an annoying gotcha when you are delegating s3 access to IAM users. It sounds like your problem is that you're trying to read a directory, when those aren't really a thing except as displayed through the s3 web interface. For one of the directories that just contain an empty index.html file, what happens if you try to just view index.html directly? Is that also an access denied?
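
If the bare directory URL is the thing that comes back Access Denied but /whatever/index.html loads fine, it's probably because the REST endpoint you're using as the origin doesn't do index documents at all; only the static website hosting endpoint does. In that case the usual fix is to point the CloudFront origin at the bucket's website endpoint instead, something like (I'm guessing at your region here, swap in wherever the bucket actually lives):
code:
fart-bucket-content.s3-website-us-east-1.amazonaws.com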

12 rats tied together fucked around with this message at 04:18 on Feb 10, 2017

12 rats tied together
Sep 7, 2006

You caught everything in the docs that I'm aware of (esp. sts:assumerole with restricted permissions). I would probably just attach the role to the cluster, run the unload, and then remove the role if you absolutely cannot leave the role on the cluster for some reason. Clusters can have multiple roles attached so it's not like you're potentially clobbering someone else's credentials. I also don't think you actually have to put the credentials in the query text - just the role ARN (which isn't powerful to have or know).
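
If you do go the attach -> unload -> detach route, it's only a couple of API calls -- rough sketch with boto3 below, with a made-up cluster name and role ARN (the UNLOAD itself you'd run through whatever SQL client you already use, referencing just the role ARN):
code:
import boto3

redshift = boto3.client("redshift")
ROLE_ARN = "arn:aws:iam::123456789012:role/unload-to-s3"  # hypothetical role

# Attach the role alongside whatever is already on the cluster.
redshift.modify_cluster_iam_roles(
    ClusterIdentifier="analytics-cluster",
    AddIamRoles=[ROLE_ARN],
)

# ... wait for the modification to finish applying, run the UNLOAD with
# IAM_ROLE '<role arn>' in the query text ...

# Detach it afterwards if you really can't leave it on the cluster.
redshift.modify_cluster_iam_roles(
    ClusterIdentifier="analytics-cluster",
    RemoveIamRoles=[ROLE_ARN],
)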

I think the only permissions leak is that you would be allowing the cluster to indefinitely write to a location in s3 that you care about? You can probably work out something sufficient with s3 permissions to prevent other users from being able to do the same. This is pretty awful but you could copy your data from /ingested into /approved. Only allow your credentials to write to /approved, and only give the cluster role permission to write to /ingested.

Generally, though, things are a lot easier when you can trust most other entities in your account.

12 rats tied together
Sep 7, 2006

I pushed terraform in my current role after a brief argument where I tried to convince everyone that cloudformation + ansible was better. It's really awful; I wouldn't even say that it's catching up to cloudformation, it's actually getting worse IMO. For context, we're using it for basically every AWS region, AWS China, Alibaba cloud, and a few other ancillary services.

I'm sure if you're just using it for a relatively simple stack, or (better yet), if you don't have to manage the entire AWS account and you just get a state path from your operations team, it's wonderful. If you're in more of a traditional role where you manage the entire account alongside a security team, definitely go with cloudformation.

If you are stuck using Terraform you can just use cloudformation anyway: https://www.terraform.io/docs/providers/aws/r/cloudformation_stack.html and you should totally do this every chance you get.

edit, if you spend any serious amount of time with Terraform you'll find that:
  • Properties are not named consistently (e.g. iam role policy attachment takes a role name, while iam role policy takes a role id; these both refer to the same piece of information, which is named something different in each resource for no loving reason).
  • What happens when you change a property is not documented at all (compare to the 49 instances of "update requires" on the AWS::EC2::Instance docs, with links to exactly what each type of requirement means).
  • Cloudformation often makes live changes in a more HA fashion than Terraform: compare changes to EC2::SecurityGroupIngress with changes to an aws_security_group resource. Terraform deletes every rule from the security group and recreates them all -- and you'll have to actually run terraform in debug mode (or look in cloudtrail logs) to figure out that it does this, because it is completely invisible until you push go.
  • If you create resources in a list using var.count and then later extend that count, any resource that depends on the full list (for example: creating ec2 instances, and then later using the instance IDs in something like a load balancer attachment) sees that "list of each instance's instance id" become a "computed list", which means Terraform will literally, in this example, detach all of the instances from the load balancer and then reattach them. You won't find this in the docs; you have to find the github issue where apparentlymart explains the behavior and gives you a workaround. In this case the workaround is also not in the docs, and is extremely unintuitive: select from your array with "[]" instead of using "element()".
  • CloudFormation has the special value AWS::NoValue which basically cancels out an instance of an argument. This is really useful if, for example, you want an ELB that sometimes has an HTTPS listener and therefore sometimes has a certificate ARN. The problem in Terraform is that the parameter cannot be null when you don't have a certificate arn, and you cannot use the ternary that Terraform recommends, because if your expression has an invalid reference (to a variable that is null or undefined), Terraform will helpfully evaluate both sides of the ternary and error: https://github.com/hashicorp/terraform/issues/11574, https://github.com/hashicorp/terraform/issues/5471
  • Terraform is extremely anti-DRY. Large portions of the config explicitly do not support interpolations of any type. For example, you cannot set a "default module version" somewhere and track all of your module invocations with it. You do a loooot of find+replace with larger, more complex Terraform configs. You get to do this work twice too because someone has to review all of your changes.
  • Terraform does not output JSON versions of plan. If you want to, say, write an integration test that makes sure every EC2 instance being created has a certain tag, you get to write a script that parses terraform plan output as displayed in a terminal (at least you can use the -no-color flag!) -- see the sketch after this list. Compare to CloudFormation changesets for a version of how this feature would work if not implemented by morons.
  • null_resource exists and is a recommended approach for dealing with certain types of problems, which is just mind boggling to me. Why would I ever want my declarative configs to enforce the existence of something that can't exist?
  • There is no mechanism for triggering updates across state boundaries -- you cannot say that updates to state "networking" should notify updates to state "load balancers" to pick up new outputs. You have to define these yourself, using contextual knowledge, and then you have to remember to run apply from the right folders in the right order.
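
For illustration, the plan-scraping integration test that bullet is complaining about ends up looking roughly like this -- a deliberately naive sketch against the pre-0.12 text output, with the resource and tag names invented:
code:
import re
import subprocess

# Capture the human-readable plan; -no-color keeps it parseable.
plan = subprocess.run(
    ["terraform", "plan", "-no-color"],
    capture_output=True, text=True, check=True,
).stdout

# Naively grab each "+ aws_instance.<name>" block and check it for an Owner tag.
for match in re.finditer(r"\+ aws_instance\.(\S+)\n(.*?)(?=\n\n|\Z)", plan, re.S):
    name, body = match.group(1), match.group(2)
    assert "tags.Owner" in body, f"aws_instance.{name} is being created without an Owner tag"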

These problems are in addition to the fact that Terraform often lags behind cloudformation in terms of what resources it supports. It's really, really hard to recommend terraform and I sincerely hope you do not use it if you can avoid it. So much so that I wrote this huge effort post.

12 rats tied together fucked around with this message at 00:54 on May 30, 2018

12 rats tied together
Sep 7, 2006

Totally agreed, however the hashicorp enterprise team that got in touch with us was really confused when we told them we had a state monolith and repeatedly referred us to the documentation and best practices guides where they say to create many small states. We even had some questions specific to our state monolith (such as: how are workspaces as a feature actually supposed to function?) that they would not answer until we changed our repo to be something they could actually comprehend.

It's really easy to gently caress your state up, really easy to squash changes someone else made, and if you work in an environment where terraform is optional (god help you if so), you get to see every instance of a manual change someone ever makes to something that was ever in TF.

It's still better than using modules the way they tell you to though.

12 rats tied together
Sep 7, 2006

Blinkz0rz posted:

I'm surprised so many folks have had trouble with Terraform. We went the Cloudformation route with our own custom DSL and it's been a nightmare to build out and maintain. We began migrating to Terraform and more than 1/2 of our accounts are switched over and man, it's been like night and day.

YAML anchors exist, so you could, and totally should, make your YAML DRY. There are tons of really good reasons to make your infrastructure-as-code DRY. Agree that Terraform is not code, but it not being code is basically a regression from other systems (including Cloudformation) that either are functionally pseudocode or are generally augmented with an orchestration tool that turns them into actual code. For example: Ansible, Sparkleformation, an actual templating engine, or any of the number of DSLs that exist for Cloudformation.

I've spoken with a number of people about the perceived limitations of Cloudformation, and have been involved in protracted discussions at two employers now about the relative merits of both TF and Cloudformation, and my sincere belief is that a struggle with Cloudformation is genuinely a lack of knowledge or poor tooling more than it is a problem with the service itself. Working purely with Terraform in a reasonably complex environment for the past 14 months or so has only cemented that belief. Deciding to write your own DSL is a huge, huge red flag for me, so unfortunately I'm not terribly surprised it didn't go well for you. Are you sure it was Cloudformation and not just poor implementation? Do you have any details?

If you're logging into the web interface and clicking Create New Stack From Designer, I agree it's a pretty terrible service. If you're taking it any further than that, though, it's seriously best in class by a fair, fair margin. Especially compared to Terraform. IMO anyway.

edit:

quote:

I'll shill kitchen-terraform here: https://github.com/newcontext-oss/kitchen-terraform . This is a set of Test Kitchen plugins that lets you test Terraform changes by converging them in a separate workspace and testing the results with any RSpec-based tests (Inspec, Serverspec, AWSpec, etc.)
It's come up a few times for me at work, I'm not wild about it. AWSpec is interesting in theory but kind of a waste of time IMHO. I'm sure it's useful in some contexts but in my experience anything I can verify with kitchen-terraform I'm already inheriting or reading from an environment.

What I actually need to test for is stuff like: Does our us-east-1e have capacity for this instance type? Does this route table change break connectivity from service A to service B? Will this PR make us exceed our VPN connection account limit? Are we out of available addresses in this subnet? Is merging this PR going to make secops really really mad because we broke a compliance control?

More broadly I think I would say that I just don't think unit tests are useful for infrastructure as code. IMHO canary environments, rolling deploys, and really robust monitoring and alerting give you way higher ROI past a certain point of complexity, and unit tests generally just become a waste of time. I'm really interested in being proven wrong here though.

12 rats tied together fucked around with this message at 05:36 on May 30, 2018

12 rats tied together
Sep 7, 2006

Erwin posted:

Unit tests is inaccurate, but that's not important. Again, kitchen-terraform is geared towards testing your reusable modules, which if you have a Terraform configuration "past a certain point of complexity" you should be doing. You can certainly also test your base state-producing config with kitchen-terraform, but it's easier if the tests mainly focus around each module in their own repo and you're mainly testing your base configuration for successful applies.

I do actually appreciate your point by point breakdown, thank you -- however reusable modules in terraform is kind of a running joke for me in my 9-5 right now. It's not that they're impossible, it's just that "past a certain point of complexity" the tooling literally falls apart.

Here's my favorite from since the last time I posted in this thread: https://github.com/hashicorp/terraform/issues/9858#issuecomment-263654145

The situation: a subnet module that stamps out boilerplate subnets, sometimes we want to create between 1 and 6 nat gateways, other times we want to pass in a default nat gateway, other times we don't want to create any nat gateways at all. A couple of problems we've run into so far:

Both sides of a ternary get executed regardless of which one is actually used, so in a situation where you are creating 0 nat gateways, your aws_nat_gateway.whatever.*.id list is empty, and element() and friends will fail on it because you can't pull a list index out of something that is null.

Coercing something that is null into something that is a list by appending a dummy string into it and incrementing your index grabber by 1 every time doesn't work if you need to wrap this list in any way (since you would increment past the list boundary and then end up grabbing the dummy value). Explicitly casting it to a list might work, but both sides of a ternary _must_ be the same type, so you can't be like "this value is either (string) the default nat gateway, or index 0 of (list) the actual list of nat gateways, or (list) the fake empty list I made so you wouldn't error for no reason."

Basically we have like 50 instances of "join("", list(""))" and "element(join(split(concat())))" in all of our "reusable modules", and the project has gone from "hey wow this syntax is kind of messy sometimes" straight to "this is unreadable garbage that is impossible to maintain and we're not doing it anymore". For a CloudFormation comparison, you would just use AWS::NoValue when necessary and then be able to actually do your job without spending a full 1/3rd of your day combing through github issues from 2016.

12 rats tied together
Sep 7, 2006

Votlook posted:

What is a good way to manage ssh access to ec2 servers?

The way I've seen this done in the past is cloud-init at launch to get a base set of keys on the instance, and then ansible playbooks take over and ensure that authorized_keys lists are up to date for everyone or everything that should be using the machines.

In an ideal world you would just put in whatever key your closest compliant AWX server uses and call it a day. If you have a new hire on a team that should have admin access to a server, that new hire can just trigger an AWX playbook run from master after merging in their public key. If you aren't using AWX, you just have whoever is helping onboard them run master after merging in the new public keys.

IMHO this particular situation is like one of the textbook reasons not to go overboard on baking amis for every type of change. I actually hate baking amis a lot.

12 rats tied together
Sep 7, 2006

https://www.hashicorp.com/blog/terraform-0-1-2-preview

4 years later we get an announcement that there will be a for loop, later this summer. Nice.

12 rats tied together
Sep 7, 2006

If you're comfortable splitting up your script into containers that can run simultaneously I would recommend fargate / ecs as a good middle ground between something super heavy like Glue/DataPipeline -> S3 -> Athena and something super lightweight like running a bigger ec2 instance.

Another alternative would be writing a lambda function that recursively calls itself until it's done, but you should be careful with those if the aws account is on your own dime.
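
The recursive lambda shape is roughly this, if you go that way -- sketch only, with a made-up chunking scheme, and you'd want a hard stop in there so a bug can't invoke you into a surprise bill:
code:
import json
import boto3

lambda_client = boto3.client("lambda")

def handler(event, context):
    cursor = event.get("cursor", 0)
    total_chunks = event.get("total_chunks", 10)

    # ... process the chunk of work at `cursor` ...

    # Re-invoke ourselves asynchronously if there's more to do.
    if cursor + 1 < total_chunks:
        lambda_client.invoke(
            FunctionName=context.function_name,
            InvocationType="Event",
            Payload=json.dumps({"cursor": cursor + 1, "total_chunks": total_chunks}),
        )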

12 rats tied together
Sep 7, 2006

Definitely recommend not using AWS if you can avoid it.

If you must use AWS I'd also really suggest you start with the managed services like beanstalk, emr, athena, redshift, etc. I've joined a few orgs now where several years of effort have gone into reinventing "basically _____ but worse" and it's always a nightmare mountain of technical debt and team silos.

If you feel like you can't use whatever the managed service is for your use case it's always worth engaging your TAM / support team and confirming your suspicions. Generally I've had good experiences with account management staff being upfront about "yes, x service will not work for your use case at this time, but we have y,z feature requests open and we will keep you updated".

12 rats tied together
Sep 7, 2006

Agrikk posted:

Yes, please go encourage others to spend business capital on non differentiated work in a dev-shop. Stand up your VMs and your email and storage and then hire people to manage that stuff in that office of 3-4 developers.

To clarify, I'm coming from the mindset that in an ideal situation your company pays for an internet connection and then magically makes money from the internet. All tech is essentially tech debt in some way or another, if you can run your entire business profitably on top of zendesk cloud, g suite email, and google docs/sheets you should absolutely 100% do that instead of spending a single dime in AWS.

That being said we are in the cavern of cobol so this might not have been as obvious an assertion as I'd liked. If your business must write code that is executed by compute, AWS is easily the best place for it to run unless you have some very specific needs. Just to be clear. :shobon:

12 rats tied together
Sep 7, 2006

CloudFormation is frankly awful if you're using it by itself without any kind of helper or document templating. There are a bunch of open source projects that can help though: troposphere, sparkleformation, ansible, etc. For a single account, AWS only environment I would really recommend going with CloudFormation + a helper of your choice.

If you only want to learn one tool, Terraform is not so bad now that they've released 0.12. You will undoubtedly run into some problems with it, but so did the rest of the internet, so it's not too hard to find information or advice. The documentation now is also a bit better than it was last year.

12 rats tied together
Sep 7, 2006

necrobobsledder posted:

There is absolutely nothing like Terratest out there for Troposphere code though

This is an interesting point and, in my experience anyway, it seems like it's not something people really consider when discussing IaC tooling: If your tool is a language, a library, or some attempted combination of both, it's going to be really hard to write tests for it. If your tool is a markup language, it's comparatively trivial.

This is why I generally do not recommend Terraform, IMO it's a "worst of both worlds" approach where you get almost none of the benefits of an actual programming language, but the markup language itself is also worst-in-class by just about any measurement you might care to take. When you use something like Ansible + CloudFormation you have complete control over every stage of your abstraction -> markup rendering phase. It's absolutely trivial to test anything about it, even using something as simple as assert.

You can use assert to perform preflight checks on your input data like "assert that the various name slugs, combined, do not exceed 64 characters", rendering to CloudFormation template just creates a yaml document which you can yaml.load and do your thing, and then you have all of the normal ways of testing AWS infrastructure: test accounts, environments, alarms, etc.
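
As a trivial sketch of what I mean (assuming the render step emits long-form intrinsics so plain yaml.safe_load can read the output, and with all of the names and the path invented):
code:
import yaml

env, app, component = "prod", "billing", "worker"
name_slug = f"{env}-{app}-{component}"
assert len(name_slug) <= 64, f"combined name slug is too long: {name_slug}"

# The rendered template is just a YAML document, so structural checks are easy.
with open("rendered/app-stack.yaml") as f:
    template = yaml.safe_load(f)

for logical_id, resource in template.get("Resources", {}).items():
    if resource.get("Type") == "AWS::EC2::Instance":
        assert "Tags" in resource.get("Properties", {}), f"{logical_id} has no Tags"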

By comparison, you couldn't even get plan output as json from terraform until earlier this year (you had to parse shell command return codes which is comedy gold for IaC tooling), and the json plan output is frankly insane compared to CloudFormation change sets.

Basically: the more you try to abstract away the part where you have to actually transform your intent into serialized resources, the harder it becomes to do simple stuff like "hey make sure this alarm doesn't fire after you roll those new ec2 instances". To me the absolutely gigantic terratest readme looks more like parody than actual tooling.

I agree that for OP's use case it really doesn't matter which one they pick though.

12 rats tied together fucked around with this message at 17:03 on Aug 1, 2019

12 rats tied together
Sep 7, 2006

Fair point about tfjson (and friends) -- if you don't mind integrating multiple third party tools you can get around some of the awfulness of using Terraform in production. I'm a big fan of landscape, personally (especially for the OP who is considering tools still).

Re: horrific release parties, I can't say I've ever experienced that, but like I said previously I've never been a huge fan of cloudformation and only cloudformation. What you describe with containers and relying on cloudformation/terraform only for base infrastructure matches my own experiences, but I don't think it's specific to containers, you can get all of the modern niceties using only ec2/cloudwatch/route53 by using ansible -> cloudformation -> ansible.

I've talked a lot about it without giving a concrete example, so I'll try to briefly illustrate without creating a post longer than a page:
code:
ansible/
  playbooks/
    aws-$account-$region-base.yaml
    roles/
      aws/vpc-setup/
        tasks/, vars/, templates/, etc
      $app/$component/
        tasks/, vars/, templates/, etc
Each account-region's base.yaml is a playbook that calls a series of roles: vpc setup, per-application setup, ancillary service or config setup, etc. They also usually contain a series of preflight checks, sanity tests, etc that are implemented as pre_tasks so we can ensure they always run before pushing changes. Big one here is usually making sure you're targeting the correct account, since that is handled outside of ansible.

Items in roles/aws/ configure aws primitives. The expectation is that they are more or less like a terraform module -- if you need 3 vpcs, you call "vpc-setup" 3 times and feed it 3 sets of parameters. These use cloudformation to provision themselves and track state using "register". A service that runs on aws infrastructure will provision resources through a role in this subfolder. Often these fire a number of post-assertions to make sure we didn't blow something up in a really obvious manner.

Items in roles/$app/$component configure services and servers, or perform other ad-hoc config that is not supported by cloudformation (example: permanently disabling an autoscaling process).

Inside the playbook we track the result of cloudformation calls into a series of vars which are available inside the application stack roles. The overall play strategy, ordering, chunking, etc is handled in the playbook, so the scope of working on a role is intentionally kept extremely limited. The playbook is also where all of the control logic lives -- your typical "canary deploy -> if metrics stabilize do a blue/green in chunks of 20% -> otherwise rollback the canary and fire a grafana annotation/slack message", typical deployment logic which ansible supports extremely well.

The result is much better to work with than anything you can do in Terraform -- the entrypoint to infinitely many terraform state folders and module dependencies is a single playbook. Ansible role dependencies are well engineered, simple to understand and debug, and the concept of an "ansible role" is well geared towards managing _any_ type of multi-resource/multi-dependency environment (not just servers as the name would have you believe). This is where you put all of your (well phrased) surgical and cross-functional concerns and logic.

You get all of the benefits of working with modules with all of the benefits of working on a monostate, you can trivially debug, inspect, and short circuit any stage of any process including the ability to drop into an interactive debugger in the middle of a playbook run. Lastly, you also get Terraform Enterprise for $completely free through the open source version of Ansible Tower, which runs the exact same playbooks and automation in the exact same way. Working like this you get the scalpel and the bulldozer, which is awesome because you really do need both.

12 rats tied together
Sep 7, 2006

necrobobsledder posted:

Like 5 posts in we're at Defcon-1 CI / DevOps / K8S land rather than being AWS-specific

This is an interesting statement. IMO the process I outlined is a framework for sanely managing complex relationships between dependent resources that mutate over time. I think calling it CI / DevOps / K8S land is kind of dishonest given that it can manage AWS RDS instances, AWS EMR clusters, and other extremely stateful pieces of infrastructure that simply cannot exist in K8S, and have no place in a CI / "DevOps" toolchain except as an environment variable value.

I think, if you're doing IaC right, your AWS resource management is almost indistinguishable from your K8S resource management, but effectively managing an AWS account and effectively operating a K8S cluster are still two different things. IMO, anyway.

For what it's worth: k8s cluster provisioning, deployment, bootstrapping initial services, and applying addon services in my current role is exactly the same as provisioning an AWS resource stack except you replace instances of cloudformation with instances of k8s. It's a fantastic workflow and I highly recommend it.

StabbinHobo posted:

the only useful info in those answers was "hasn’t changed for like 3 years" (and therefore I should probably just follow the old blog posts).

Bhodi is absolutely right that you're going to have to pick a tool and get started before we can go into any more detail, but there are a lot more details to go in on. I would not recommend starting with blog articles as a general rule though.

12 rats tied together fucked around with this message at 15:43 on Aug 7, 2019

12 rats tied together
Sep 7, 2006

StabbinHobo posted:

whats the right ci/cd pipeline setup for IaC/cloudformation work?

You're right, you did name a tool, I'm sorry for missing that.

IMO, don't do CD with CloudFormation. My very specific toolchain works well by using a mixture of create_change_set and validate_template for CI along with the aforementioned ansible assert.

Change sets have an "execute" button you can click manually. IMO including a link to the change set with a pull request and clicking "execute" manually is superior to automatically executing your change set from a CD agent, but most CI/CD tools have a concept of build (create change set) vs deploy (execute change set) that you can use too. If you have a compliance thing involved here you can restrict execute_change_set permissions to your CD agent and have a documented workflow where an engineer reviews a changeset and approves a PR to kick it off, auditors are usually pretty happy with this.
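
The CI half of that is only a few boto3 calls -- sketch below, with invented stack/change set names and the template body coming from whatever your render step produces. Create and describe the change set, surface the diff in the PR, and leave execution to a human (or a separately-permissioned deploy step):
code:
import boto3

cfn = boto3.client("cloudformation")

with open("rendered/app-stack.yaml") as f:
    template_body = f.read()

change_set = cfn.create_change_set(
    StackName="app-prod",
    ChangeSetName="review-me",        # hypothetical, e.g. derived from the PR
    TemplateBody=template_body,
    ChangeSetType="UPDATE",
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

cfn.get_waiter("change_set_create_complete").wait(ChangeSetName=change_set["Id"])

# Dump the proposed changes so they can be pasted into / linked from the PR.
for change in cfn.describe_change_set(ChangeSetName=change_set["Id"])["Changes"]:
    rc = change["ResourceChange"]
    print(rc["Action"], rc["LogicalResourceId"], rc.get("Replacement", ""))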

Actually choosing a CI/CD tool depends more on what you're using for version control than anything else in my experience. Gitlab runners are popular and work reasonably well, CircleCI on github is okay too. I've used TravisCI and I was not a fan but it was functional. I don't think you can go wrong here unless you tried to write a ci service yourself that manually consumes github web hooks.

e: If you haven't done cloudformation before I would recommend taking a look at stack policies at your earliest convenience and to start thinking about them ASAP.

In my experience the "I would like to run tests on my IaC" desire is kind of a version of the xy problem where the actual desire is to be able to safely make changes to production resources, and it's manifesting as a stated desire for a comprehensive test suite.

You can skip a lot of the landmines involved in accurately writing tests for IaC by just outright blocking actions that would cause downtime. There's no situation where a routine update should take down a database, so everything involved in the access pattern for your database should have a stack policy that blocks Update:Replace and Update:Delete, and now you don't need to write a "the database should not become unreachable during updates" assertion and verify that its true in a test environment every time you merge a pull request.
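
A minimal sketch of that kind of stack policy (logical ID and stack name invented; you can also attach it at stack creation instead of with set_stack_policy):
code:
import json
import boto3

# Allow routine updates everywhere, but block replacement/deletion of the database.
stack_policy = {
    "Statement": [
        {"Effect": "Allow", "Action": "Update:*", "Principal": "*", "Resource": "*"},
        {
            "Effect": "Deny",
            "Action": ["Update:Replace", "Update:Delete"],
            "Principal": "*",
            "Resource": "LogicalResourceId/ProductionDatabase",  # hypothetical logical ID
        },
    ]
}

boto3.client("cloudformation").set_stack_policy(
    StackName="app-prod",
    StackPolicyBody=json.dumps(stack_policy),
)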

12 rats tied together fucked around with this message at 19:02 on Aug 8, 2019

12 rats tied together
Sep 7, 2006

Thermopyle posted:

1. One thing that I think could be better is that the project has a redis server acting as a task queue and python workers running on one instance. If i understand correctly, if I'm using AWS, I should probably move those python workers over to Lambda, no? Then I can just eliminate redis and replace the code that sends tasks to the workers via redis with code that starts lambda tasks (or whatever the lambda terminology is)?

2. Some of these instances call HTTP endpoints on other instances via public dns addresses...I should just use the VPC local address, right?

1) You can do this (replace an ec2 task worker with triggered lambda functions) but you don't have to. Make sure that none of your workers need to run a task for longer than the function timeout. While you're reading these docs, make sure that you won't exceed any of these other limits as they can't be increased. In particular, be aware that concurrent function executions are shared per-account, per-region, and by moving these workers to lambda you're sharing that constraint amongst them and all other functions that execute in your account.

A robust task/worker queue implemented in ec2 is functionally not very different from sqs/lambda, and there are some benefits to doing this yourself that can't be replicated on lambda. In the past I have never been able to responsibly justify a migration here. You can switch redis from ec2 to elasticache if it isn't already there, that's trivial to do and is a huge reduction in management burden without also possibly springing a fundamental tech change on a software team who doesn't particularly care and just needs to schedule business logic.

2) Yes, the private IP address of the instance or its private hostname if you prefer, they're basically the same thing. This will be faster, possibly cheaper, and definitely easier to maintain moving forward.

Scrapez posted:

[...]
The two VPCs that the two mentioned subnets are in have a VPC peering connection established between then but not sure that would have any impact on adding route for NAT gateway.

I'm guessing it's something glaringly obvious but I'm not seeing it. Anyone help me out?

Docjowles is correct in that you cannot use one VPC as a transit network for another. If you have two VPCs here, you need two NAT gateways. They describe this in the documentation but you kind of have to dig for it: link. If I'm understanding what you want to do correctly, you're trying to run the shown invalid edge-to-edge configuration. Please let me know if that's not the case though, you might indeed be missing something silly because the UI for nat gateways is kind of weird.

12 rats tied together fucked around with this message at 21:33 on Aug 15, 2019

12 rats tied together
Sep 7, 2006

Permissions boundaries should work for you, unless I'm really missing something here?

You would basically:

  • Create a managed policy that blocks all of the stuff you want blocked (any action, not just iam:Create*), these have to be Explicit Deny statements though
  • Create a permissions boundary using/bound to that managed policy
  • Apply a policy to all of your applicable principals that allows iam:Create* with a Condition block that specifies that a permissions boundary must be attached, and it must be the boundary you created earlier

Your users can't create users or roles that do not have your permissions boundary attached. The boundary contains your explicit deny blacklist, which supersedes any explicit allow statements that your users include. They can't detach the boundary, and they can only create entities with the boundary attached, so your explicit deny blacklist is always present and always takes precedence over any policy your users configure.

They can still create whatever policy they want, with any content they want, they just will never actually get permissions that exceed what you configure for them initially. This should all be doable in the console.
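
A sketch of the policy from the last bullet (account id, names, and the exact action list are invented -- the important part is the iam:PermissionsBoundary condition key):
code:
import json
import boto3

BOUNDARY_ARN = "arn:aws:iam::123456789012:policy/team-permissions-boundary"  # hypothetical

# Principals can create IAM entities, but only ones that carry our boundary.
delegated_iam_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["iam:CreateUser", "iam:CreateRole"],
            "Resource": "*",
            "Condition": {"StringEquals": {"iam:PermissionsBoundary": BOUNDARY_ARN}},
        }
    ],
}

boto3.client("iam").create_policy(
    PolicyName="delegated-iam-with-boundary",
    PolicyDocument=json.dumps(delegated_iam_policy),
)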

12 rats tied together
Sep 7, 2006

I barely know how to use the web ui for AWS because my primary interface to it is text editor and terminal. The services where the UI is part of the value add like logs (and insights), EMR, lambda are all great, except DataPipeline which is garbage.

Maybe this is a bit of a hot take but I would never touch an SSM parameter in the UI -- I'd probably end up explicitly blocking that on our admin users if it ever comes up. Basically I think you should treat your AWS account like a database: every real change should be applied through a tagged migration or some facsimile. The interface is only useful for its ability to colocate bits of relevant information and for letting people poke around in object/log storage.

12 rats tied together
Sep 7, 2006

I do, yes, but I would only specifically take issue with modifying a parameter in the UI.

I see the issues with systems manager (no longer "simple", iirc) as basically fallout from the wider industry going really hard on Chef/Puppet, AWS expanding their offerings to include OpsWorks, and then everyone remembering that managing server config is only like a third of the problem space for infrastructure engineers (and it's not really the hard part either). It's probably a serviceable orchestration tool but I wouldn't know, I've never needed to use it.

Ansible Tower is the closest thing to a paid product that replaces systems manager that I'm aware of, but I would describe it as worse or at least equally bad. Similar to EMR/DataPipeline, I think this is just one of those areas that is fundamentally difficult to communicate through a UI.

12 rats tied together
Sep 7, 2006

a hot gujju bhabhi posted:

Not AWS but hoping someone can help. We have a Varnish server configured to cache requests and behind that we have an Azure load balancer that balances between 3-4 VMs depending on requirements. The problem is that something about the Varnish server being there is causing the load balancer to go stupid and it seems to be confusing the traffic as one visitor and sending it all to the one VM. In other words, it doesn't seem to know or care about the X-Forwarded-For header when determining where to send requests.

Am I right in this assumption? Is there any way to configure the load balancer to ignore the client IP and use the X-Forwarded-For header instead?

There are a lot of different ways to load balance traffic but commonly you'll see a load balancer perform some kind of source NAT on incoming traffic, replace the destination IP on it with a selection from its available targets, and then forward it along.

The target receives the traffic and perceives it as originating from the load balancer on an IP address level -- almost all of the time this is a good thing. Your target will respond back to the load balancer which usually implements some kind of connection-level tracking and caching and the load balancer does the same thing again: switches the source IP on the traffic to itself, replaces the destination IP, and forwards it back to whoever sent the original request.

If you're load balancing traffic like this you actually want the client IP regardless of the X-Forwarded-For header, those headers are usually application specific and outside of some specialized use cases you generally don't want your load balancers inspecting them.

If you're seeing your requests through your load balancer not actually being balanced and you've confirmed that you aren't intentionally doing this by setting sticky sessions or similar, you should probably start by answering 2 questions: Are all of your targets healthy, and what algorithm is the load balancer using to balance traffic? It looks like azure load balancers default to a 5-tuple hash based algorithm? The linked page has better documentation but the short version of this is that any time any attribute of your traffic changes, you should get a new backend host.

For something like varnish initiating requests to backend servers through a load balancer, each individual request should have a different source port, the source port changing is what should get you a new backend host. You should be able to find out whether or not this is happening pretty easily by tcpdumping from your varnish host and looking at the outbound traffic.

12 rats tied together
Sep 7, 2006

The "deny notlike" is super giving me a headache. I would suggest turning it into an "Allow, StringLike" if you can.

12 rats tied together
Sep 7, 2006

That sounds fine to me. An alternative would be distributing credentials or otherwise allowing direct access to cloudwatch.put_metric_data() in the relevant AWS account, which IMO would be preferable if all of the clients were things that you controlled.

It may seem like a lot of moving parts but api gateway to lambda function is standard enough that it won't especially be an ongoing support nightmare or cause other people who may work in your AWS account too much annoyance.

12 rats tied together
Sep 7, 2006

We attach tons of customer managed and aws managed policies to stuff all the time. It's better IMO because it decouples the creation of the policy from the application of the policy to a principal. For auditing, for example, you can have controls around the contents of the policies (especially, restricting updates to them) and you can have separate controls and permissions around attaching them. To verify that machines are meeting compliance, you just list the attached policies and compare. If you do any work with complex cross-account permissions you'll probably end up doing stuff like this anyway with permissions boundaries, so the possibility of code re-use is pretty high.

Embedding policies is something we also do, but we do it with jinja2. It sucks in exactly the way you describe and I found the list of attached, managed policies to be a lot easier to work with, but it's very possible to just render a policy document from some other source.

CyberPingu posted:

Ah ok might be better asking there but I'll pop it here too.

We build our infrastructure as IaC, using terragrunt/Terraform. Ive just finished building a module that creates cloudtrail and associated logging for CT. If I wanted to use a log group that gets created by that module when running terragrunt how would I go about that? Would it need a dependency?

Just to double check, are you familiar with terraform outputs and how you consume them from modules? I don't use terragrunt, but I took a quick look at the docs, and it seems like consuming module outputs is basically just normal terraform.

12 rats tied together
Sep 7, 2006

Pile Of Garbage posted:

Basically what dividertabs said:


I had a bunch of CFN stacks deploying EC2 instances and I really needed the CFN to gently caress-off because it was bad and a risk so I enabled API Termination Protection on the EC2s, deleted the stacks which failed and then deleted a second time selecting to retain the EC2 instances. Janky af but it worked.

It was a situation born out of a rushed project with insufficient time to design or prototype things.

This would be a variety of yellow flags for me in my employer's prod account.

The first one is that there should be no situation where an ec2 instance can't be arbitrarily deleted and recreated. Any application running in aws needs to be able to tolerate the random and immediate loss of at least one ec2 instance -- it doesn't have to be automatic recovery or even anything more fancy than 'restore snapshot or move volume to new ec2 instance'. Enabling a configuration option to intentionally cause another service to fail is probably the least operationally sane way to accomplish getting something out of a stack.

The best way to get this done, IMO, assuming you can't just reprovision the ec2 instance for some reason, would be to add a DeletionPolicy to your cfn resource(s) that specifies retain and then to delete the stack. Most importantly this gets you git history, but it also makes your aws logs cleaner and easier to audit if necessary because you have a single successful operation against a resource in a known state instead of a variety of failed api calls and overrides made through a proxy service.

While reading the documentation for DeletionPolicy you would also probably come across the UpdatePolicy specification which could address any concerns you have over the cloudformation stack being a risk. CloudFormation is an extremely safe and reliable service if you take the time to read about all of the functionality it provides.

12 rats tied together
Sep 7, 2006

For Dynamo scaling the best thing would probably be to have your upstream team post a "we're getting started now" event and use that to preemptively scale your table up. You would be paying extra to prewarm, but not 24/7 extra.
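
The prewarm half of that is simple enough to sketch -- something like a small function subscribed to that "we're getting started now" event (table name and capacity numbers are made up, and if the table uses autoscaling you'd raise the scaling target's minimum instead of calling update_table directly):
code:
import boto3

dynamodb = boto3.client("dynamodb")

def handler(event, context):
    # Upstream says they're about to start; scale the table up ahead of the load.
    dynamodb.update_table(
        TableName="ingest-results",    # hypothetical table
        ProvisionedThroughput={
            "ReadCapacityUnits": 200,
            "WriteCapacityUnits": 2000,
        },
    )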

You could also drop EMR and manage your own spark cluster that you hibernate. I don't know enough about spark cluster internals to determine if this is better than EMR or just different bad.

It kind of seems like you want all of the benefits of unlimited scaling available to you instantly with no prewarm duration and no extra costs which isn't super realistic in today's AWS IMO.

Have you tried any of the redshift features in this area? Redshift Spectrum your stuff from S3 directly into a table or a view or something and then convert your clients from Dynamo calls to Redshift queries? I'm not sure if the redshift parquet feature is faster or slower than EMR though, I've never used it.

12 rats tied together
Sep 7, 2006

Redshift is basically a specialized form of clustered postgres, so you have most of what you can do in postgres available to you, and most of what you would expect from SQL like basic math and time and date stuff; you can find a better list here. It also supports user defined functions which you can write in python or sql.

Redshift clusters also run 24/7 and have significant operational overhead compared to a dynamo table. You need to pick a dist and sort key, and if you pick a bad one your queries are going to be significantly slower and might exceed that <100ms threshold. I would suspect that you could probably get your reads to be "about as fast, maybe faster/better" if you can do any sort of batching and caching instead of needing to pull single keys truly at random; redshift does do some pretty decent query response caching, though, and that is enabled by default, so repeated calls to the same key would be reasonably quick.

12 rats tied together
Sep 7, 2006

it's a little unfair to consider govcloud and aws china as even being the same business, IMO, but yeah they are both pretty bad

my one bad experience with the docs to date, about 5 years working with aws, was when the site to site VPN docs specifically recommended you use the same BGP ASN on all of your spokes in a hub/spoke topology which ended up causing a bunch of routing issues that I was not mentally equipped to understand at the time

I filed a support ticket and they got back to me instantly after having identified the problem, apologized for the documentation error, and fixed it right away

12 rats tied together
Sep 7, 2006

I am really struggling with the json syntax here instead of yaml but I would guess that CidrBlock -> Ref: "10.5.130.0/27" is not a valid usage of Ref. IIRC it's only used for referring to other logical resource ids or parameter names (ex: "subnet0c41cd3e1702cc8a8"), it can't be used for composing string values like you have here.

Instead I think you can just use Sub directly?
code:
 "CidrBlock": "Fn::Sub": [
	"${sub_region_CIDR}.130.0/27", {  "sub_region_CIDR": { "Fn::FindInMap" : [ "Region", {  "Ref" : "AWS::Region"  },  "regionCIDR2Octet"]}}
]
Or, in yaml :shobon::
code:
Type: AWS::EC2::Subnet
Properties:
  CidrBlock: !Sub
    - '${sub_region_CIDR}.130.0/27'
    - { sub_region_CIDR: !FindInMap [ Region, !Ref 'AWS::Region', regionCIDR2Octet ] }
edit: I think in this case the error message is because the value of the Ref is a dictionary, not a string.

12 rats tied together
Sep 7, 2006

Twerk from Home posted:

How do you guys successfully handle IAM roles for whatever process is doing your deployments?

I'm having a hard time striking a balance between permissiveness and actual practical ability to deploy applications that are actively changing and evolving. Any type of least-privilege role for deployment has to be constantly updated whenever we integrate a new AWS feature, and nobody's going to prioritize removing unused permissions from the role when we stop using something so it doesn't stay a least-privilege role at all.

This is always some form of whack-a-mole but experience and muscle memory can help a lot. The useful google incantation here is usually "actions context conditions <service name>", which will pull up the IAM documentation that fully enumerates all of the things available to build a policy for a service. You can use this plus your CDK output to vet any least privilege policy, and then integrate a permissions audit process at whatever cadence your security team works at to make sure that the permissions are actually being used.

It's relatively simple to take, for example, 90 days of cloudtrail event data and parse it for "all actions granted to principal x that do not appear as the action in any of these cloudtrail events". I've done this with SumoLogic, but you can probably work something out with Athena or CloudWatch Log Insights or whatever they're calling it these days.

The tricky part here is when you have to start evaluating "did this principal exercise this very specific s3 bucket + prefix permission?", which gets complicated because s3 permissions can have wildcards. I've used python's glob library for this in the past which was easy enough, but, in general it's good to avoid complicated s3 permissions that need to be audited in this way.
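
The diff itself is barely any code once the data is exported -- a sketch, assuming you've already pulled the granted actions off the principal's policies and the observed actions out of ~90 days of CloudTrail (the glob handling is the fnmatch trick mentioned above, and the lists here are invented):
code:
from fnmatch import fnmatch

granted_actions = ["s3:GetObject", "s3:PutObject", "sqs:*", "ec2:DescribeInstances"]
observed_actions = ["s3:GetObject", "sqs:SendMessage", "sqs:ReceiveMessage"]

# A granted action counts as used if any observed action matches its pattern.
unused = [
    pattern for pattern in granted_actions
    if not any(fnmatch(observed.lower(), pattern.lower()) for observed in observed_actions)
]

print(unused)  # ['s3:PutObject', 'ec2:DescribeInstances']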

12 rats tied together
Sep 7, 2006

In general this is where I switch to python/boto3 but the thing you're looking for here is the --query param, where you can do a filter plus starts_with. Googling "awscli query starts with" will get you some examples.

boto3 is really really good though and this type of work is much more intuitive in it

e: sorry, misread, you probably want "contains" instead of "starts with". The thing in use here is called JMESPath if you want to read the full specification.
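
For comparison, the boto3 version of that kind of "contains" filtering is just a normal loop you can actually debug (the "web" substring and the Name tag are invented for the example):
code:
import boto3

ec2 = boto3.client("ec2")

# Instance IDs whose Name tag contains "web" -- same idea as a JMESPath
# contains() filter in --query, but in plain python.
matching = []
for page in ec2.get_paginator("describe_instances").paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if "web" in tags.get("Name", ""):
                matching.append(instance["InstanceId"])

print(matching)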

12 rats tied together
Sep 7, 2006

Using AWS native features only, 2 options immediately spring to mind:

1: Use whatever you're creating the user and group membership with to also create a scheduled lambda function that runs 1 week after creation which does the user cleanup.
2: Instead of putting a user in a group, create an IAM Role, and apply a policy to the user that allows sts:AssumeRole only between now and one week from now.
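
Option 2 as a sketch -- the policy you'd put on the user looks something like this, with the role ARN, user name, and dates all invented (aws:CurrentTime handles the expiry for you):
code:
import json
import boto3

time_limited_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::123456789012:role/temporary-project-access",
            "Condition": {
                "DateGreaterThan": {"aws:CurrentTime": "2019-09-13T00:00:00Z"},
                "DateLessThan": {"aws:CurrentTime": "2019-09-20T00:00:00Z"},
            },
        }
    ],
}

boto3.client("iam").put_user_policy(
    UserName="temporary-user",
    PolicyName="assume-role-for-one-week",
    PolicyDocument=json.dumps(time_limited_policy),
)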

12 rats tied together
Sep 7, 2006

ECS has way better cloudformation support, is the main reason to stick with it.

12 rats tied together
Sep 7, 2006

I worked at an org that went from AWS only to AWS + GCP + Azure + Alibaba Cloud and it was fine, easier than Cisco -> Juniper or Windows -> Linux IMO.

The main thing I found was that all of the non-AWS cloud providers are awful, except for Alibaba Cloud, which is basically a copy paste and global find replace of AWS.

12 rats tied together
Sep 7, 2006

Yeah if you can structure your AWS account such that you only need to audit the inputs to the account, that ends up being an order of magnitude easier to manage, audit, and scale. It also ends up being an order of magnitude cheaper but the costs for Config, GuardDuty, CloudWatch, etc., are so minimal compared to the usual suspects in AWS that it's not a big deal.

This also ends up increasing developer productivity long term; you just have to be able to actually do it and continue doing it, which is beyond most teams. I've only worked at a single place that really managed to pull it off, but it was wildly successful: we went from "0 security except accidental" to "the federal government says it's OK to give us census data" in about 6 months with a single infosec hire, and he was really more of a compliance engineer than he was "secops" or whatever.

12 rats tied together
Sep 7, 2006

The original assume role event has a userIdentity field, example from a quick docs search: https://docs.aws.amazon.com/IAM/latest/UserGuide/cloudtrail-integration.html#cloudtrail-integration_examples-sts-api

The contents of the userIdentity field depends on a bunch of things, since basically every IAM principal in AWS can assume a role in some way. If you're looking for AD auth you probably want AssumeRoleWithSAML, which will have a userIdentity that includes a principal id, identity provider, and a user name which should be able to identify your AD user.

e: my apologies, web console logins have their own event: https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-aws-console-sign-in-events.html

but same info applies re: userIdentity

12 rats tied together fucked around with this message at 17:58 on Aug 31, 2021

12 rats tied together
Sep 7, 2006

double post for formatting

I checked this out at $current_employer and AFAICT the full auth flow for Web Interface via Azure AD goes like -

1. AssumeRoleWithSAML (user agent aws-internal) fires,
1.a. you can parse identityProvider + userName out of the userIdentity field and this points to an Azure AD user
1.b. you probably want "responseElements.assumedRoleUser.assumedRoleId"

2. ConsoleLogin fires (user agent will be your web browser) + SwitchRole fires (when accessing a different aws account)
2.a. responseElements.assumedRoleUser.assumedRoleId (from above) == userIdentity.principalId

3. Events fire when interacting with AWS Services, in my case: "LookupEvents"
3.a. responseElements.assumedRoleUser.assumedRoleId (from above) == userIdentity.principalId


You shouldn't need to do anything with ASIA- access key id unless you're tracing down IAM user -> STS assumed role. Since the original user comes from an SSO dance, you can work purely with assumedRoleId, I believe.
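
Sketched out, the correlation is just dictionary access on the two event records -- both records below are heavily abbreviated and every value is made up:
code:
import json

# Abbreviated AssumeRoleWithSAML record.
saml_event = json.loads("""
{
  "eventName": "AssumeRoleWithSAML",
  "userIdentity": {"identityProvider": "https://sts.windows.net/tenant-id/", "userName": "someone@example.com"},
  "responseElements": {"assumedRoleUser": {"assumedRoleId": "AROAEXAMPLEID:someone@example.com"}}
}
""")

# Abbreviated follow-on record (ConsoleLogin, SwitchRole, LookupEvents, ...).
later_event = json.loads("""
{
  "eventName": "LookupEvents",
  "userIdentity": {"type": "AssumedRole", "principalId": "AROAEXAMPLEID:someone@example.com"}
}
""")

role_id = saml_event["responseElements"]["assumedRoleUser"]["assumedRoleId"]
if later_event["userIdentity"]["principalId"] == role_id:
    who = saml_event["userIdentity"]["userName"]
    print(f"{later_event['eventName']} was performed by {who}")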

I prefer not doing SSO for this reason, so I only have to worry about one credential-link path, and it's one that I fully understand and have control over every part of.

12 rats tied together fucked around with this message at 18:35 on Aug 31, 2021

12 rats tied together
Sep 7, 2006

phone posting so forgive the terse formatting: short answer yes. the terraform thing for this is called a "reference to resource attribute" and you can find it in the docs under terraform language -> expressions -> like halfway down

as long as your resource is defined in the same file (not exactly correct but trying to keep it simple), you do not need to have terraform output anything for a "next step"

the value proposition from terraform is that it is smart enough to look at all of your resource references and go "oh, $that needs to happen first so that we can realize the value for $this" and then do everything for you in the right order

12 rats tied together
Sep 7, 2006

terraform is the most popular tool in this space, yeah. i prefer pulumi to the aws cdk but they are both fine

e: if you decide you want to get a job touching terraform i would look up the following addon tools asap to make your eventual job suck less:

- terraform-landscape
- terragrunt

12 rats tied together fucked around with this message at 01:20 on Sep 3, 2021

Adbot
ADBOT LOVES YOU

12 rats tied together
Sep 7, 2006

I don't think there exists a "best practice" in this space, yet. EC2 lets you run cloud-init through userdata, which is one of the most portable and effective tools in this space; I highly recommend getting started with it. In larger environments it has some ergonomics issues that make it difficult to scale past a certain size, geographical, or complexity threshold, but it's very very good and even in those larger environments is often still used.

The "terraform only" answer for that is provisioners, they're pretty bad though (more info in the link). Another solution in this space is to build an AMI with your nginx stuff already configured, and then just launch that AMI. There are a bunch of ways to build AMIs but packer is a great place to start.
