Comradephate
Feb 28, 2009

College Slice
elasticache is just redis.

redis is a fine message queue, though. It is, however, a very bad database.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
redis is a fine place to store things you don’t care about. if you don’t care about what you’re putting in the queue, then yes it’s good for that. If your application’s correctness relies on a message being received then redis is a bad choice

Volguus
Mar 3, 2009

Ploft-shell crab posted:

redis is a fine place to store things you don’t care about. if you don’t care about what you’re putting in the queue, then yes it’s good for that. If your application’s correctness relies on a message being received then redis is a bad choice

What would be a good and reliable message queue then?

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
It turns out Celery supports Postgres (and Sqlite, Mysql, etc) as a backend, and my app is nowhere near ~~webscale~~ and already uses Postgres, so I'm just gonna use that. One less moving part to worry about.
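
For reference, a minimal sketch of that setup, assuming Celery's SQLAlchemy broker transport and database result backend (the connection strings and the task are placeholders, and the SQLAlchemy broker is documented as experimental, which is fine for a decidedly non-webscale app):

```python
# Minimal Celery app using Postgres for both the broker (via kombu's
# SQLAlchemy transport) and the result backend. Connection strings and
# task names are placeholders, not anyone's real config.
from celery import Celery

app = Celery(
    "tasks",
    broker="sqla+postgresql://scott:tiger@localhost/celerydb",   # queue rows live in Postgres
    backend="db+postgresql://scott:tiger@localhost/celerydb",    # results live in Postgres too
)

@app.task
def add(x, y):
    return x + y

# Usage: start a worker with `celery -A tasks worker`, then:
#   result = add.delay(2, 3)
#   result.get(timeout=10)  # -> 5
```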

Votlook
Aug 20, 2005

Volguus posted:

What would be a good and reliable message queue then?

RabbitMQ or ActiveMQ, maybe Kafka if you are webscale.

Votlook fucked around with this message at 22:58 on May 13, 2019

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

Votlook posted:

RabbitMQ or ActiveMQ, maybe Kafka if you are webscale.

Reliable was one of the requirements

Cancelbot
Nov 22, 2006

Canceling spam since 1928

RabbitMQ surely is consistent though? even if it misses out the whole "available" and "partition tolerant" parts of the fire CAP triangle.

Ask me how we run a 3 node RabbitMQ cluster in loving Windows and watch it burn because Windows cluster aware updating will fail to work or occasionally drop the RabbitMQ disks.

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

Cancelbot posted:

RabbitMQ surely is consistent though? even if it misses out the whole "available" and "partition tolerant" parts of the fire CAP triangle.

Ask me how we run a 3 node RabbitMQ cluster in loving Windows and watch it burn because Windows cluster aware updating will fail to work or occasionally drop the RabbitMQ disks.

Holy poo poo that sounds awful.

I used to operate multiple clusters running 3.6.5 which was before they distributed the stats db, so under heavy load the node where the stats db landed would continue to eat up memory until it OOMed and the cluster partitioned. At one point we had a runbook for memory usage on rabbit nodes that basically linked to this page: https://www.rabbitmq.com/management.html#stats-db and told the person on-call to keep trying to reset the stats db until it actually took.

That was the worst on-call experience of my entire career. Of course even though upgrading to a later version would have fixed it, the engineering team decided to migrate all of their queues to SQS which meant we finally got to murder rabbit. That was the best on-call experience of my entire career.

Docjowles
Apr 9, 2009

We run Kafka for a bunch of different use cases and it has been pretty great.

It was a PAIN IN THE rear end to get to this point, dealing with all sorts of bugs and config tunings and caveats. But once we finally got the drat thing into a steady state, it just runs and we haven't had problems with it in like a year. In fairness a lot of the annoying poo poo was actually the requisite zookeeper cluster and not Kafka itself. This was also all version ~1.0 and I think the operations story has improved significantly since then.

I have nothing good to say about RabbitMQ.

chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

Votlook posted:

RabbitMQ or ActiveMQ, maybe Kafka if you are webscale.

Has anyone tried NATS? ActiveMQ comes with trigger warnings for me after dealing with poorly tuned deployments being accessed by garbage application code, and RabbitMQ naturally seems to decay to a point where your only option is to shut it down, nuke the Mnesia directory everywhere, and start it all back up again.

Judge Schnoopy
Nov 2, 2005

dont even TRY it, pal
VMware's NSX platform relies on RabbitMQ. It shouldn't surprise you that I have nightly checks for rabbitmq health and a standard workflow to rebuild the comm channels.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
My company has been running NATS for about a year now, and even after an outage due to a rather mundane (but subtle) bug in NATS (we contributed a fix upstream), it’s still been a better experience overall than the 3 years we spent running RabbitMQ for the same functional use case, both on-premise and hosted. The whole company has decidedly dropped RabbitMQ and is about to migrate the last service off of it to NATS. Our scale is nothing like what most others probably deal with routinely (single node NATS without bothering with multi-tenancy? Yeah....) but I’ll accept that for being paged maybe once per year. However, I’m familiar with much, much larger organizations that haven’t hit the scale we routinely see with NATS, mostly because we went live with real production workloads rather than some half-rear end POCs.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Volguus posted:

What would be a good and reliable message queue then?

Persistent message queues at the center of your system — your “enterprise message bus” — add an unnecessary level of indirection and add a centralized point of failure. Use service discovery for sending messages and/or a reliable persistent store if you have any need to handle your messages transactionally.
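
If you go the reliable-persistent-store route, the usual shape is a jobs table that workers claim transactionally. A rough sketch against Postgres, with made-up table and column names and a placeholder handler for the business logic:

```python
# Rough sketch of a Postgres-backed queue: a worker claims one message
# inside a transaction using SKIP LOCKED, so a crash before commit puts
# the row back into play. Assumes a table like:
#   CREATE TABLE messages (id bigserial PRIMARY KEY,
#                          payload jsonb,
#                          processed_at timestamptz);
import psycopg2

conn = psycopg2.connect("dbname=app user=app")

def handle(payload):
    print("processing", payload)      # placeholder for real business logic

def process_one():
    with conn:                        # commit on success, rollback on exception
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT id, payload FROM messages
                WHERE processed_at IS NULL
                ORDER BY id
                FOR UPDATE SKIP LOCKED
                LIMIT 1
                """
            )
            row = cur.fetchone()
            if row is None:
                return False          # queue is empty
            msg_id, payload = row
            handle(payload)
            cur.execute(
                "UPDATE messages SET processed_at = now() WHERE id = %s",
                (msg_id,),
            )
    return True
```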

Nomnom Cookie
Aug 30, 2009



Blinkz0rz posted:

That was the worst on-call experience of my entire career. Of course even though upgrading to a later version would have fixed it, the engineering team decided to migrate all of their queues to SQS which meant we finally got to murder rabbit. That was the best on-call experience of my entire career.

SQS is good. You put messages in and take messages out and it doesn't fall over ever.
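
For anyone who hasn't used it, the whole put-messages-in, take-messages-out surface is about three boto3 calls; the queue URL below is a placeholder:

```python
# Send, receive, delete: basically the entire SQS workflow.
# Queue URL is a placeholder; long polling (WaitTimeSeconds) cuts down
# on empty receives.
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"

sqs.send_message(QueueUrl=queue_url, MessageBody='{"event": "hello"}')

resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,   # long poll
)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    # delete explicitly, or the message reappears after the visibility timeout
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```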


Ploft-shell crab posted:

Persistent message queues at the center of your system — your “enterprise message bus” — add an unnecessary level of indirection and add a centralized point of failure. Use service discovery for sending messages and/or a reliable persistent store if you have any need to handle your messages transactionally.

Something has gone badly wrong if calling a persistent store "reliable" isn't trivially true.

Nomnom Cookie
Aug 30, 2009



SNS is also good.

Glue is not good.

Route 53 is good.

S3 is pretty good.

Data Pipeline is bad.

EC2 kinda sucks for complexity, but it's a lot cheaper than Fargate.

Methanar
Sep 26, 2013

by the sex ghost
Just use SQS or google's equivalent if you're in the cloud. vendor lock in be damned.

poo poo like 'oh hey our homegrown messaging system is fundamentally flawed and we're just going to be constantly losing messages forever because of it' is a complete waste of everyone's time.

Bruegels Fuckbooks
Sep 14, 2004

Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

Kevin Mitnick P.E. posted:

Something has gone badly wrong if calling a persistent store "reliable" isn't trivially true.

Not that I'm agreeing with that other guy about message queues, but it appears to be surprisingly difficult for this web 3.0 gaiden poo poo to enforce read after write consistency on stuff like redis/rabbitmq/etc. once you start introducing HA solutions.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Kevin Mitnick P.E. posted:

Something has gone badly wrong if calling a persistent store "reliable" isn't trivially true.

and yet that’s the operational reality of running rabbit/activemq/redis in my experience. something bad happens in the queue - it fills up, slows down, or is losing messages somehow, so you bounce or purge it to fix it. I hope your devs took into account that their “reliable” queue could drop everything on the floor at any time.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Methanar posted:

Just use SQS or google's equivalent if you're in the cloud. vendor lock in be damned.

poo poo like 'oh hey our homegrown messaging system is fundamentally flawed and we're just going to be constantly losing messages forever because of it' is a complete waste of everyone's time.
SQS is a good fit for things like notification processing or deferred task queuing, but it's an extremely poor fit for generalized messaging between application components. Send latency is good but not great, but receive latency might be in the 1s+ range under normal operating conditions. Throughput-wise, with 25 consumers/producers you might hit 80,000 transactions per second. Compare to Kafka, which in this configuration would easily be able to hit 10M+ messages per second.

Ploft-shell crab posted:

and yet that’s the operational reality of running rabbit/activemq/redis in my experience. something bad happens in the queue - it fills up, slows down, or is losing messages somehow, so you bounce or purge it to fix it. I hope your devs took into account that their “reliable” queue could drop everything on the floor at any time.
Running a reliable queue does not itself get you reliability across your messaging or queuing stack. Defense in depth, including understanding operational procedures in advance so you don't have to routinely drop things on the floor, is the only viable option.

Docjowles posted:

We run Kafka for a bunch of different use cases and it has been pretty great.

It was a PAIN IN THE rear end to get to this point, dealing with all sorts of bugs and config tunings and caveats. But once we finally got the drat thing into a steady state, it just runs and we haven't had problems with it in like a year. In fairness a lot of the annoying poo poo was actually the requisite zookeeper cluster and not Kafka itself. This was also all version ~1.0 and I think the operations story has improved significantly since then.

I have nothing good to say about RabbitMQ.
ZK tuning has gotten a lot easier with recent Kafka, because they re-did the protocol so that partition offsets are stored in a Kafka topic themselves, and only the Kafka brokers need to speak to ZK. It's made things a lot less chatty. (For reference, our prod cluster has nine Kafka brokers and several thousand producers and consumers working on the same topic. To call this an order of magnitude difference in load is an understatement.)
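
Client-side this is transparent: a sketch with kafka-python (topic, group id, and brokers invented) where committed offsets land in Kafka's internal __consumer_offsets topic rather than in ZooKeeper:

```python
# Consumer whose offsets are stored broker-side in __consumer_offsets,
# which is the behavior described above. All names are placeholders.
from kafka import KafkaConsumer

def process(value):
    print("got", value)               # placeholder for real processing

consumer = KafkaConsumer(
    "events",
    bootstrap_servers=["broker1:9092", "broker2:9092"],
    group_id="example-consumer-group",
    enable_auto_commit=False,         # commit explicitly after processing
)

for record in consumer:
    process(record.value)
    consumer.commit()                 # offset goes to a Kafka topic, not ZK
```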

Also, already-connected consumers and producers will actually tolerate a short complete ZooKeeper outage as long as the Kafka cluster topology doesn't need to change.

Vulture Culture fucked around with this message at 15:18 on May 14, 2019

JHVH-1
Jun 28, 2002

Kevin Mitnick P.E. posted:

SNS is also good.

Glue is not good.

Route 53 is good.

S3 is pretty good.

Data Pipeline is bad.

EC2 kinda sucks for complexity, but it's a lot cheaper than Fargate.

The one stack I moved to Fargate cut our costs by at least 66% versus EC2.
All this stuff is situational and you just gotta do your research and testing. It's fine seeing what other people used, but there are so many variables depending on what you need to do that it isn't always good vs bad.

Comradephate
Feb 28, 2009

College Slice

Kevin Mitnick P.E. posted:

Glue is not good.

not empty quoting

Docjowles
Apr 9, 2009

What do you all hate about Glue? One of our dev teams is about to start using it heavily, so this could be interesting. So far their only complaint is that the documentation is a total tire fire.

Hughlander
May 11, 2005

Vulture Culture posted:

SQS is a good fit for things like notification processing or deferred task queuing, but it's an extremely poor fit for generalized messaging between application components. Send latency is good but not great, but receive latency might be in the 1s+ range under normal operating conditions. Throughput-wise, with 25 consumers/producers you might hit 80,000 transactions per second. Compare to Kafka, which in this configuration would easily be able to hit 10M+ messages per second.

At that scale aren't you looking at Kinesis not SQS?

Stubb Dogg
Feb 16, 2007

loskat naamalle

chutwig posted:

Has anyone tried NATS? ActiveMQ comes with trigger warnings for me after dealing with poorly tuned deployments being accessed by garbage application code, and RabbitMQ naturally seems to decay to a point where your only option is to shut it down, nuke the Mnesia directory everywhere, and start it all back up again.
We've been using NATS with some pretty heavy loads and it's been reliable and performance has been great. The only thing to watch out for is possible latency issues if you have a constant high-volume stream of small messages and then also push some very large request-reply messages.

12 rats tied together
Sep 7, 2006

Kevin Mitnick P.E. posted:

[...]
EC2 kinda sucks for complexity, but it's a lot cheaper than Fargate.
In my experience ec2 complexity is easier to manage if you totally abandon trying to use the ec2 api at all and just shove everything into autoscaling groups and suspend all of the scaling processes.
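
A minimal sketch of that pattern with boto3 (the ASG name and the exact process list are illustrative; which processes you suspend is a judgment call):

```python
# "Park everything in an ASG, then turn the autoscaling off": suspend the
# scaling processes so the group only does what you explicitly tell it to.
import boto3

asg = boto3.client("autoscaling")

asg.suspend_processes(
    AutoScalingGroupName="example-app-asg",
    ScalingProcesses=[
        "AlarmNotification",
        "ScheduledActions",
        "AZRebalance",
        "ReplaceUnhealthy",   # drop this one if you still want self-healing
    ],
)

# Capacity changes are now just number edits:
asg.set_desired_capacity(AutoScalingGroupName="example-app-asg", DesiredCapacity=5)
```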

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Hughlander posted:

At that scale aren't you looking at Kinesis not SQS?
Possibly. Kinesis is really intended as an ingest mechanism for S3 or RedShift, with hooks to invoke Lambda along the way. Its performance characteristics are really well-suited to event sourcing and data ingest, but it's fairly high-latency for a real-time messaging system.

It's 2019, and with modern service mesh approaches, it's not clear to me that this kind of architecture is terribly valuable to most people anymore. It's probably most important to folks who need robust real-time signaling for things like WebRTC at scale, which is somewhat in conflict with the durability guarantees provided by a system like Kafka.

I was going to write something about how I'm the dumbass streaming realtime video through Kafka, but it turns out Amazon does this as a product on Kinesis now.

Hughlander
May 11, 2005

Vulture Culture posted:

Possibly. Kinesis is really intended as an ingest mechanism for S3 or RedShift, with hooks to invoke Lambda along the way. Its performance characteristics are really well-suited to event sourcing and data ingest, but it's fairly high-latency for a real-time messaging system.

It's 2019, and with modern service mesh approaches, it's not clear to me that this kind of architecture is terribly valuable to most people anymore. It's probably most important to folks who need robust real-time signaling for things like WebRTC at scale, which is somewhat in conflict with the durability guarantees provided by a system like Kafka.

I was going to write something about how I'm the dumbass streaming realtime video through Kafka, but it turns out Amazon does this as a product on Kinesis now.

Ouch. My last Kafka experience was GDPR anonymization with up to 6-hour latency on the queues.

Comradephate
Feb 28, 2009

College Slice

Docjowles posted:

What do you all hate about Glue? One of our dev teams is about to start using it heavily, so this could be interesting. So far their only complaint is that the documentation is a total tire fire.

The implementation is also a total tire fire. Errors/failures are inscrutable, it only supports python 2.7, and the networking story is completely bonkers. Where your ENI ends up is controlled indirectly by what device your database connector is associated with. We have had two separate minor outages caused by idiots with too many permissions making networking changes to try to get glue working.

I actually don’t understand how glue can be such a shitshow compared to lambda.

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
data is always way way harder than code

Nomnom Cookie
Aug 30, 2009



Docjowles posted:

What do you all hate about Glue? One of our dev teams is about to start using it heavily, so this could be interesting. So far their only complaint is that the documentation is a total tire fire.

Documentation is like the one good thing about AWS. AWS things are complicated, with weird limitations, surprising behaviors, and baffling gaps in functionality. BUT that's all documented and communing long enough with the developer guide lets you bash together a thing that works. Bad docs is enough of a reason on its own to avoid an AWS thing IMO.

That aside, I tried to use Glue's crawling thing. A custom regex classifier for ELB logs didn't work. Yes, I spent a few hours testing the regex in various ways before concluding the problem wasn't me. Then I turned it loose on a large partitioned Parquet table to see how it would handle a trivial use case. It created 100,000 tables in Athena and would've made more if that wasn't the hard cap on tables per DB. The catalog API works well, though. Deleting the garbage tables only took 15 minutes or so!

After that experience I don't think I'll ever touch Glue ever again, if I can possibly avoid it.

12 rats tied together posted:

In my experience ec2 complexity is easier to manage if you totally abandon trying to use the ec2 api at all and just shove everything into autoscaling groups and suspend all of the scaling processes.

do you want to use on-demand or spot in your ASG. how much spot? do you want multiple instance pools. what's your bid. are your autoscaled nodes hooked into an ALB. tags, subnets, security groups, AMI, key pair. After making a few dozen decisions you're ready to spin up an instance. To be clear, I'm not complaining. EC2 is flexible and that costs complexity.
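
To illustrate, most of those decisions end up as fields on a single call; a hedged boto3 sketch, with every name, subnet, ARN, and percentage invented:

```python
# Every decision in that list shows up somewhere in the create call.
# All identifiers below are placeholders for illustration only.
import boto3

asg = boto3.client("autoscaling")

asg.create_auto_scaling_group(
    AutoScalingGroupName="example-asg",
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",   # subnets
    TargetGroupARNs=[                                      # ALB hookup
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/abc123"
    ],
    Tags=[{"Key": "team", "Value": "platform", "PropagateAtLaunch": True}],
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "example-template",  # AMI, key pair, SGs live here
                "Version": "$Latest",
            },
            "Overrides": [                                 # multiple instance pools
                {"InstanceType": "m5.large"},
                {"InstanceType": "m5a.large"},
            ],
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 50,     # how much spot
            "SpotMaxPrice": "",                            # "" = cap bid at on-demand price
        },
    },
)
```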

JHVH-1 posted:

The one stack I moved to Fargate cut our costs by at least 66% versus EC2.
All this stuff is situational and you just gotta do your research and testing. It's fine seeing what other people used, but there are so many variables depending on what you need to do that it isn't always good vs bad.

Right, so...what was going on there that you had like 80% overhead on your EC2 stuff? Same capacity on Fargate is significantly more expensive than EC2 ondemand. At tiny scale or low duty cycle there are decisions to be made on EC2 vs Fargate vs Lambda, but I don't see a way you're going to move 200 vcpu of containers to Fargate and save money.

Gyshall
Feb 24, 2009

Had a couple of drinks.
Saw a couple of things.

StabbinHobo posted:

data is always way way harder than code

In my "spare" time, I've been taking up R an similar languages just to see how I can more effectively handle data sets in a more immutable way in our teams pipelines. Data is still a very weak part of my toolbox because every app/service I've ever worked on choked on a database with an empty schema :smith:

12 rats tied together
Sep 7, 2006

Kevin Mitnick P.E. posted:

do you want to use on-demand or spot in your ASG. how much spot? do you want multiple instance pools. what's your bid. are your autoscaled nodes hooked into an ALB. tags, subnets, security groups, AMI, key pair. After making a few dozen decisions you're ready to spin up an instance. To be clear, I'm not complaining. EC2 is flexible and that costs complexity.

Totally - it gets even worse when you start running into capacity issues on the AWS side like, we literally cannot sell you any more i3.8xlarge in this AZ, but your application requires a cluster placement group, so you're SOL.

Shoving everything into an ASG lets you defer capacity and scaling issues until AWS is ready to let you give them money; you can sit there and let the group fail to launch the i3s until some of them open up. It's basically an absolute requirement for me from an automation perspective; I hate having to revert feature branch merges because we couldn't tell (and there's no API for asking) that a particular environment is at capacity.

It's also nice to be able to answer the "how do we launch more of these?" question with "just change the numbers on the asg". Lifecycle hooks are great too. I had an internal application launch requested in like a sub-30 node proof of concept configuration and then a followup issue to scale the application to 1024 nodes, the only thing I had to do was change some numbers and mark the ticket as done.

Complex subnet mutations are totally possible too: We filled up subnet X, but we need 100 more nodes, and we need you to not turn off the old nodes until we can shift traffic over to the new nodes in the bigger subnet. Just add the new subnet, change the number, wait for your folks to shift traffic, set the termination policy to OldestInstance, lower the number, wait for the ASG to terminate the instances, remove the old subnet, and then turn the number back up. You can do it all in maybe 30 minutes and only have to merge 2 pull requests into your terraform/cloudformation/whatever repository.

The best part is though, if you can't do it in 30 minutes, you just merge the first pull request and wait. Letting the ASG schedule instances for you works really well with declarative infrastructure management.
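
For reference, the subnet swap above boils down to a handful of update calls under the hood; a rough boto3 sketch with invented names and capacities (in practice you'd make these edits in your terraform/cloudformation and merge the two pull requests instead):

```python
# The subnet-swap dance described above, expressed as raw API calls.
# All names, subnet IDs, and numbers are invented.
import boto3

asg = boto3.client("autoscaling")
name = "example-app-asg"

# 1. add the new, bigger subnet alongside the full one and scale out
asg.update_auto_scaling_group(
    AutoScalingGroupName=name,
    VPCZoneIdentifier="subnet-oldfull01,subnet-newbig02",
    MaxSize=200,
    DesiredCapacity=200,
)

# ... wait for traffic to shift onto the new nodes ...

# 2. prefer killing the old nodes first, then scale back down
asg.update_auto_scaling_group(
    AutoScalingGroupName=name,
    TerminationPolicies=["OldestInstance"],
    DesiredCapacity=100,
)

# 3. once the old nodes are gone, drop the old subnet and turn the number back up
asg.update_auto_scaling_group(
    AutoScalingGroupName=name,
    VPCZoneIdentifier="subnet-newbig02",
    DesiredCapacity=200,
)
```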

Nomnom Cookie
Aug 30, 2009



12 rats tied together posted:

Totally - it gets even worse when you start running into capacity issues on the AWS side like, we literally cannot sell you any more i3.8xlarge in this AZ, but your application requires a cluster placement group, so you're SOL.

Shoving everything into an ASG lets you defer capacity and scaling issues until AWS is ready to let you give them money; you can sit there and let the group fail to launch the i3s until some of them open up. It's basically an absolute requirement for me from an automation perspective; I hate having to revert feature branch merges because we couldn't tell (and there's no API for asking) that a particular environment is at capacity.

It's also nice to be able to answer the "how do we launch more of these?" question with "just change the numbers on the asg". Lifecycle hooks are great too. I had an internal application launch requested in like a sub-30 node proof of concept configuration and then a followup issue to scale the application to 1024 nodes, the only thing I had to do was change some numbers and mark the ticket as done.

Complex subnet mutations are totally possible too: We filled up subnet X, but we need 100 more nodes, and we need you to not turn off the old nodes until we can shift traffic over to the new nodes in the bigger subnet. Just add the new subnet, change the number, wait for your folks to shift traffic, set the termination policy to OldestInstance, lower the number, wait for the ASG to terminate the instances, remove the old subnet, and then turn the number back up. You can do it all in maybe 30 minutes and only have to merge 2 pull requests into your terraform/cloudformation/whatever repository.

The best part is though, if you can't do it in 30 minutes, you just merge the first pull request and wait. Letting the ASG schedule instances for you works really well with declarative infrastructure management.

Oh yeah, ASG is key. One guy I worked with a while back preferred to put everything in ASG, even single instances. His stated reason (what if AWS terminates an instance out from under us?) never did happen in the time we were doing that, but it was awfully convenient in other ways. Mainly from CloudFormation updating a launch configuration instead of messing with our running instances.

Votlook
Aug 20, 2005

Kevin Mitnick P.E. posted:

Oh yeah, ASG is key. One guy I worked with a while back preferred to put everything in ASG, even single instances. His stated reason (what if AWS terminates an instance out from under us?) never did happen in the time we were doing that, but it was awfully convenient in other ways. Mainly from CloudFormation updating a launch configuration instead of messing with our running instances.

Yeah, I've used a setup like that with pre-baked AMIs: on update we would just configure the launch configuration to use the new AMI, and the ASG would take care of replacing the old instance with the new instance. Very convenient.
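
A rough boto3 sketch of that rollout, with placeholder names and AMI id; new launches pick up the new AMI, and cycling the old instances (CloudFormation rolling update, manual terminate, whatever) lets the group backfill with the new config:

```python
# Pre-baked-AMI rollout: create a new launch configuration pointing at the
# new AMI, point the ASG at it, and let the group replace instances.
# All identifiers are placeholders.
import boto3

asg = boto3.client("autoscaling")

asg.create_launch_configuration(
    LaunchConfigurationName="example-app-v42",
    ImageId="ami-0123456789abcdef0",   # the freshly baked AMI
    InstanceType="m5.large",
    SecurityGroups=["sg-0aaa1111"],
    KeyName="example-key",
)

asg.update_auto_scaling_group(
    AutoScalingGroupName="example-app-asg",
    LaunchConfigurationName="example-app-v42",
)
```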

Cancelbot
Nov 22, 2006

Canceling spam since 1928

Launch templates + EC2/spot fleets are great too - "When I scale, I want whatever you have to fill the gap, ordered by cheapest/my preference". Most of the work is in instance provisioning anyway, so it's worth spending that extra 10% of effort to put it into an ASG.

12 rats tied together
Sep 7, 2006

Cancelbot posted:

Launch templates + EC2/spot fleets are great too - "When I scale, I want whatever you have to fill the gap, ordered by cheapest/my preference". Most of the work is in instance provisioning anyway, so it's worth spending that extra 10% of effort to put it into an ASG.

I'm really excited for EC2 fleets to get CloudFormation support, yeah. Having to use AWS::AutoScaling::AutoScalingGroup is a little confusing/intimidating for folks we hire that come from, say, a vmware background or something. It would be nice to be able to declare "I want X instances of types Y,Z with properties blah blah" in a template and have it be really obvious what the intent is.

One minor problem is I don't believe you can permanently suspend scaling processes with cloudformation alone. In ansible we just have the cloudformation module push the stack, the stack outputs contain the ASG name, and ansible uses that name with the ec2_asg module to manage the scaling processes. I believe terraform lets you statically set scaling processes on the asg resource, which I will begrudgingly admit is rather nice.

Votlook posted:

Yeah, I've used a setup like that with pre-baked AMIs: on update we would just configure the launch configuration to use the new AMI, and the ASG would take care of replacing the old instance with the new instance. Very convenient.

We have a couple scaling groups that use cloud-init's phone_home directive to reach out to ansible AWX, which allows us to do some interesting things like rolling updates to the entire ASG every time a new node joins, or "restart the brokers for this cluster, one at a time, wait for the prometheus under-replicated partition metric to hit 0 after restarting, restart the ActiveController broker last, push a grafana annotation when we start and another when we finish".

That particular example ended up not being necessary though. Most of the time it's just "add the new node to some config files on the other nodes, restart services on them in small batches, maybe make some api calls to register yourself with other systems".

Comradephate
Feb 28, 2009

College Slice
CloudFormation seems pretty indefensible to me. How can an open source third party solution consistently support new products and features faster, and apply diffs faster than the AWS first-party approach?

JHVH-1
Jun 28, 2002

Comradephate posted:

CloudFormation seems pretty indefensible to me. How can an open source third party solution consistently support new products and features faster, and apply diffs faster than the AWS first-party approach?

They added this feature to the ECS dashboard months ago and I am now waiting for it to be added to cfn just so I can include it in my template https://github.com/aws/containers-roadmap/issues/97#issuecomment-493040608

So even AWS can lag behind their own features it seems.

JehovahsWetness
Dec 9, 2005

bang that shit retarded

JHVH-1 posted:

They added this feature to the ECS dashboard months ago and I am now waiting for it to be added to cfn just so I can include it in my template https://github.com/aws/containers-roadmap/issues/97#issuecomment-493040608

So even AWS can lag behind their own features it seems.

Same here, secret injection from SSM seems like a no-brainer but how the hell did CFN support for this get so far behind the console?

( My dumb wrinkle is the previous devs here stored secrets _in JSON_ in SSM instead of using paths. It's baked into loving everything, so I did this as a workaround until I decide wtf to do in the next iteration: https://github.com/ian-d/ecs-template )

12 rats tied together
Sep 7, 2006

I assume you're talking about terraform but it is my experience that you're making a mostly untrue statement, although to be fair I stopped using terraform sometime in the middle of 0.11.

I could write you a small book about why I use cfn instead of terraform but I'll try to summarize briefly:
- terraform has no real templating support, the tool is extremely WET, which is unacceptable in my opinion
- terraform plan will straight up lie to you and is, in general, not very useful (especially compared to CloudFormation changesets)
- terraform is extremely slow once you push past ~100 resources, much slower than cloudformation would be with a similar set
- multi-state management in terraform is extremely cumbersome compared to what you can accomplish with standard CloudFormation
- I'm having a hard time communicating how much more useful the "stack" is as an api primitive, compared to a terraform state file, again especially in large environments

This is aside from the obvious problems with terraform that they've been attempting to address in 0.12 with the full rewrite of HCL: that ternary expressions are not lazily evaluated, looping is extremely limited, the "sub block" property of a terraform resource is basically unusable.

Terraform is totally fine to use for limited scope environments. Single applications, maybe a small collection of resources. The syntax is reasonably intuitive, "count" is (while limited in functionality) a great piece of common language you can give to a development team who might not be familiar with using a standard templating language on resource declarations. I think we manage somewhere around 200 unique terraform state environments at my current employer, and it works fine if you are working in (at most) 3 or 4 of them. Your parent stack, their parent stack, any siblings to your stack, and maybe you have some child objects. That's all totally fine -- I am strongly of the opinion that it's a great tool if you are a consumer of some kind of PaaS facsimile.

Managing the parents and grandparents, especially if they need to contain multiple provider and region combos, is a complete disaster. To try and briefly summarize again: the value proposition of terraform is that you just make declarations about your desired state and terraform handles the dependencies and ordering of tasks in order to reach that state. Terraform encourages you to break your states apart into small, isolated states and then link them together with the remote state data source. Terraform has no native mechanisms for declaring dependencies or task ordering across state boundaries.

The realities of actually using the tool in production, in a reasonably complex environment, directly counter the benefits of the tool. You don't have to manage "security group comes before instance", but you do have to manage "AWS" comes before "AWS network" comes before "standardized security groups" comes before "instance of module application" comes before "module children", and you have to do the plan -> review -> apply loop for each layer of inheritance one at a time because you can't accurately terraform plan a change in child 4 until children 1, 2, and 3 have been successfully planned and applied. Your work tree of managing the "plan -> review -> apply" loop explodes exponentially as you add parents, children, and especially children at similar levels of inheritance that need to check each other for outputs.
