12 rats tied together
Sep 7, 2006

Assuming I understand you correctly, AWS::ECS::TaskDefinition takes a list of AWS::ECS::ContainerDefinition objects, which each have an Environment (for plaintext) and Secrets field (for encrypted). The Secrets field has a Name / ValueFrom syntax that maps fairly well to stock K8S api, and the values mapped can come from either AWS Secrets Manager or AWS SSM Parameter Store secrets.
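Sketched as cloudformation yaml, the shape is roughly this (all the names here are made up, and the task also needs an execution role that's allowed to read the secret):

```yaml
Resources:
  AppSecret:
    Type: AWS::SecretsManager::Secret
    Properties:
      Name: /myapp/db-password            # hypothetical secret name

  AppTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: myapp
      ExecutionRoleArn: !Ref TaskExecutionRoleArn   # hypothetical parameter; role must be able to read the secret
      ContainerDefinitions:
        - Name: app
          Image: example/app:latest
          Environment:                    # plaintext values
            - Name: LOG_LEVEL
              Value: info
          Secrets:                        # injected at task launch, never stored in the template
            - Name: DB_PASSWORD
              ValueFrom: !Ref AppSecret   # Secrets Manager ARN; an SSM Parameter Store ARN works here too
```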

The first time I worked with these constructs we were coming from an existing ansible/cloudformation workflow that mapped 100% to these new concepts, especially because AWS::SecretsManager::Secret and AWS::SSM::Parameter are also valid cloudformation objects.

In our case we took our existing ansible-vault -> k8s secret implementation and swapped it to instead create secrets manager secrets, and then mapped those into the TaskDefinitions just like we would for k8s PodSpec.


ILikeVoltron
May 17, 2003

I <3 spyderbyte!

LochNessMonster posted:

Whats the current standard for spinning up a PoC bare metal k8s cluster to show off some standard capabilities (nothing fancy). Kops, Kubeadm or Kubespray? Or should I just run k3s?

I'd likely use either kubeadm (what I'm currently using) or rancher. I just went through updating a k8s cluster from 1.18.x to 1.20.x using nothing but kubeadm and it was fairly painless

The Fool
Oct 16, 2003


Are there any good existing scaffolding tools for terraform?

The Fool
Oct 16, 2003


The Fool posted:

Are there any good existing scaffolding tools for terraform?

Or something generic like Yeoman, only done in Python or Go since I don’t want to recommend installing node if I don’t have to.

drunk mutt
Jul 5, 2011

I just think they're neat

The Fool posted:

Or something generic like Yeoman, only done in Python or Go since I don’t want to recommend installing node if I don’t have to.

What are you actually expecting to happen? I really do wonder why you'd look for another tool to structure this tooling for you. Are you wanting explicit resource types laid out for you (e.g., GCP/AWS/Azure instances) which include auth policies? If you're wanting certain resources provisioned (e.g., GCP instances, AWS instances, etc.) you probably just want to look to see if there's a TF module that already exists.

The tool you mentioned is a bit more like a configuration manager. Terraform is not a "one stop fix it all" tool, it's there to provision resources. Yes, you can use some tricks to make it call local functionality, but even Hashicorp stated a few years back that it's more ideal to have your "config manager" (e.g., Ansible) call TF for the provisioning side of things rather than using the null resource block.

I do apologize if I'm coming off rude, it's not my intention; I deal with this daily and just want to understand what you're expecting.

12 rats tied together
Sep 7, 2006

The best scaffolding tool for terraform is ansible. It doesn't come up very often during searches because if you're going to use terraform you've likely already encountered and chosen not to use ansible, and if you were already using ansible you didn't need terraform in the first place. The integration is there for you and available though, and I've enjoyed using it a lot at past roles when dealing with people who are ideologically biased against ansible for some reason (usually a misunderstanding of YAML).

If you ask terraform people what the best scaffolding tool is, last I heard it was terragrunt. It seems pretty good, but I've never had to use it because I just use ansible instead.

e: To answer the previous question, in the past we used management tools for terraform because the existing language was really bad and lacked support for extremely basic things like "using a variable in a module path", so instead of updating modules with sed we wrapped terraform deployments in ansible so we could use jinja2 to write terraform files with an actual templating language.

I understand this was supposed to get better for terraform in 0.14 but I remember looking at it after the release and finding it to still be extremely limited.
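The wrapper boiled down to little more than a template task that rendered .tf files with jinja2 before shelling out to terraform, something like this (paths and vars are hypothetical):

```yaml
- hosts: localhost
  vars:
    deploy_dir: ./deployments/us-east-1     # hypothetical layout
  tasks:
    - name: render main.tf from a jinja2 template
      template:
        src: templates/main.tf.j2           # can loop, branch, and interpolate module paths freely
        dest: "{{ deploy_dir }}/main.tf"

    - name: apply the rendered configuration
      command: terraform apply -auto-approve
      args:
        chdir: "{{ deploy_dir }}"
```

(There's also a community terraform module for ansible these days that can replace the raw command call.)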

12 rats tied together fucked around with this message at 05:49 on Dec 12, 2020

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

12 rats tied together posted:

The best scaffolding tool for terraform is ansible. It doesn't come up very often during searches because if you're going to use terraform you've likely already encountered and chosen not to use ansible, and if you were already using ansible you didn't need terraform in the first place. The integration is there for you and available though, and I've enjoyed using it a lot at past roles when dealing with people who are ideologically biased against ansible for some reason (usually a misunderstanding of YAML).

If you ask terraform people what the best scaffolding tool is, last I heard it was terragrunt. It seems pretty good, but I've never had to use it because I just use ansible instead.

e: To answer the previous question, in the past we used management tools for terraform because the existing language was really bad and lacked support for extremely basic things like "using a variable in a module path", so instead of updating modules with sed we wrapped terraform deployments in ansible so we could use jinja2 to write terraform files with an actual templating language.

I understand this was supposed to get better for terraform in 0.14 but I remember looking at it after the release and finding it to still be extremely limited.

Do you work at red hat or something? I can't think of a single post you've made that hasn't pushed ansible as a catch-all solution for automation. It's not a bad tool but if someone asks how to scaffold a terraform project the answer isn't some entirely unrelated tool, jfc

Happiness Commando
Feb 1, 2002
$$ joy at gunpoint $$

Terragrunt is ok. It does make your .tf files more DRY, and it is easier to pass data between modules, but TF HCL still seems like it needs way more polish when it comes to functions and variables. And since Terragrunt is a TF wrapper, there's some really dumb stuff it is powerless to do anything about. For instance, one of their selling points is that in multiple-module situations, you can run terragrunt apply-all and have it apply all your modules with one command. Except if you have dependencies and pass outputs from one module as inputs into another, apply-all only works when the upstream module has already been run. If you are starting with a new deployment (or a new output used downstream) you have two choices:
1) Run terragrunt apply individually on each module whose outputs haven't been created yet, because otherwise terraform blows up on the missing values
2) Make dummy inputs so the parser can validate and not blow up, then swap in the real outputs once they exist.

Given how TF tries to do things, that problem makes sense. Of course TF is going to blow up if you've declared an input that doesn't exist yet. But Terragrunt's "hey we can make your life easier!" is dependent on a dumbass workaround to control for how TF tries to do things. And since modules are a pretty widely agreed upon best practice, it's just a really weird flex to say "hey, we make your code DRY!" when they actually mean "well, we make your code DRY-er than if you copy pasted a bajillion lines over and over again. Instead you will only have to type a million lines". And that's great! It's progress! But it's, uh, not a tool that I would hold up as being a really well functioning element of a powerful toolchain that can accomplish things. It's more like an incrementally more useful thing to use as a wrapper that isn't nearly as great as it claims to be and comes with its own tradeoffs.

And from what little I know about Ansible and j2 templates, they are way more powerful. I've read multiple people say that having ansible bootstrap terraform is the right way to do things. That may or may not be the answer to the question that 12 rats posted.

Happiness Commando fucked around with this message at 16:00 on Dec 12, 2020

deedee megadoodoo
Sep 28, 2000
Two roads diverged in a wood, and I, I took the one to Flavortown, and that has made all the difference.


Using ansible to do anything beyond its stated purpose is hell. We currently use ansible to orchestrate cloudformation because someone before my time thought that was a good idea. They were wrong. It is a loving mess.

Granted, cloudformation is a pile of poo poo. And if you’re literally only using the jinja2 templates from ansible to do some very simple templating then I could see it working but even then there is probably a better way.

I love ansible. But only for the purpose of config management.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Terragrunt has been fine for us. It has essentially all the functions of base Terraform built in, plus a few new ones that hook into the Terragrunt lifecycle. We used Terragrunt to replace about 2500 lines of Makefiles across 190+ TF state files with way, way too many modules, a layout that made the "can we just do a single plan across all of this?" workflow we wanted a real pain. It's Good Enough if you don't have the time to bikeshed through a different workflow and want to punt it to someone who's spent more time on the issue because your team mostly doesn't care.

I'm not sure there's any reason to add in Ansible or much more. If you need a lot of orchestration and events and hooks and all that kind of stuff, your problem is kind of inverted - it should be more "how do you fit Terraform into X" rather than "what would you choose to wrap around and orchestrate Terraform". Unfortunately for us, we use freakin' Jenkins as an orchestration tool, but I'm rewriting it all under Salt orchestration because we already use Salt for configuration management and to perform a lot of chores in our mutable section of infrastructure (read: all the parts that actually make us money).

12 rats tied together
Sep 7, 2006

Blinkz0rz posted:

Do you work at red hat or something? I can't think of a single post you've made that hasn't pushed ansible as a catch-all solution for automation. It's not a bad tool but if someone asks how to scaffold a terraform project the answer isn't some entirely unrelated tool, jfc

I do not and would not work for red hat but I have been using ansible at work for the past 7 years. As a couple other posters have mentioned, it's not an entirely unrelated tool, it being pitched as config management is purely a post-acquisition piece of marketing that you shouldn't take too seriously. Red hat wants to sell Tower licenses and support and to do that they're pitching it the best way they know how.

It's an orchestration tool, not a config management tool. You can absolutely use it to orchestrate your terraform just like you can use it to orchestrate everything else.

deedee megadoodoo posted:


Granted, cloudformation is a pile of poo poo. And if you’re literally only using the jinja2 templates from ansible to do some very simple templating then I could see it working but even then there is probably a better way.
I think you'd be surprised, there is no better way to orchestrate cloudformation that I've come across since like 2015 or so. Cloudformation by itself is definitely an awful tool, but using ansible to drive it is a top tier workflow in the "declarative cloud provider api" space.
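The driving part is roughly one task with the cloudformation module (namespaced amazon.aws.cloudformation in recent collections; stack name and template paths here are placeholders):

```yaml
- hosts: localhost
  tasks:
    - name: create or update the stack, rendering the template with jinja2 first
      amazon.aws.cloudformation:
        stack_name: myapp-network           # hypothetical
        state: present
        template_body: "{{ lookup('template', 'stacks/network.yaml.j2') }}"
        template_parameters:
          Environment: prod
      register: network_stack

    - name: stack outputs are now plain ansible data, usable by later tasks
      debug:
        var: network_stack.stack_outputs
```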

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
I think we should also differentiate between workflow and orchestration. Something like Airflow is purely a workflow tool that has DAGs of processing pipelines pushing out stuff constantly. Orchestration kinda by definition requires being able to coordinate between disparate groups of stuff.

Also, for Cloudformation at my last gig I was totally fine using Sceptre and nothing more. Just write my Troposphere code or use Jinja templates to generate YAML and lint it and all that stuff. With different lifecycle hooks it met every need I could think of in our CI/CD pipeline.

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

12 rats tied together posted:

I do not and would not work for red hat but I have been using ansible at work for the past 7 years. As a couple other posters have mentioned, it's not an entirely unrelated tool, it being pitched as config management is purely a post-acquisition piece of marketing that you shouldn't take too seriously. Red hat wants to sell Tower licenses and support and to do that they're pitching it the best way they know how.

It's an orchestration tool, not a config management tool. You can absolutely use it to orchestrate your terraform just like you can use it to orchestrate everything else.

It's absolutely a config management tool that was written as an alternative to chef and puppet and saying anything else is just revisionist history.

quote:

I think you'd be surprised, there is no better way to orchestrate cloudformation that I've come across since like 2015 or so. Cloudformation by itself is definitely an awful tool but using ansible to drive is a top tier workflow in the "declarative cloud provider api" space.

Cdk is far better than templating cloudformation yaml or json via ansible and jinja

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
The primary reason for Ansible existing was Michael DeHaan hated having to configure some master server and agent and all that to run some stuff on servers, and wanted to use ssh for transport to get something up and running fast. Stuff like coordinating across multiple groups of servers, accelerated mode, pull mode, and local execution modes were all done as consequences of that design choice. Meanwhile I’m sitting around with Saltstack not really caring about how crappy most enterprise environments’ ssh setups are, the kind that once took me 100+ hours to get a decent inventory set up. At a previous team I literally had an FTE going around fixing our horrifically managed ssh setup for 3 weeks across 6k+ hosts. We set up Saltstack in one afternoon and never looked back at attempting to log in with ssh again.

12 rats tied together
Sep 7, 2006

The very earliest versions of both the project readme and website explicitly call it out as a task executor and multi node orchestration tool. These things should have commit dates sometime in 2012-2013 back when this stuff was not talked about nearly as widely as it is today.

Certainly config management is a first class workflow in the tool but to say that ansible is explicitly a chef/puppet alternative is delusional. It's an entirely different mode of operation which is usable for the full set of tasks that you might use chef/puppet for, but that's only a single dimension of it.

I can't synthesize 7 years of github discussion for you without writing a small novel, so you're just going to have to trust me on this one -- the only people who watched ansible release, the community start to form, and the features start to roll in that reacted with "this is an alternative to puppet" were simpletons.

CDK/Pulumi are cool but unfortunately still a subset of features from a sufficiently robust ansible workflow. Writing cloudformation documents is actually the easy part, the hard part is stringing together static and dynamic data sources and then running the right things in the right order. How do we scale out to a new region? What happens when we lose a node? How do we translate between cloudformation and azure arm? How do we configure our end of the AWS Direct Connect that only sometimes exists and, when it does, needs to be configured on a different device each time?

You can learn and then string together 5 or 6 different tools for this or you can just pip install ansible and get to work. I hope this helps illustrate why I bring it up in every single IT post I make.

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

12 rats tied together posted:

You can learn and then string together 5 or 6 different tools for this or you can just pip install ansible and get to work. I hope this helps illustrate why I bring it up in every single IT post I make.

You learned how to use a pocket knife and now, to you, every problem can be solved with one regardless of whether there's a more appropriate tool.

12 rats tied together
Sep 7, 2006

I've been trying to think of an appropriate tool metaphor for a while and I can't find one that works. The orchestration metaphor does seem to be the most correct: the players, instruments, and sheet music that comprise an orchestra change over time and between pieces. Regardless of their exact composition, they all benefit from a person who drives. Following the metaphor, there is also both a size at which a single orchestrator becomes insufficient, and a complexity at which a player's ability to follow instructions becomes insufficient. I see a lot of people switch to a "choreography" metaphor to describe these situations which I like a lot.

The pocket knife comparison is unfortunately backwards -- terraform's null_resource -> local_exec is a good example of what you're saying. I've also had to shoot down the idea of "a chef node that runs aws api commands on convergence" at work which is another instance of this pattern. Pushing ansible for every issue is IMHO the opposite, instead of using a pocket knife to solve band saw problems, it's suggesting that we build an entire manufacturing silo that produces output in the desired shape.

The Fool
Oct 16, 2003


Ok, my question was pretty vague and I think some of you were searching for an xy problem.

The team that I’m on is building an infrastructure pipeline workflow so that our app teams can just say “I want resource1, resource2, and it needs to be load balanced” and our tools take that information and builds out all the required infrastructure to make it work, the nsg’s, the storage accounts, makes sure asp’s are in the right ase’s and a bunch of other stuff. It also enables easy promotion from dev to load testing to prod.

Right now the app teams interact with this by writing terraform using modules that we built, which when checked in trigger azure devops pipelines and tfe.

This is creating a heavier support burden for teams that are less familiar with tf, and we are having to troubleshoot and help them deploy their environments.

My idea was to explore the possibility of having the app teams use a scaffolding/code generation tool that asks them a couple questions then generates a base folder structure and tf files that would deploy what they need based on some common design patterns.

Mostly inspired by web dev tools like create-react-app and django.

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

The Fool posted:

Ok, my question was pretty vague and I think some of you were searching for an xy problem.

The team that I’m on is building an infrastructure pipeline workflow so that our app teams can just say “I want resource1, resource2, and it needs to be load balanced” and our tools take that information and builds out all the required infrastructure to make it work, the nsg’s, the storage accounts, makes sure asp’s are in the right ase’s and a bunch of other stuff. It also enables easy promotion from dev to load testing to prod.

Right now the app teams interact with this by writing terraform using modules that we built, which when checked in trigger azure devops pipelines and tfe.

This is creating a heavier support burden for teams that are less familiar with tf, and we are having to troubleshoot and help them deploy their environments.

My idea was to explore the possibility of having the app teams use a scaffolding/code generation tool that asks them a couple questions then generates a base folder structure and tf files that would deploy what they need based on some common design patterns.

Mostly inspired by web dev tools like create-react-app and django.

I got what you were asking. Our infra team generally delegates ownership to dev teams so most of the team TF repos are kinda choose your own adventure wrt organization. However, at the account management level (i.e. for organization config or idp config), they use cookiecutter to generate a well-defined structure. I bet you could do something similar as a way to let teams bootstrap their setups.

Docjowles
Apr 9, 2009

necrobobsledder posted:

The primary reason for Ansible existing was Michael DeHaan hated having to configure some master server and agent and all that to run some stuff on servers, and wanted to use ssh for transport to get something up and running fast. Stuff like coordinating across multiple groups of servers, accelerated mode, pull mode, and local execution modes were all done as consequences of that design choice. Meanwhile I’m sitting around with Saltstack not really caring about how crappy most enterprise environments’ ssh setups are, the kind that once took me 100+ hours to get a decent inventory set up. At a previous team I literally had an FTE going around fixing our horrifically managed ssh setup for 3 weeks across 6k+ hosts. We set up Saltstack in one afternoon and never looked back at attempting to log in with ssh again.

Saltstack is criminally underrated imo. By the time it became mature Puppet and Chef had already sucked up all the oxygen in the space and it never gained traction. But I really enjoyed using it at a past job. The config management part wasn’t even really the primary benefit, as you said the ability to easily target and run operations against arbitrary groups of servers was super nice.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
the reason I think that config management tools all suck is because they only ever solve one out of the two problems they are supposed to.

one problem is orchestration, when you need control over how and when something happens in your infrastructure which is not going to be immutable. this is something that ansible provides.

the other problem is immutable/stateless/unorchestrated configuration which is provided by tools like chef.

as an operator sometimes i need one or the other, but all the tools I know of will only give me one and, reasonably, most groups only want to use one tool. i think it’s easier to try and pretend you have immutability in ansible than it is to shoehorn orchestration into chef, but I would love a tool that could give me both!

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
The whole orchestration / configuration management space is a clusterfuck, and the problems I ran into when I worked for a start-up that wound up being the inspiration for Rundeck (for better or worse) are not going to be solved by some clever engineers, given the millions of man hours poured into the horrible nature of systems administration and the lack of fucks given by overworked developers. No amount of clever tooling can fix software architectures that are broken by design, which is the default state of software outside well run engineering organizations and open source projects with tons of eyes on them. In the easy, simple case of face-rolling servers or rolling them back it doesn't matter what you pick - that's the for loop of orchestration. It's not a surprise to me that people have gotten more results by using containers and stateless software wherever possible as the primary deliverable artifact from developers, with the whole 12F approach and all.

Salt's language idioms for orchestration and configuration are pretty similar to Ansible. The difference is that you need to have constructs like events and external state actors to have a One-Deployer-to-Rule-Them-All approach to a lot of systems and it's not intuitive how to write a semaphore or break out into an imperative subroutine to push along system state to something sane before rolling back several checkpoints. And events in Salt are built-in and you can watch events on the bus pretty easily and write some pretty drat responsive deployment logic over the bus. I couldn't find anything like that with Chef or Ansible before and had to clumsily write some random-rear end sentinel values to Redis and hope and pray that the random-rear end server out there that ran a play could write back to the Redis server during the run. Salt's control plane is legit independent of your infrastructure being managed and I can just point people to the docs instead of reinventing more wheels that suck by default.

The whole problem with convergent digraphs like Puppet and Chef is that they slow down like crazy with complex graphs that can modify themselves, and there's no guarantee that they'll converge in one shot either. There have been plenty of situations I've seen in any such system where multiple runs were necessary due to late binding or non-deterministic variable resolution, which also happens in Terraform - another flawed DSL, given HCL's heritage in Puppet's DSL.

LochNessMonster
Feb 3, 2005

I need about three fitty


Speaking of ansible, I’ve been trying to do something that feels like a common and simple use case but I’m unable to figure it out.

I’m running a playbook to update some Elasticsearch servers and need to restart the service. Since ES doesn’t like it if all nodes get restarted all together I’d like to do a rolling restart. ES takes some time to start properly after systemd reports that the service has been successfully started. Unfortunately ansible moves on to the next node when that happens. To prevent having multiple nodes down at the same time I want to do a health check before moving on to restarting the next node.

At first I tried to use include_tasks as a handler for restarting and used a separate file that included a block of tasks with throttle, but you can’t throttle a block.

An alternative is running the playbook in serial but that’s really only necessary for the restart of the service. I guess that can work now but when we grow the cluster it’ll scale poorly and take quite some time.

Any ideas on how to loop over 2 specific tasks together?

tortilla_chip
Jun 13, 2007

k-partite
serial supports percentages of a group, is that sufficient?
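For reference, serial also takes a list that can mix counts and percentages, so you can do a canary-then-batches rollout even on a small cluster (group name is hypothetical):

```yaml
- name: rolling restart in growing batches
  hosts: es_cluster
  serial:
    - 1          # canary: one node first
    - "25%"
    - "100%"     # remainder once the earlier batches look healthy
  tasks:
    - name: restart elasticsearch
      service:
        name: elasticsearch
        state: restarted
```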

LochNessMonster
Feb 3, 2005

I need about three fitty


tortilla_chip posted:

serial supports percentages of a group, is that sufficient?

Not right now, cluster size is still too small. Might work when we scale up in Q2/Q3 2021.

I figured this would be fairly simple, but maybe I should just run it as serial: 1 and move to percentages later.

tortilla_chip
Jun 13, 2007

k-partite
include_role sounds like it might also work for your use case, it's been a minute since I've done flow control logic in Ansible (and honestly one of its shortcomings in my experience).

https://docs.ansible.com/ansible/latest/collections/ansible/builtin/include_role_module.html#examples

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
i haven't done much complex stuff in ansible but couldn't you curl the es health endpoint as a blocking operation until it reports ready and then continue to the next task?

we did something similar with our chef cookbooks around ensuring consul availability. i think we had to ultimately write a little bit of ruby to do it but it wasn't particularly painful.

if percentage waits until playbook completion for a given node you should be fine as long as your percentage groups are small enough that you're not negatively impacting the cluster.

xzzy
Mar 5, 2009

Blinkz0rz posted:

i haven't done much complex stuff in ansible but couldn't you curl the es health endpoint as a blocking operation until it reports ready and then continue to the next task?

That's how I did it. Well, not with ansible, but that was our approach. We set up a job that checked the cluster was green before running any update/restart.

LochNessMonster
Feb 3, 2005

I need about three fitty


Blinkz0rz posted:

i haven't done much complex stuff in ansible but couldn't you curl the es health endpoint as a blocking operation until it reports ready and then continue to the next task?

we did something similar with our chef cookbooks around ensuring consul availability. i think we had to ultimately write a little bit of ruby to do it but it wasn't particularly painful.

if percentage waits until playbook completion for a given node you should be fine as long as your percentage groups are small enough that you're not negatively impacting the cluster.

The problem is that task 1 restarts the service and task 2 is the health check. No matter which solution I’m trying, ansible keeps running task 1 on all nodes before doing the health check (which it should do after each node).

The restart service task is currently a handler and it didn’t work if I let handler 1 notify handler 2.

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

LochNessMonster posted:

The problem is that task 1 restarts the service and task 2 is the health check. No matter which solution I’m trying, ansible keeps running task 1 on all nodes before doing the health check (which it should do after each node).

The restart service task is currently a handler and it didn’t work if I let handler 1 notify handler 2.

this feels like something that might require a semaphore somewhere outside of the process.

another possible option, if this is in the cloud, would be to set it up as a proper asg with a status check on the es health endpoint and write some quick bash to terminate each old node, wait until the asg capacity is back to full, then terminate the next one, etc.

LochNessMonster
Feb 3, 2005

I need about three fitty


Blinkz0rz posted:

this feels like something that might require a semaphore somewhere outside of the process.

another possible option, if this is in the cloud, would be to set it up as a proper asg with a status check on the es health endpoint and write some quick bash to terminate each old node, wait until the asg capacity is back to full, then terminate the next one, etc.

It’s fully on prem unfortunately and working with ansible is considered black magic around here.

Also one of the reasons why I didn’t renew my contract with this client. There is virtually no automation, provisioning VMs can take months (I only wish I was joking) as there is no spare capacity.

It’s my last feature to deliver before focussing on some knowledge transfer and I have a strong urge to do it properly. But at this point I’m considering just running the whole thing in serial and advising them to change it to percentages in the future so I can be done with it.

LochNessMonster fucked around with this message at 17:46 on Dec 13, 2020

12 rats tied together
Sep 7, 2006

Looping on two tasks together breaks ansible's execution model and the guarantees it tries to give you about when it will do things. You can root around in the github issues for some further discussion from people with more knowledge about the task scheduler internals if you're interested, but that's my takeaway here after looking into this at $lastjob for some people.

What you're describing though, as far as I can tell, is not a situation that would need "with_together" or similar -- you have a restart task and a health check task. If the health check fails you don't want to retry the restart task, you want to fail the play or trigger some handler. You can do this with normal handlers by just including both handlers in your notify block(s):

code:
- name: manage es config files
  template:
    whatever: whatever
  notify:
    - restart elasticsearch
    - wait for elasticsearch
Keep in mind the usual warnings about handler execution order here -- that is, the handlers execute in the order that they appear in your handlers file, not the order that they are listed in the notify block.
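For reference, the matching handlers file might look something like this — the handler bodies are assumptions (ports, retry counts, and the health check against the cluster health API are all illustrative), and note the restart is deliberately listed above the wait so the file order matches what you want:

```yaml
# handlers/main.yml -- illustrative only
- name: restart elasticsearch
  service:
    name: elasticsearch
    state: restarted

- name: wait for elasticsearch
  uri:
    url: "http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=60s"
  register: es_health
  until: es_health.status == 200
  retries: 30
  delay: 10
```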

You also have a separate issue: you want the handlers to run on a particular chunk of your play at once, the feature for this is serial: 1 (or whatever) as you've noted. I'm going to guess that by "run the whole thing in serial" that you're expecting a huge speed decrease from this? If that's true, please accept this quick reminder that playbooks can contain multiple plays, which can execute with different strategies, different configurations for the same strategy, different connection mechanisms, etc.

code:
- name: do a bunch of stuff to this elasticsearch cluster
  hosts: es_cluster
  tasks:
    - name: everything that comes before the part where you need to run in serial: 1
      etc: etc

- name: swap to serial 1 to avoid taking this cluster down during updates
  hosts: es_cluster
  serial: 1
  roles:
    - role: role that might restart the elasticsearch service

- name: back to normal for the rest of this playbook
  hosts: es_cluster
  etc: etc

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

LochNessMonster posted:

It’s fully on prem unfortunately and working with ansible is considered black magic around here.

Also one of the reasons why I didn’t renew my contract with this client. There is virtually no automation, and provisioning VMs can take months (I only wish I was joking) as there is no spare capacity.

It’s my last feature to deliver before focusing on some knowledge transfer, and I have a strong urge to do it properly. But at this point I’m considering just running the whole thing in serial and advising them to change it to a percentage-based serial in the future so I can be done with it.

Honestly it might make sense to just write something quick and dirty that spins until the cluster is healthy, acquires a lock somewhere (DB, consul, etcd, redis, whatever), restarts the service, then releases the lock.

Run that on every node at the same time. It'll take a while but at least you won't have to coordinate everything manually.
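The shape of that script, with the lock backend abstracted away (a real version would swap in consul/etcd/redis sessions; everything here is illustrative):

```python
# Sketch of a "spin until healthy, take a lock, restart, release" loop.
# The lock and the health/restart actions are injected so the same logic
# works against any backend; a threading.Lock stands in for local testing.
import time

def rolling_restart(node, lock, is_healthy, restart, poll=1.0):
    """Restart one node, only while holding the shared lock.

    is_healthy() and restart() are callables supplied by the caller, e.g.
    wrappers around the ES _cluster/health endpoint and systemctl.
    """
    # Spin until the cluster is healthy before even asking for the lock.
    while not is_healthy():
        time.sleep(poll)
    with lock:
        restart(node)
        # Hold the lock until the cluster recovers, so whichever node grabs
        # the lock next starts from a healthy cluster.
        while not is_healthy():
            time.sleep(poll)
```

Run the same script on every node at once; the lock serialises the restarts, so the total runtime is roughly (restart + recovery time) times the node count, but nothing needs central coordination.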

LochNessMonster
Feb 3, 2005

I need about three fitty


12 rats tied together posted:

Looping on two tasks together breaks ansible's execution model and the guarantees it tries to give you about when it will do things. You can root around in the GitHub issues for some further discussion from people with more knowledge of the task scheduler internals if you're interested, but that's my takeaway here after looking into this at $lastjob for some people.

What you're describing though, as far as I can tell, is not a situation that would need "with_together" or similar -- you have a restart task and a health check task. If the health check fails you don't want to retry the restart task, you want to fail the play or trigger some handler. You can do this with normal handlers by just including both handlers in your notify block(s):

code:
- name: manage es config files
  template:
    whatever: whatever
  notify:
    - restart elasticsearch
    - wait for elasticsearch
Keep in mind the usual warnings about handler execution order here -- that is, the handlers execute in the order that they appear in your handlers file, not the order that they are listed in the notify block.

This looks like exactly what I was looking for. For some reason it hadn’t occurred to me to use the handlers as a list. I tried chaining them, which obviously doesn’t work.

This will still run on all servers in parallel unless I let this role run with serial: 1, right?

quote:


You also have a separate issue: you want the handlers to run on a particular chunk of your play at once, the feature for this is serial: 1 (or whatever) as you've noted. I'm going to guess that by "run the whole thing in serial" that you're expecting a huge speed decrease from this? If that's true, please accept this quick reminder that playbooks can contain multiple plays, which can execute with different strategies, different configurations for the same strategy, different connection mechanisms, etc.

code:
- name: do a bunch of stuff to this elasticsearch cluster
  hosts: es_cluster
  tasks:
    - name: everything that comes before the part where you need to run in serial: 1
      etc: etc

- name: swap to serial 1 to avoid taking this cluster down during updates
  hosts: es_cluster
  serial: 1
  roles:
    - role: role that might restart the elasticsearch service

- name: back to normal for the rest of this playbook
  hosts: es_cluster
  etc: etc

I’m indeed expecting a rather large time increase when running the complete playbook in serial.

Splitting up the playbook will need some rewriting I guess, but this should be able to get the job done.

Thank you for pointing all these things out, I’ve been struggling with this for the better part of last week and just couldn’t let it go.

12 rats tied together
Sep 7, 2006

Serial: 1 should behave as you describe but it's always worth a check_mode run to confirm :)

You may want to consider setting changed_when on the tasks that fire the handlers, so that they always fire while you're testing (and only while you're testing). It might also be the case that running ansible-playbook with the --list-tasks switch will give you a nice confirmation of what will happen re: your serial settings, without actually having to run in check_mode, but it's been a while since I've used it.
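For example, something like this forces the handlers to fire on every run while you're testing — the changed_when line is the temporary part, and the file paths are assumptions:

```yaml
- name: manage es config files
  template:
    src: elasticsearch.yml.j2
    dest: /etc/elasticsearch/elasticsearch.yml
  changed_when: true   # testing only: fire the handlers even on a no-op run
  notify:
    - restart elasticsearch
    - wait for elasticsearch
```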

12 rats tied together
Sep 7, 2006

Double post for a sidenote:

There are situations where with_together or some kind of dual looping would be nice/are actually required. Most commonly I've seen this when executing complex chains of actions that have dependencies but also arbitrary rollback points -- something like, you have task items 1 through 10. If you've done 1 through 3, you can safely restart from 3 from now on, otherwise you need to restart at 1 or perform some annoying cleanup.

Less abstract example: you might want to domain join a machine, and then bootstrap chef on it right after. If the chef bootstrap fails, you want to scratch the machine entirely and reimage, so we also need to remove it from AD.

The only really atomic thing in ansible is the task, so if you have something that you want to be operationally atomic that requires multiple tasks, you have to use a block. Blocks have some limitations (this might be better in ansible 2.7+) that sometimes make the thing you want to do impossible while still being a valid block. In this scenario IMO the best thing to do is to zoom in instead of zoom out: instead of trying to fight the playbook parser to arrange some complex behavior, write a custom module that manages whatever two (or more) things you need to do. Writing a module is really easy, you basically just:

1. Create a library/ folder in your ansible repo, alongside your playbooks (or see this config option)
2. Create a modulename.py file in this folder.
3. from ansible.module_utils.basic import AnsibleModule (also see: this docs link)
4. You can write full python here, which will be executed on the _remote_ node (if you want to write python that implicitly runs on the control host, you want an action plugin)
5. Your module is usable in ansible-playbook now so long as you invoke it by its filename.

In the situation where you do actually need to chain together these two operations, you could create a lochness_elasticsearch_restart.py file in this library folder, do like a subprocess.Popen or something in python to run bash commands, and then be able to granularly orchestrate your changes. When you instantiate AnsibleModule you get access to exit_json() and fail_json(), which you pass arbitrary keyword arguments describing what happened. You must include changed=True/False but you can add any other information you want in here. If you exit_json(), Ansible thinks your task was successful, and fail_json() does what it looks like it does.

The official developing modules docs are a good place to read but I think the best way to get started is to check out the file module and just kind of copy the concepts.
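A minimal version of that hypothetical lochness_elasticsearch_restart.py might look like the following. The service name, health check command, and retry counts are all made up, and the Ansible import is deferred into main() purely so the helper above it can be exercised without Ansible installed:

```python
# library/lochness_elasticsearch_restart.py -- illustrative sketch of a
# custom module that restarts a service and polls a health check as one
# atomic Ansible task.
import subprocess
import time

def restart_and_wait(restart, healthy, retries=5, delay=1.0):
    """Run the restart callable, then poll the health callable.

    Returns True once healthy() reports success, False if it never does.
    Callables are injected so this logic is testable without a real cluster.
    """
    restart()
    for _ in range(retries):
        if healthy():
            return True
        time.sleep(delay)
    return False

def main():
    # Imported here so restart_and_wait stays importable without Ansible.
    from ansible.module_utils.basic import AnsibleModule

    module = AnsibleModule(
        argument_spec=dict(
            service=dict(type='str', required=True),
            health_url=dict(type='str', required=True),
        ),
    )
    service = module.params['service']
    health_url = module.params['health_url']

    def restart():
        subprocess.run(['systemctl', 'restart', service], check=True)

    def healthy():
        # curl -f exits nonzero on HTTP errors, so returncode 0 means up.
        return subprocess.run(['curl', '-fsS', health_url]).returncode == 0

    if restart_and_wait(restart, healthy, retries=30, delay=2.0):
        module.exit_json(changed=True, msg='%s restarted and healthy' % service)
    else:
        module.fail_json(msg='%s did not become healthy after restart' % service)

if __name__ == '__main__':
    main()
```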

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

LochNessMonster posted:

The problem is that task 1 restarts the service and task 2 is the health check. No matter which solution I’m trying, ansible keeps running task 1 on all nodes before doing the health check (which it should do after each node).

The restart service task is currently a handler and it didn’t work if I let handler 1 notify handler 2.
Another thing you could do would involve using the free strategy (which can be paired with serial to limit concurrent execution) instead of the default linear strategy.

Pile Of Garbage
May 28, 2007



If you're using Ansible with AWX or Tower you could use Workflows.

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb
I've been using Chef for managing my personal server (leased dedicated server) which I use for a few simple things:
* Awful Yearbook (PHP, MySQL)
* Minecraft
* Mumble
* Subsonic

With the acquisition of Chef by Progress, I am looking to jump ship and migrate away from Chef.

I'd like to containerize these things but wasn't sure what I should be using to run these on my dedicated server and what that deployment process would look like. I'd like to keep it simple, and driven by git / bitbucket pipelines.

edit: Nomad is looking pretty nice. Deployment from bitbucket via ssh to nomad server running on my bare metal machine seems straightforward. I'd be fine with having the initial bootstrap steps of the remote machine being manual as long as it's simple (configure SSH and install nomad server). Also seems easy to run locally on Windows to make developing and testing easier before pushing to production

fletcher fucked around with this message at 06:00 on Dec 14, 2020


12 rats tied together
Sep 7, 2006

deedee megadoodoo posted:

Using ansible to do anything beyond its stated purpose is hell. We currently use ansible to orchestrate cloudformation because someone before my time thought that was a good idea. They were wrong. It is a loving mess.

Granted, cloudformation is a pile of poo poo. And if you’re literally only using the jinja2 templates from ansible to do some very simple templating then I could see it working but even then there is probably a better way.

Revisiting this, I'm evaluating approaches to gcp management at the new job and some of the alternatives people have come up with are absolutely deranged by comparison: https://github.com/GoogleCloudPlatform/cloud-foundation-toolkit/tree/master/dm/templates/network

necrobobsledder posted:

The whole orchestration / configuration management space is a clusterfuck [...]

Salt's language idioms for orchestration and configuration are pretty similar to Ansible's.
IMHO Salt is the existing optimum: a synthesized version of the good things from all the relevant tools, plus some extra stuff they thought up that people aren't doing yet but should be. Having an event bus is a good example of this. I was looking into trying to hook up some event-sourced ops actions at the last job, and the most workable thing I could come up with was some awful Flink CEP stuff that would have worked, but there was 0 chance we were getting the rest of SRE on board with writing event processors in Java.

I find myself in the unfortunate position where I have a lot of experience with Ansible, and a lot of people are using Ansible, so I usually end up touching more Ansible. I was briefly shopping around for Salt shops late last year and couldn't find anything interesting, so I'm back to Ansible Toucher at the new job.

I would agree 100% that Salt is the better option, though. It's possible that Ansible has some porcelain that is slightly nicer than salt orchestration just due to the many years of community development focused entirely on orchestration, but I don't see anything fundamental about Ansible that couldn't be re-implemented elsewhere.
