Docjowles
Apr 9, 2009

FWIW Terraform is, loving finally, starting to introduce some more legitimate looping constructs instead of that "count" garbage.

https://www.hashicorp.com/blog/hashicorp-terraform-0-12-preview-for-and-for-each

12 rats tied together
Sep 7, 2006

Here's my take on the cloud-infracode problem that's been working great for me for the past ~3 years or so:

Do use ARM templates. Use ansible to manage your ARM templates (https://docs.ansible.com/ansible/2.5/modules/azure_rm_deployment_module.html). Instead of specifying the template body inline like in the examples, or pointing at it with the template_link field, render it with a template lookup:

code:
[...]
    template: "{{ lookup('template', '/path/to/template.yaml.j2') }}"
[...]
Using an inline lookup here lets you take advantage of ansible's rich variable inheritance and templating mechanisms, so you get pretty much full jinja2 functionality, all of the extra jinja2 filters and plugins that ansible ships, and you can write your own by dropping python scripts in your ansible repo.
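
For illustration, the full task ends up looking something like this (parameter names are from memory of the 2.5 module docs linked above; the resource group and vars are made up, so treat it as a sketch):

code:
- name: Deploy rendered ARM template
  azure_rm_deployment:
    resource_group_name: my-resource-group        # made-up name
    location: eastus
    deployment_mode: incremental
    # the lookup renders the Jinja2 file locally, with full access to ansible
    # vars, filters, and custom plugins, before Azure ever sees the result
    template: "{{ lookup('template', '/path/to/template.yaml.j2') }}"
    parameters: "{{ arm_parameters }}"             # just another ansible var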

This lets you loop through data structures of arbitrary complexity; resolve values either from local parsing or from remote lookups (other machines or a web service); define jinja macros; use fully functional if/else/etc flow logic; and handle reasonably complex inheritance, including pulling a default data structure and extending it, combining it, or nulling out various keys or values.

And then, afterwards, you can use ansible to deploy to your windows servers too, and all of the data you used for your ARM templates is available for you in an extremely consistent manner inside your server deploy code, or the other way around, or both ways around if required.

I prefer this approach to something like Terraform for a lot of reasons, but the main one is that jinja2 has already existed for like 10 years so there's a wealth of documentation, support, and tons of edge cases and gotchas already figured out, compared to something like Terraform where mitchellh needed to be convinced that nullable keys/values were a genuine requirement for infracode tooling.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
An approach that I have mixed feelings about is using Jinja templates to generate Terraform code that is then executed by the Ansible playbook. Kind of scary but at the same time kind of makes sense.

I was kind of stupid in hindsight: because I thought using Ansible was overkill, I wrote my own system to parse, merge, and pre-process variables from yaml files, which took maybe a couple days of effort. At this point I should probably drop it all in favor of using Ansible anyway, because at least Ansible's awkwardness is documented and I don't need to make a Python package just to use it. Now that we're moving to the Salt ecosystem I'll still get some value from using Jinja templates for configuration management, but seeing {{ pillar['foo'] }} vs {{ foo.bar }} can get confusing. What makes things even more awkward is that we're also using Sceptre, which also uses yaml-based configuration, so basically we'll have three different Jinja template drivers generating completely different kinds of YAML. Once I get Jenkins builds converted from Jenkinsfiles to Jenkins Job Builder, we'll have yet more Jinja templates generating yaml.

It's sounding like an Xzibit meme itching to happen when we start chasing K8s here and evaluate Helm.

Still no test coverage worth a drat :(

12 rats tied together
Sep 7, 2006

Yeah you can definitely replace just about all of that with just ansible, even helm (sort of), and even jenkins (sort of).

If you've read the "choose boring technology" blog article, embracing ansible gets you reaaaally close to the 2nd version of the Problems -> Technical Solutions image just by itself. It's an extremely versatile tool and I consider it about as useful and indispensable as a bash shell.

I considered doing the ansible -> j2ed terraform thing in my current role, but decided that everything I'd need to do in Terraform I would be much better served doing in CloudFormation anyway, so there wasn't a lot of value in it. If you're interacting with a vmware thing though, like the article you linked, it does seem like templating terraform is one of your only options at the moment.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
In my defense, half the features I needed would have been Ansible modules that didn't exist at the time (some stuff I wanted only landed in 2.5 and 2.6) and would have required me to write a fair bit of code, and on a lines-of-code basis I wrote far fewer lines of Python than if I had turned it into a playbook. Basically, from the very same article, we went with "Just Ship" as the more important principle because we're collectively awful at Ansible where I am. Management and other engineers signed off on my approach and we agreed that trying to hack around Ansible was an unacceptable time risk while we're on a very hard deadline to deliver a project.

The issue with "Just Ship" is that it's resulted in a ton of difficult-to-refactor systems and things are getting harder to "Just Ship" now because of the mentality.

Gyshall
Feb 24, 2009

Had a couple of drinks.
Saw a couple of things.

Docjowles posted:

FWIW Terraform is, loving finally, starting to introduce some more legitimate looping constructs instead of that "count" garbage.

https://www.hashicorp.com/blog/hashicorp-terraform-0-12-preview-for-and-for-each

I look forward to this being fully documented in Terraform 0.76.

Docjowles
Apr 9, 2009

Gyshall posted:

I look forward to this being fully documented in Terraform 0.76.

I'm enjoying that it was announced in July and still has not been released :allears: Not to mention all the caveats in the post about how the initial feature will be very limited and will grow over time.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Docjowles posted:

I'm enjoying that it was announced in July and still has not been released :allears: Not to mention all the caveats in the post about how the initial feature will be very limited and will grow over time.
They just need to close out a few more issues first.

Bhodi
Dec 9, 2007

Oh, it's just a cat.
Pillbug
Vault is such a solid and well put together product, I didn't know Terraform had so many issues.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Terraform's also a solid and well-put-together product—we manage something like 5,000 compute instances with it and it's mostly been painless since the 0.9 release—but it does so many things that everyone has thirty feature requests and a bug story to tell

Hadlock
Nov 9, 2004

When I came into my latest job, a contractor had written all our aws stuff in terraform 0.7.0 or whatever, and absolutely none of it worked in 0.8.x when I arrived and had to work with it. To be fair one should expect breaking changes in a sub-1.0 release, but... it's been out for a very long time at this point and I was surprised that our pretty vanilla code had all broken so horribly. Will probably look at terraform again when they finally hit 1.0. Most of our critical aws infra is managed by kops anyways.

I prefer etcd to vault but everything in vault is fully cryptographically secure by default so we are using vault. We did run into a problem this summer where a runaway process consumed all the disk, and vault wrote an incomplete entry to consul which broke everything until I manually played Russian roulette deleting the suspect keys (it was an encrypted vault lease being written); after that it was able to boot.

Now we just take snapshots of vault every ~8 hours and the plan is to roll back to the last known good snapshot and pray. Because it's vault, the snapshots are encrypted at rest out of the box.

Recently someone told me about vault agent, which is some sort of built-in templating feature; I'm interested in checking that out.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Hadlock posted:

When I came into my latest job, a contractor had written all our aws stuff in terraform 0.7.0 or whatever, and absolutely none of it worked in 0.8.x when I arrived and had to work with it. To be fair one should expect breaking changes in a sub-1.0 release, but... it's been out for a very long time at this point and I was surprised that our pretty vanilla code had all broken so horribly. Will probably look at terraform again when they finally hit 1.0. Most of our critical aws infra is managed by kops anyways.

I prefer etcd to vault but everything in vault is fully cryptographically secure by default so we are using vault. We did run into a problem this summer where a runaway process consumed all the disk, and vault wrote an incomplete entry to consul which broke everything until I manually played Russian roulette deleting the suspect keys (it was an encrypted vault lease being written); after that it was able to boot.

Now we just take snapshots of vault every ~8 hours and the plan is to roll back to the last known good snapshot and pray. Because it's vault, the snapshots are encrypted at rest out of the box.

Recently someone told me about vault agent, which is some sort of built-in templating feature; I'm interested in checking that out.
0.10 was a pretty important release for Terraform, because that's the one that broke all the providers out, versioned them independently of Terraform core, and allowed you to lock which provider versions are used for managing a project/workspace. Breakage has been much less common since, and you can go as far as upgrading one provider at a time.
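
If you haven't seen it, the pinning looks roughly like this in the 0.10/0.11-era syntax (provider and version number are just examples):

code:
provider "aws" {
  # lock the provider so "terraform init" won't silently grab a newer one
  version = "~> 1.41"
  region  = "us-east-1"
}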

Hadlock
Nov 9, 2004

Vulture Culture posted:

0.10 was a pretty important release for Terraform, because that's the one that broke all the providers out, versioned them independently of Terraform core, and allowed you to lock which provider versions are used for managing a project/workspace. Breakage has been much less common since, and you can go as far as upgrading one provider at a time.

Ah that's good to know. Sounds more principled for sure. Will have to take another look soon then.

Another comparison: we had been on Vault 0.7.0, ended up upgrading to vault 0.10.0, and now we are on 0.11.0 with no breaking changes... Ah, and it looks like 1.0.0 beta has just been released. Pretty pleased with that progression, which is more like what I was expecting out of terraform.

Docjowles
Apr 9, 2009

I wouldn't hold my breath for Terraform to hit 1.0. It's been developing at an absolutely glacial pace lately.

freeasinbeer
Mar 26, 2015

by Fluffdaddy
I’m honestly moving away from it and more into pure kubespecs. I don’t see the need for it as much.

12 rats tied together
Sep 7, 2006

Vulture Culture posted:

Terraform's also a solid and well-put-together product—we manage something like 5,000 compute instances with it and it's mostly been painless since the 0.9 release—but it does so many things that everyone has thirty feature requests and a bug story to tell

In my experience Terraform doesn't really fall apart until you have to start managing several disparate environments for it, and for that something like provider + provider region count is usually a more useful measure of "how much is Terraform ruining my day to day life right now".

It's an absolutely fantastic product if you just need to configure 5000 compute instances. var.count plus the splat operator is actually pretty clever and intuitive (generally). Once you start needing to configure vpcs, subnets, regions, multiple cloud accounts, anything with different authentication mechanisms, stuff that might exist sometimes and other times might not exist, it turns into a nightmare shitshow almost instantly.

The combination of "works great for things that are simple" plus "is probably worse than clicking buttons in the ui" once you want to build a single reusable abstraction makes it a particularly dangerous tool IMO. It's like they intentionally built a tool for executable blog articles.

e: I think if you're really excited about for loops in terraform (I am not), you should consider that basically what you want to do is document templating, and that document templating languages have existed for a long time and you definitely do not need to wait for terraform x.y.z to solve this problem for yourself.

12 rats tied together fucked around with this message at 20:21 on Oct 26, 2018

12 rats tied together
Sep 7, 2006

necrobobsledder posted:

In my defense,

I'm sorry, I didn't mean to imply that you made a poor choice or anything. There are tons of great reasons to not use ansible, "nobody here knows ansible" is definitely one of them. Because the tool is so easy to use, I often find myself getting into tit-for-tat slack threads where basically all I do is say "you can actually do that in ansible already: <docs link>".

I'll have someone link me a blog article where the author just did not notice that the limitation they are complaining about is directly addressed by a core feature, or I'll have someone chime in and be like "well at my last job we used ansible from an employee laptop to copy a 1 GB file across the internet 3000 times to 3000 servers and it was slow, therefore, ansible is a bad tool for deploying software".

Since I've been spending a lot of time doing that lately, it is kind of my default reaction to a technical problem description even though it's usually not appropriate.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

12 rats tied together posted:

In my experience Terraform doesn't really fall apart until you have to start managing several disparate environments for it, and for that something like provider + provider region count is usually a more useful measure of "how much is Terraform ruining my day to day life right now".

It's an absolutely fantastic product if you just need to configure 5000 compute instances. var.count plus the splat operator is actually pretty clever and intuitive (generally). Once you start needing to configure vpcs, subnets, regions, multiple cloud accounts, anything with different authentication mechanisms, stuff that might exist sometimes and other times might not exist, it turns into a nightmare shitshow almost instantly.

The combination of "works great for things that are simple" plus "is probably worse than clicking buttons in the ui" once you want to build a single reusable abstraction makes it a particularly dangerous tool IMO. It's like they intentionally built a tool for executable blog articles.

e: I think if you're really excited about for loops in terraform (I am not), you should consider that basically what you want to do is document templating, and that document templating languages have existed for a long time and you definitely do not need to wait for terraform x.y.z to solve this problem for yourself.
I've found it to not actually be bad for any of these use cases, with judicious use of per-environment projects and clear separation of modules, but the documentation does make it incredibly obtuse how you should structure this and split code between projects so as to not blow your foot off with a sawed-off shotgun. terraform import has been great, but still isn't supported on enough resources and can be mind-bogglingly annoying to use on providers with (IMO) hostile APIs like AWS.

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
no tool can cleanly abstract the architectural landfill heap produced by the standard corporate model of "a gaggle of people changing their mind every two weeks" over a meaningful time horizon

Bhodi
Dec 9, 2007

Oh, it's just a cat.
Pillbug

Vulture Culture posted:

I've found it to not actually be bad for any of these use cases, with judicious use of per-environment projects and clear separation of modules, but the documentation does make it incredibly obtuse how you should structure this and split code between projects so as to not blow your foot off with a sawed-off shotgun. terraform import has been great, but still isn't supported on enough resources and can be mind-bogglingly annoying to use on providers with (IMO) hostile APIs like AWS.
out of curiosity, before I implement things the wrong way, any blogs or guides on doing this? I have to deal with multiple nearly airgapped environments (we can sync git, but only as a release process with a zipfile, and no syncing of environmental state data in/out because it contains hostnames). I'm trying to come up with decent tools and processes to handle promotion from dev/qa to higher environments, and the fact that a real live person has to perform the sync really puts a damper on your traditional commit-build-test-tag-deploy framework.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Bhodi posted:

out of curiosity, before I implement things the wrong way, any blogs or guides on doing this? I have to deal with multiple nearly airgapped environments (we can sync git, but only as a release process with a zipfile, and no syncing of environmental state data in/out because it contains hostnames). I'm trying to come up with decent tools and processes to handle promotion from dev/qa to higher environments, and the fact that a real live person has to perform the sync really puts a damper on your traditional commit-build-test-tag-deploy framework.
First, I'm going to challenge "no syncing of environmental state data in/out". You can use Consul, etcd, any S3-compatible datastore (e.g. Ceph or OpenStack Swift), local Artifactory, or a REST endpoint to handle first-party storage of the state information. Any of these options will work fine in an airgapped configuration. If you're doing cloud in an airgapped environment, I assume you're running in an OpenStack environment, so you should be able to just use whatever you're currently running for object storage.

(e: unless I misread you as "there's no conceivable way to share state data from one environment to another", in which case it sounds like you have very little managed infrastructure to share anyway.)
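
As a rough sketch of what that looks like against an internal S3-compatible store (bucket and endpoint are made up; argument names are what I remember from the 0.11-era s3 backend docs):

code:
terraform {
  backend "s3" {
    bucket   = "tf-state"
    key      = "dev/network/terraform.tfstate"
    region   = "us-east-1"                         # mostly ignored by S3 clones, still required
    endpoint = "https://objects.internal.example"  # Ceph radosgw / Swift S3 API, etc.
    skip_credentials_validation = true
  }
}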

I have a few basic guiding principles for managing our Terraform configurations, which can be fairly complex at times:

  • Terraform remote state isn't just for synchronizing the state of a project between multiple computers or developers. You can also import that remote state with the terraform_remote_state data source, which allows you to abstract out even derived pieces of configuration data in a way where it can be shared between multiple projects. As an example: if you want multiple dev environments sharing common infrastructure in a common VPC, use a module to instantiate it, then export the IDs of important resources via remote state. You can import those IDs in your environment-specific projects.
  • As a corollary to the above, migrating resources between projects is still a work in progress [disaster] in Terraform, so it's better to add too many layers than too few.
  • Resist the temptation to build everything into Terraform. It shouldn't be a God Object for your infrastructure any more than Puppet or any other declarative configuration management tool. Tie into other tools everywhere they're appropriate.
  • Your projects may end up encapsulating thousands or tens of thousands of resources. The state of each resource needs to be queried every time you create a Terraform plan. Consider this when you figure out where the boundaries between projects should lie.
  • Modules are your friend. Use them freely, but don't try to stretch them too far or make their logic too complicated. If you need anything more complicated than count driving your conditional logic, make a different module. A little bit of copying and pasting is way better than a series of janky, completely broken abstractions.
  • Keep your module hierarchies flat. Include as many modules as you want from your project, but don't import modules from other modules. Pass data back down from your modules via exported variables, aggregate it together in your top-level project files, and pass that data to higher-level modules from there. You'll be much happier. Your code will also be much more composable.
  • Terraform has a Workspaces configuration, which used to be Environments. I've never used it. I use a separate project per environment to keep things completely, totally unambiguous and limit opportunities for pilot error.
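
To make the first bullet concrete, the pattern is roughly this (names invented, 0.11-style interpolation):

code:
# in the shared-infra project: export the IDs you want downstream
output "vpc_id" {
  value = "${module.shared_vpc.vpc_id}"
}

# in an environment project: pull them back in via the data source
data "terraform_remote_state" "shared" {
  backend = "s3"

  config {
    bucket = "tf-state"
    key    = "shared/terraform.tfstate"
  }
}

resource "aws_subnet" "dev" {
  vpc_id     = "${data.terraform_remote_state.shared.vpc_id}"
  cidr_block = "10.20.1.0/24"
}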

StabbinHobo posted:

no tool can cleanly abstract the architectural landfill heap produced by the standard corporate model of "a gaggle of people changing their mind every two weeks" over a meaningful time horizon
For sure. In environments that aren't legally bound up by regulatory compliance requirements, it's sanest to push responsibility as close to the application owners as possible. You own it, you run it. It's your department's AWS account, do whatever you want. Use whatever tools, or no tools.

Vulture Culture fucked around with this message at 20:59 on Oct 27, 2018

Bhodi
Dec 9, 2007

Oh, it's just a cat.
Pillbug

Vulture Culture posted:

First, I'm going to challenge "no syncing of environmental state data in/out". You can use Consul, etcd, any S3-compatible datastore (e.g. Ceph or OpenStack Swift), local Artifactory, or a REST endpoint to handle first-party storage of the state information. Any of these options will work fine in an airgapped configuration. If you're doing cloud in an airgapped environment, I assume you're running in an OpenStack environment, so you should be able to just use whatever you're currently running for object storage.

(e: unless I misread you as "there's no conceivable way to share state data from one environment to another", in which case it sounds like you have very little managed infrastructure to share anyway.)

I have a few basic guiding principles for managing our Terraform configurations, which can be fairly complex at times:

  • Terraform remote state isn't just for synchronizing the state of a project between multiple computers or developers. You can also import that remote state with the terraform_remote_state data source, which allows you to abstract out even derived pieces of configuration data in a way where it can be shared between multiple projects. As an example: if you want multiple dev environments sharing common infrastructure in a common VPC, use a module to instantiate it, then export the IDs of important resources via remote state. You can import those IDs in your environment-specific projects.
  • As a corollary to the above, migrating resources between projects is still a work in progress [disaster] in Terraform, so it's better to add too many layers than too few.
  • Resist the temptation to build everything into Terraform. It shouldn't be a God Object for your infrastructure any more than Puppet or any other declarative configuration management tool. Tie into other tools everywhere they're appropriate.
  • Your projects may end up encapsulating thousands or tens of thousands of resources. The state of each resource needs to be queried every time you create a Terraform plan. Consider this when you figure out where the boundaries between projects should lie.
  • Modules are your friend. Use them freely, but don't try to stretch them too far or make their logic too complicated. If you need anything more complicated than count driving your conditional logic, make a different module. A little bit of copying and pasting is way better than a series of janky, completely broken abstractions.
  • Keep your module hierarchies flat. Include as many modules as you want from your project, but don't import modules from other modules. Pass data back down from your modules via exported variables, aggregate it together in your top-level project files, and pass that data to higher-level modules from there. You'll be much happier. Your code will also be much more composable.
  • Terraform has a Workspaces configuration, which used to be Environments. I've never used it. I use a separate project per environment to keep things completely, totally unambiguous and limit opportunities for pilot error.

For sure. In environments that aren't legally bound up by regulatory compliance requirements, it's sanest to push responsibility as close to the application owners as possible. You own it, you run it. It's your department's AWS account, do whatever you want. Use whatever tools, or no tools.
The airgap is a security control, not a technical one. The only way data crosses boundaries is through very specific channels, with nothing automated. It kinda sucks. We push a "release" bundle of git code, it goes through our security dept for examination, and then they place it in the git repo of the requested environment for us. We're on various vpcs within aws govcloud, running our own kubernetes in ec2 (eks not offered yet), each site having separate accounts.

From what I was reading, I think I can use the same terraform config in our sync'd git repo to manage multiple environments with remote state, but pointing to a different s3 bucket per site, and then workspaces to handle the prod/stage breakout within the site. Alternately, we could break out the different prod/stage environments within sites into their own modules. Then again, limiting opportunities for pilot error is fairly important. We were worried that having a separate project per application would clutter up our repo too much, but that may be a better way to go, except that then there's no easy way to track overall what terraform has deployed, and if someone wants to make a one-off box for testing, they need to fork the primary project repo and make their own just for one box - seems a little wasteful, and annoying.

Because of the sync issue, in that we have to release and package each repo as a zip separately, we'd really prefer a single repo but with different terraform config files within it that you specify at runtime - workspaces would probably solve that, except that we can't have workspaces be both the prod/stage differentiation and also the app, unless we want to name the workspaces app-stage and app-prod, which is uh. no.

I also need to find a way of giving terraform derived variables for the vpcs and such to create hosts, and what makes sense in this context. I think modules are likely the answer here. Fortunately, we're really only looking at terraform to create random one-off machines for apps that don't belong in containers, and we're using chef to manage apps, so we're using terraform strictly as our instance deployment manager.

The hierarchy thing is a good thing to note, definitely. I can see how this can get complex, fast.

Bhodi fucked around with this message at 21:55 on Oct 27, 2018

JehovahsWetness
Dec 9, 2005

bang that shit retarded

Bhodi posted:

I also need to find a way of giving terraform derived variables for the vpcs and such to create hosts, and what makes sense in this context. I think modules are likely the answer here. Fortunately, we're really only looking at terraform to create random one-off machines for apps that don't belong in containers, and we're using chef to manage apps, so we're using terraform strictly as our instance deployment manager.

We kept a separate variable file per environment and just passed them in w/ the `-var-file` flag since we had environments in different VPCs, etc. It's a simple format to generate, so any script could spit it out?
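
Concretely it's just something like this (filenames made up):

code:
# one tfvars file per environment, same .tf code for all of them
$ ls env/
dev.tfvars  qa.tfvars  prod.tfvars

$ terraform plan  -var-file=env/qa.tfvars
$ terraform apply -var-file=env/qa.tfvars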

I agree that 0.10 made a real big difference with stability and most breakage I saw in the aws provider was because of dumb poo poo where AWS wouldn't return in attributes the _exact_ thing that terraform had passed in (or in the same order) so TF would always think there was a change to apply. I've contributed a couple of data sources to the AWS provider, it's a well managed/maintained project w/ decent turnaround down at the provider level.

Bob Morales
Aug 18, 2006


Just wear the fucking mask, Bob

I don't care how many people I probably infected with COVID-19 while refusing to wear a mask, my comfort is far more important than the health and safety of everyone around me!

We ended up getting TeamCity from JetBrains going and setup a few projects in it. Seems perfect.

We also have GitLab running internally.

Now, if I could get the guys to just use the loving tools...quit working locally and deploying manually. Why do you think we set this poo poo all up?

Helianthus Annuus
Feb 21, 2006

can i touch your hand
Grimey Drawer
do you have gitlab set up to trigger a pipeline for each push? devs love that poo poo

ideally they should be able to fire and forget, and get an email back saying "your poo poo is in prod now" or "lol you broke units dumbass"
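
the minimal version is a .gitlab-ci.yml in the repo along these lines (job names made up, adjust to taste):

code:
stages:
  - test
  - deploy

unit_tests:
  stage: test
  script:
    - make test        # non-zero exit fails the pipeline and the pusher gets the email

deploy_prod:
  stage: deploy
  script:
    - make deploy
  only:
    - master           # every branch push runs tests; only master deploys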

Scruff_McGee
Mar 11, 2007

Thats a purdy smile
Does anyone have opinions on the O'Reilly DevOps bundle? https://www.humblebundle.com/books/dev-ops-oreilly It is a lot of content:

    Effective DevOps
    Moving Hadoop to the Cloud
    Cloud Foundry: The Definitive Guide
    Kubernetes: Up and Running
    Linux Pocket Guide
    Cloud Native Infrastructure
    Jenkins 2: Up and Running
    Deploying to OpenShift
    Database Reliability Engineering
    Practical Monitoring
    The Site Reliability Workbook
    Seeking SRE
    AWS System Administration
    Prometheus: Up and Running
    Designing Distributed Systems

Even if only one of these is good it's probably worth the price of entry.

Docjowles
Apr 9, 2009

I bought it last night, so maybe I can post some initial thoughts before the deal expires :v:

The only one I already owned was the followup to the Google SRE book, which is good. I highly respect the authors of several others (Kelsey Hightower on the Kubernetes book, and Charity Majors on the database one, for example) so I expect they will be good, too. Seems like a great value, especially if you can get work to expense it!

jaegerx
Sep 10, 2012

Maybe this post will get me on your ignore list!


The k8s book is worth it alone. Cloud native from kris nova is great. Charity majors knows her poo poo so I assume it’s good.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Charity Majors is an abrasive poo poo-stirrer which means I like her opinions a lot

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
<3 charity

on another note... i'm stumped on docker registry image names and tags and how they relate to k8s imagepullpolicy's and how I can tie this all together.

say you have an upstream docker hub image named "foo" and a tagged version of it we want named ":bar"

then you have a git repo with a dockerfile that starts with "FROM foo:bar" and a half dozen tweaks.

then you have a cloudbuild.yaml in that repo that looks roughly like

code:
steps:
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/$PROJECT_ID/foo:bar', '.']
images: ['gcr.io/$PROJECT_ID/foo:bar']

then a build trigger setup so that any commits to that dockerfile push the new image into your GCR

great so we add a line to our docker file, the build gets triggered, and the new image gets pushed to the registry and it steals the name from or overwrites the previous one somehow (don't get this part).

that all "works" but then i have no idea how to get newly deployed instances of foo:bar to pickup the new image. they keep using the old one, even for new deployments.

googling around and reading it seems like there are three lovely options:
- change the ImagePullPolicy on all new deployments to be Always, live with whatever slowdown this causes (these things already take too long, ~30 seconds, I want that to go down not up)
- somehow involve the tag ":latest" which seems to have some hacked in corner case support, but also all the docs warn you against doing it, also I have no idea how I would make a "foo:bar:latest"
- stuff the short-sha in as a tag somehow, say foo_bar:ABCD1234, and only ever deploy the specific image (tag? wtf is the diff) for each new deployment.

I instinctively like the third option because its more explicit, however that leaves me with a random metadata dependency now. my utility that fires off new deployments (a webapp) now has to somehow what? scan the registry for new images? accept a hook at the end of the cloudbuild to know the tag and store that somewhere? That would create a weird lovely circular dependency where my webapp has to be up and working for my builds to succeed. And I have to somehow provide or pass it said tag at startup. Seems like a mess.

so, to recap:
- how do i make it so that my push-butan webapp can always deploy the latest version of our tweaked version of an upstream docker hub image, without making GBS threads where i sleep

edit: also i have no idea how its working now that i can just refer to the upstream "foo:bar" in my dockerfile, and that works, but then in my deployments say "foo:bar" and it somehow magically knows to use my "foo:bar" from GCR not the "foo:bar" from docker hub.

StabbinHobo fucked around with this message at 19:15 on Nov 7, 2018

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

StabbinHobo posted:

- stuff the short-sha in as a tag somehow, say foo_bar:ABCD1234, and only ever deploy the specific image (tag? wtf is the diff) for each new deployment.
Yes, unless you're going to cut immutable release numbers through some other process. This is important because you can then use kubectl describe deployment or docker ps to show exactly what you're running instead of having to introspect it through some other process that you maintain yourself. It also keeps you from needing to screw around with ImagePullPolicy directives, which defeats a lot of the value of Docker images for fast deployments if you aren't using micro-optimized Alpine images.
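
i.e. with immutable tags, "what's actually running?" is answered directly (names made up, output trimmed):

code:
$ kubectl describe deployment foo-webapp | grep Image
    Image:      gcr.io/my-project/foo_bar:abcd1234

$ docker ps --format '{{.Names}}\t{{.Image}}'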

StabbinHobo posted:

my utility that fires off new deployments (a webapp) now has to somehow what? scan the registry for new images? accept a hook at the end of the cloudbuild to know the tag and store that somewhere?
I'd just tell it explicitly what version needs to be deployed, personally, then you can wire that into any other process you want. I personally push to keep the deployment mechanics separate from the policy ("only the latest version should ever be deployed"), because keeping the two tightly coupled complicates the process of changing either one.

StabbinHobo posted:

That would create a weird lovely circular dependency where my webapp has to be up and working for my builds to succeed. And I have to somehow provide or pass it said tag at startup. Seems like a mess.
TBH it's pretty weird that your tool both doesn't know what version it's deploying and can't be told what version to deploy. How do you track what versions are being deployed where at what times? This design seems like a liability at multiple levels that's worth spending time to fix.

StabbinHobo posted:

edit: also i have no idea how its working now that i can just refer to the upstream "foo:bar" in my dockerfile, and that works, but then in my deployments say "foo:bar" and it somehow magically knows to use my "foo:bar" from GCR not the "foo:bar" from docker hub.
The Docker image doesn't contain named references to other images, since all tags in Docker are mutable. Your image is composed of layers, and each layer has a SHA. When you push to a repository, it checks whether the layers you're trying to push already exist. If it has them already, it can skip them. If it doesn't have them, you will push them up along with your changes. At the end, whatever registry endpoint you're using will have a download for each constituent layer, whether the source image maintainer put it there or you did.
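
You can see this from the client side, e.g. (image name made up):

code:
# tags are mutable pointers; the layers underneath are content-addressed
$ docker inspect --format '{{json .RootFS.Layers}}' gcr.io/my-project/foo:bar

# on push, layers the registry already has come back as "Layer already exists"
$ docker push gcr.io/my-project/foo:bar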

Vulture Culture fucked around with this message at 20:01 on Nov 7, 2018

Hadlock
Nov 9, 2004

We use core roller as our source of truth for container image versions. The web app, broker and database each have an application, and each app has three channels: dev/nightly, release (candidate)/qa, and stable/prod. When the ci does a new build, it updates the dev/nightly channel, qa will cherry pick a release for testing, and then when they, or their tests, bless a release, stable gets updated.

Coreroller has a simple restful api and all our scripts/deploy apps reference it as the up to date source of truth. Works great. We used the official CoreOS version, coreupdate at another company to do the same thing. Also helps keep your release manager sane as it generates an audit log in the db.

You could arguably handle this by managing it in a csv file on an http endpoint somewhere using basic auth, but restful json stuff is very portable and pretty much universally compatible, and they already did the work for you

Hadlock fucked around with this message at 21:48 on Nov 7, 2018

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
thank you mr. vulture

Vulture Culture posted:

I'd just tell it explicitly what version needs to be deployed, personally, then you can wire that into any other process you want. I personally push to keep the deployment mechanics separate from the policy ("only the latest version should ever be deployed"), because keeping the two tightly coupled complicates the process of changing either one.

TBH it's pretty weird that your tool both doesn't know what version it's deploying and can't be told what version to deploy. How do you track what versions are being deployed where at what times? This design seems like a liability at multiple levels that's worth spending time to fix.

ok sorry but I don't understand what you're saying. this is all new and not meaningfully live yet. when I started I just had my tool/webapp hardcoded to use the "foo:bar" image for each new deployment. for a good while we didn't have to change the image so it didn't matter. then a few times we did have to change it, but we were tearing down and rebuilding the cluster so frequently anyway for other stuff that it was easier to just do that and not think about it further. now we're getting to the point where we actually need to update it without a teardown so i'm here futzing. lol at even calling it a "design" I'm just playing k8s-docs+stack-overflow whack-a-mole.

so now if I go change the code in my webapp to deploy foo_bar:12345 then i've *one time* solved the problem but just moved the problem down the road a little farther. or i guess i've created a "every time you update the image you also have to redeploy the tool" workflow. not the end of the world, just more dependencies and schlock work. I could automate it by having the cloudbuild job post it to an endpoint on my webapp and then have my webapp store it in a datastore or something, but that seems wrong in a way I can't quite articulate. assume for the moment my webapp is ephemeral and doesn't even have a persistent datastore, when it gets redeployed it doesn't know what image to use until the next cloudbuild job is run to tell it.

conceptually "latest" is exactly what I want, its just this [1] [2] [3] [4] make that seem like a bad road.

[1] https://github.com/kubernetes/kubernetes/issues/33664
[2] https://kubernetes.io/docs/concepts/containers/images/#updating-images
[3] https://discuss.kubernetes.io/t/use-latest-image-tag-to-update-a-deployment/2929
[4] https://github.com/kubernetes/kubernetes/issues/13488

much like an "apt install foo" will just always give you the latest foo, I just want my deployment of an image to always use the latest image.

StabbinHobo fucked around with this message at 00:22 on Nov 8, 2018

Methanar
Sep 26, 2013

by the sex ghost
Can you make your build system output the sha of the branch you built somewhere and inject it into a kubectl/helm/whatever command to upgrade as part of your CD, if you're doing CD

For me that looks like this where image.tag is a variable in my deployment.yaml

code:
    spec:
      containers:
      - name: {{ .Values.name }}
        image: "gcr.io/thing/thing/thing:{{.Values.image.tag}}"

code:
helm upgrade thing  thing.tgz --set image.tag=f6e5810
If you really just want :latest to always be the latest and expect it to work, I think you will need to just set your imagepullpolicy to always. Which maybe isn't the end of the world if your stuff does deploy quick.

Methanar fucked around with this message at 00:28 on Nov 8, 2018

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS

Hadlock posted:

We use core roller as our source of truth for container image versions. The web app, broker and database each have an application, and each app has three channels: dev/nightly, release (candidate)/qa, and stable/prod. When the ci does a new build, it updates the dev/nightly channel, qa will cherry pick a release for testing, and then when they, or their tests, bless a release, stable gets updated.

Coreroller has a simple restful api and all our scripts/deploy apps reference it as the up to date source of truth. Works great. We used the official CoreOS version, coreupdate at another company to do the same thing. Also helps keep your release manager sane as it generates an audit log in the db.

You could arguably handle this by managing it in a csv file on an http endpoint somewhere using basic auth, but restful json stuff is very portable and pretty much universally compatible, and they already did the work for you

I googled core roller

and I'm pretty sure you just told me i'm fat and need to work out more. Sure, true, but I don't see how that helps :)

Nah, I think I get what you mean, but pretend this isn't some pre-existing professional IT environment with lots of tools and people to do work and poo poo. I'm trying to configure the basic bare minimum components for a PoC, not hook into an existing enterprise. Just looking at coreroller's screenshots screams overkill.

StabbinHobo fucked around with this message at 00:45 on Nov 8, 2018

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS

Methanar posted:

Can you make your build system output the sha of the branch you built somewhere and inject it into a kubectl/helm/whatever command to upgrade as part of your CD, if you're doing CD

For me that looks like this where image.tag is a variable in my deployment.yaml

code:
    spec:
      containers:
      - name: {{ .Values.name }}
        image: "gcr.io/thing/thing/thing:{{.Values.image.tag}}"

code:
helm upgrade thing  thing.tgz --set image.tag=f6e5810
If you really just want :latest to always be the latest and expect it to work, I think you will need to just set your imagepullpolicy to always. Which maybe isn't the end of the world if your stuff does deploy quick.

sure but what is that "somewhere" and then who's running that helm command? that just dragged another dependency *and* human into the flow.

Vanadium
Jan 8, 2005

There shouldn't be a human. My understanding based on doing something analogous with terraform and ECS and reading through a coworker's k8s setup is: Your automation/build script should not only generate a docker image and tag it with something like the git commit ID, it should at the same time also generate the updated yaml files (maybe from a template in the repo) referencing the new image tagged with a unique ID, and then shove that file into k8s or whatever.
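
A bare-bones sketch of that flow, with made-up names and a dumb sed standing in for a real templating step:

code:
#!/usr/bin/env bash
set -euo pipefail

TAG="$(git rev-parse --short HEAD)"
IMAGE="gcr.io/${PROJECT_ID}/foo_bar:${TAG}"

docker build -t "$IMAGE" .
docker push "$IMAGE"

# deployment.yaml.tmpl has "image: __IMAGE__" in its pod spec
sed "s|__IMAGE__|${IMAGE}|" deployment.yaml.tmpl > deployment.yaml
kubectl apply -f deployment.yaml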

Methanar
Sep 26, 2013

by the sex ghost

StabbinHobo posted:

sure but what is that "somewhere" and then who's running that helm command? that just dragged another dependency *and* human into the flow.

What I mean more is that when your build job completes, your code has had its dependencies pulled down, modules compiled, docker images built and pushed to a registry, and tests run. Your task runner completes the build step and then executes another job for the deploy step. That next step might just be a helm upgrade.


Okay, actually, here's a more complete example of how I've been doing it. We have an extra step where we upload tarballs to s3 for things that are not yet containerized in prod; this isn't necessary and you could omit it, but I'm including it in my example.

Build your code, note the sha of the branch you built
code:
 BUILD_ID_SHORT="$(git rev-parse HEAD | cut -c1-7 - )" 
Build your docker image, tagging it with the sha of your branch; we also pass the build ID in as a build arg so the internal shell script knows which build to embed into the image.

code:
 docker build -t gcr.io/thing/thing/thing:${BUILD_ID_SHORT} --build-arg BUILD_ID=${BUILD_ID_SHORT} .
Where the Dockerfile is something like this, calling a shell script with the actual logic.

code:
FROM ubuntu:16.04
ARG SERVICE=thing
ARG BUILD_ID=1234567-prod
ENV BUILD_ID=$BUILD_ID
# deploy.sh (next block) pulls this build's tarball out of s3 and unpacks it
COPY deploy.sh /deploy.sh
RUN /bin/bash -x /deploy.sh $SERVICE $BUILD_ID
Where the shell script is something like this to pull the assets out of s3
code:
x=1
while true; do
  ((x++))
  if aws --profile "$AWS_PROFILE" s3 cp "${S3_PREFIX}/thing/thing-${BUILD_ID}.tar.gz" - | bsdtar -xzf - -C /apps/; then
    break
  fi
  if  [ $x == 6 ]; then
    echo "we have tried to download ${BUILD_ID} 5 times and failed. Aborting."
    break
  fi
  sleep 1
done
Finally you can run a deployment command, helm or otherwise, also within your task runner

code:
 helm upgrade thing thing.tgz --set image.tag=$BUILD_ID_SHORT
code:
 kubectl set image deployment.v1.apps/deployment app=app:$BUILD_ID_SHORT

Methanar fucked around with this message at 01:24 on Nov 8, 2018

Scikar
Nov 20, 2005

5? Seriously?

StabbinHobo posted:

<3 charity

on another note... i'm stumped on docker registry image names and tags and how they relate to k8s imagepullpolicy's and how I can tie this all together.

say you have an upstream docker hub image named "foo" and a tagged version of it we want named ":bar"

then you have a git repo with a dockerfile that starts with "FROM foo:bar" and a half dozen tweaks.

then you have a cloudbuild.yaml in that repo that looks roughly like


then a build trigger setup so that any commits to that dockerfile push the new image into your GCR

great so we add a line to our docker file, the build gets triggered, and the new image gets pushed to the registry and it steals the name from or overwrites the previous one somehow (don't get this part).

that all "works" but then i have no idea how to get newly deployed instances of foo:bar to pickup the new image. they keep using the old one, even for new deployments.

googling around and reading it seems like there are three lovely options:
- change the ImagePullPolicy on all new deployments to be Always, live with whatever slowdown this causes (these things already take too long, ~30 seconds, I want that to go down not up)
- somehow involve the tag ":latest" which seems to have some hacked in corner case support, but also all the docs warn you against doing it, also I have no idea how I would make a "foo:bar:latest"
- stuff the short-sha in as a tag somehow, say foo_bar:ABCD1234, and only ever deploy the specific image (tag? wtf is the diff) for each new deployment.

I instinctively like the third option because its more explicit, however that leaves me with a random metadata dependency now. my utility that fires off new deployments (a webapp) now has to somehow what? scan the registry for new images? accept a hook at the end of the cloudbuild to know the tag and store that somewhere? That would create a weird lovely circular dependency where my webapp has to be up and working for my builds to succeed. And I have to somehow provide or pass it said tag at startup. Seems like a mess.

so, to recap:
- how do i make it so that my push-butan webapp can always deploy the latest version of our tweaked version of an upstream docker hub image, without making GBS threads where i sleep

edit: also i have no idea how its working now that i can just refer to the upstream "foo:bar" in my dockerfile, and that works, but then in my deployments say "foo:bar" and it somehow magically knows to use my "foo:bar" from GCR not the "foo:bar" from docker hub.

I think there's a question mark over what your webapp does with deployments exactly, which might be helpful to explain, but I think I can clear up a few things anyway if I'm understanding you correctly. Firstly, a tag is really just an alias for an image digest, and you can have multiple tags per image, updated on different cycles. So when you push foo_bar:ABCD1234, it can also update foo_bar:latest in the registry to point to the same image digest. That would give you a reference point for your deployments: when you press deploy in your webapp, it looks up foo_bar:latest, reads the digest, and creates a new deployment with that digest directly; kubernetes doesn't have to know or care what the tag is. Or if you need to upgrade existing deployments, get and patch them with the updated digest so kubernetes can do a rolling upgrade (I think, I'm not quite at that point yet myself).
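
If it helps, you can resolve the tag to its digest from the client side too (image name made up):

code:
$ docker pull gcr.io/my-project/foo_bar:latest
$ docker inspect --format '{{index .RepoDigests 0}}' gcr.io/my-project/foo_bar:latest
gcr.io/my-project/foo_bar@sha256:...

# a deployment can then reference the digest form directly:
#   image: gcr.io/my-project/foo_bar@sha256:...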

Lastly, if you do have a hook I would put it on your registry rather than the CI build. The build process just has to get the new image up to the registry and it's done. The registry itself (I assume, we use the Azure one which can at least) can then notify your webapp that a new version was pushed and trigger that workflow, but if your webapp isn't running your build still succeeds (and presumably your webapp can just run the workflow when it does get started again).

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

StabbinHobo posted:

ok sorry but I don't understand what you're saying. this is all new and not meaningfully live yet. when I started I just had my tool/webapp hardcoded to use the "foo:bar" image for each new deployment. for a good while we didn't have to change the image so it didn't matter. then a few times we did have to change it, but we were tearing down and rebuilding the cluster so frequently anyway for other stuff that it was easier to just do that and not think about it further. now we're getting to the point where we actually need to update it without a teardown so i'm here futzing. lol at even calling it a "design" I'm just playing k8s-docs+stack-overflow whack-a-mole.
I'm having a little bit of trouble understanding what "my tool/webapp" actually does, or why it's important to your workflow that this application does it. Could you elaborate? I want to make sure I'm providing good recommendations and not inundating you with general K8s guidance that for some reason doesn't fit your specific business requirements at all.

StabbinHobo posted:

so now if I go change the code in my webapp to deploy foo_bar:12345 then i've *one time* solved the problem but just moved the problem down the road a little farther. or i guess i've created a "every time you update the image you also have to redeploy the tool" workflow. not the end of the world, just more dependencies and schlock work. I could automate it by having the cloudbuild job post it to an endpoint on my webapp and then have my webapp store it in a datastore or something, but that seems wrong in a way I can't quite articulate. assume for the moment my webapp is ephemeral and doesn't even have a persistent datastore, when it gets redeployed it doesn't know what image to use until the next cloudbuild job is run to tell it.
An image isn't a deployment construct, it's an image, in the same way that a VMware template isn't a cluster of VMs, it's a template. If your deployment approach conflates the two, you're going to have a bad time. You don't need to do something crazy like put a Spinnaker CD system into production, or even a "package manager" like Helm, but you should leverage things like Kubernetes deployments where they make sense. The problem you're trying to deal with—upgrading a deployed K8s application to a new image tag—is handled by the kubectl set image construct. If you're doing bare-bones DIY continuous deployment, a simple way to drive this would be to have your build system update that image reference in Kubernetes once the images are pushed up to the repository.
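
In its crudest form, the last step of the build job is just something like this (deployment and container names made up):

code:
# after the build pushes gcr.io/$PROJECT_ID/foo_bar:$SHORT_SHA, point the deployment at it
$ kubectl set image deployment/foo-webapp foo-webapp=gcr.io/$PROJECT_ID/foo_bar:$SHORT_SHA
$ kubectl rollout status deployment/foo-webapp   # block until the rolling update finishes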

StabbinHobo posted:

much like an "apt install foo" will just always give you the latest foo, I just want my deployment of an image to always use the latest image.
One thing you might not have considered is that if you use tags for this approach, when an application is automatically respawned—say, because you have a host fall over and your pod gets automatically rescheduled onto another host—that host will start the deployment process by pulling down the latest version of the image, giving you an upgrade you weren't prepared for and might not want right now. In the case of something like Jenkins (an intentionally convoluted, triggering example of an app), this can be painful because it will grab an image that might be completely incompatible with the configuration or the on-disk volume data that you've configured your app to use. In almost all cases, it's better to be explicit about the version you want within Kubernetes, and use something else to drive the policy of ensuring your applications are at some specific version or other.
