FamDav
Mar 29, 2008
with some combination of heroku, circleci, and github you could get your setup running a poor version of continuous deployment in a few hours.

FamDav
Mar 29, 2008

revmoo posted:

It seems simple to just wave a hand and say it's easy, but I can see a ton of pitfalls; session management being a huge one. Right now our deploys are less than 1ms between old/new code and we're able to carry on existing sessions without dumping users. I don't see how Docker could do that without adding a bunch of extra layers of stuff.

Are you using persistent connections? Storing session data local to each host? How many instances of your application are you running?

I'd have more specific answers depending on the particulars of your software, but my general answer is that it sounds like you're deploying for the success case rather than the failure case. What if your deployment takes too long to start up, or never comes back online? And what about failures that happen outside of deployments, like hardware failures?

Deployment is a great way to test that your service is resilient to some bog-standard kinds of outages. Depending on host fidelity and near-instantaneous switchovers is dangerous, because that assumption will eventually break. It's also orthogonal to docker itself.

FamDav
Mar 29, 2008

revmoo posted:

I'm quite happy with my deployment methodology and I'm not interested in changing it. I would definitely like to explore Docker for infrastructure management but I couldn't imagine using it to deploy code.

So docker (disregarding swarm) is not the right tool for managing deployments of software; it's more akin to xen or virtualbox than to any orchestration tool. for using containers to deploy code, there's

* kubernetes
* ec2 container service
* azure container service
* hashicorp nomad
* docker swarm
* probably other things!

which use docker to deploy containers. they all have a variety of tools and controls that let you manage containers across a fleet of machines, whereas docker itself is the daemon that manages a single host's worth of containers.

quote:

I'm really just curious why anyone would do it that way. I guess it would make sense if you built your application to scale massively from the very beginning rather than an organic evolution, but that's incredibly rare in this industry. Like, my understanding is FB deploys a monolithic binary (using torrents IIRC), but I don't think even they are pushing whole machine images at this point.

I'm not up to date on how facebook deploys its frontend (which is only a part of its infrastructure), but there are companies like netflix that deploy entire machine images to cut down on mutation in production. moving stateful modifications to production from the instance level -- like placing new code in the same VM as your currently running software, or updating installed packages -- to the control-plane level -- replace every instance of v0 with v1 -- means fewer chances for unintended side effects while also making it much simpler to audit what you are actually running.
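
as a rough sketch of what "replace every instance of v0 with v1" looks like mechanically (untested; the tools and every name here are my own picks for illustration, not how netflix specifically does it):

code:
# bake a fresh machine image per release instead of mutating running hosts
packer build -var "app_version=v1" app-ami.pkr.hcl

# point the fleet's launch template at the new image, then roll the fleet
# (assuming the asg tracks the $Latest template version)
aws ec2 create-launch-template-version \
  --launch-template-name app --source-version 1 \
  --launch-template-data '{"ImageId":"ami-0123456789abcdef0"}'
aws autoscaling start-instance-refresh --auto-scaling-group-name app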

quote:

I'm not really understanding your statements about deploying for success or failure. My deployments are resilient, and have a number of abort steps. We saw the occasional failed deploy bring down prod once or twice a long time ago but we re-engineered the process to work around those issues and haven't had issues in a loooooong time. Nowadays if a deploy fails, that's it, it just failed and we address the issue. Prod remains unaffected. I haven't seen prod brought down from a deployment in ages. I can't imagine why using Docker would make these sorts of issues better or worse, if anything you're just facing another spectrum of possible issues.

I have no idea what this means.

deploying for failure is treating failed deployments as the norm and making sure your service is resilient to that.

I agree that docker itself does not directly help you with deployment safety, because that's really a function of how you deploy software (not docker's problem) and how you wrote your software to handle deployments/partial outages (also not docker's problem). docker does help, however you deploy software, by

* providing an alternative to VMs as immutable infrastructure
* reducing the size/amount of software that has to be deployed with each application
* removing most of the startup time involved with spinning up a new VM

the reason i brought up success vs. failure is that it sounded like your expectation during a deployment is that the cutover between old and new code will be

* on the same host
* near instantaneous

because you are depending on that for a good customer experience (maintaining sessions). this sounds bad to me, as there are various failure scenarios during and outside deployments where these things will not be true. This is an issue whether you use docker or not.

FamDav
Mar 29, 2008

jaegerx posted:

Docker uses layers when updating so you're not pushing a whole new container. Just the changes you made. Think of it like a patch. Just diff old container vs new container

I think it's better to characterize a docker image as a lineage, where each layer represents an atomic progression. If you're rebuilding from a Dockerfile, for instance, you will end up replacing the topmost layers rather than extending from the most recently generated image.

It's also important to realize that a dockerfile generates a new layer for every instruction it executes. So if you download an entire compiler toolchain into your image and then delete it in a later step once compilation is done, you are still downloading that toolchain layer on every docker pull. They still(!) haven't even given users an option to auto-squash dockerfiles.
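
you can see the effect with docker history (image name and sizes made up, output trimmed):

code:
docker history myorg/myapp:latest
# IMAGE       CREATED BY                                             SIZE
# abc123def   CMD ["/app/server"]                                    0B
# <missing>   RUN rm -rf /toolchain                                  0B     <- "deleting" it later doesn't help
# <missing>   RUN /toolchain/bin/make all                            15MB
# <missing>   RUN curl https://example.com/toolchain.tgz | tar -xz   310MB  <- pulled on every docker pull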

FamDav
Mar 29, 2008

necrobobsledder posted:

Containers do not replace configuration management. However, it lets you get to a place where you can at least separate application configuration and artifacts from operations concerns such as logging, monitoring, kernel tuning, etc. The model also makes it easier to enforce 12-F applications instead of allowing lazy developers to dump files all over a filesystem.

One new problem that arises is that now containers can become quickly outdated if they're created in a typical lazy manner that includes way more dependencies than necessary (all those lazy people using Ubuntu base images are scary, man). However, you can very quickly update your container hosts (may God have mercy on your soul if you're not using container orchestration) and many patching tasks become specific to application containers. This greatly reduces the change-management burden on operations teams and helps achieve higher density and separation of containers. For example, you can reschedule containers to a set of nodes that are isolated from known updated containers.

I still use Puppet and Chef to provision and maintain my worker nodes that host my containers.

One of the nicest benefits of distinguishing infrastructure from applications is that you can set your infra to rolling-replace itself every, say, 48 hours, which lets you put tight bounds on your fleet's heterogeneity and overall age.
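
on aws, for example, one low-effort way to get that is the asg's max instance lifetime (sketch, names made up):

code:
# no worker instance may live longer than 48 hours; the asg rotates
# instances out on its own once they hit that age
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name container-workers \
  --max-instance-lifetime 172800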

FamDav
Mar 29, 2008

Thermopyle posted:

I mostly use lambda nowadays, but I'd be interested in hearing what else you're talking about here as I don't really follow this space much.

lambda is in some sense one of those things, but there's also

* l4/l7 load balancers
* intensely scalable key-value stores
* exabyte-scale object storage
* managed mysql/postgres/oracle + custom variants like aurora
* distributed tracing
* metrics, logs, alarms
* peta/exabyte-scale managed data warehouses
* several managed ML and "big" data systems
* build, deploy, and pipelines
* managed git
* CDN
* declarative infrastructure
* cloud windows desktops
* managed exchange, documents
* managed pub/sub
* managed queues, both in the form of SQS (lower overall throughput but with a very simple API for synchronizing consumers) and kinesis (effectively infinite throughput but requires more individual coordination)
* managed state machines
* managed work chat
* managed call center
* multi-account organizations
* a particularly thorough permissions system for all of this

and there are things i've missed, and things that have yet to be released

EDIT: to make this less of an appeal to list, i'll add that most of this stuff is free or effectively free on top of the cost of compute/storage/networking. and it's also billed in fractional increments over fractional periods of time, so you really do pay for just what you use. the biggest downside of this (and a thing we need to focus on correcting) is that it's so easy to forget about all the things you've started using and end up with a bill that doesn't accurately reflect what you actually used

FamDav fucked around with this message at 22:52 on Aug 6, 2017

FamDav
Mar 29, 2008

Mr. Crow posted:

(see the several plain text DoD classified info leaks on AWS).

i don't think the lack of intelligence on the part of a contractor willfully making classified data public in an s3 bucket is a good example of whether or not aws is a good choice for hosting.

FamDav
Mar 29, 2008

Mr. Crow posted:

Ugh users being idiots has been the driving force behind restrictive IT policy since forever and is a definitive reason why companies wouldn't want to let their ip anywhere near public servers.

aws gives you practically all the tools necessary to have a restrictive IT policy if you want. for the above example, you can restrict the ability to create a publicly readable bucket organization-wide in about 10 lines of json.
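
something in the spirit of this (untested sketch; all names are made up, and the exact actions/condition keys are worth double-checking against the current SCP docs):

code:
cat > deny-public-s3-acls.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyPublicReadAcls",
    "Effect": "Deny",
    "Action": ["s3:CreateBucket", "s3:PutBucketAcl"],
    "Resource": "*",
    "Condition": {
      "StringEquals": { "s3:x-amz-acl": ["public-read", "public-read-write"] }
    }
  }]
}
EOF
aws organizations create-policy --name deny-public-s3-acls \
  --description "no public bucket acls" --type SERVICE_CONTROL_POLICY \
  --content file://deny-public-s3-acls.json
# then attach-policy to the org root or the relevant OUs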

there are plenty of valid reasons why you might prefer not to use the public cloud, but the notion that it's less secure, or that it's difficult/impossible to implement all kinds of IT policies, is suspect.

FamDav fucked around with this message at 04:41 on Dec 12, 2017

FamDav
Mar 29, 2008

Mr. Crow posted:

Ok so you are being argumentative over a one-off anecdote of apparently dubious quality and then arriving at the same conclusion. :waycool:

I'm not arguing against you, because it's now pretty obvious that it isn't your opinion; I'm arguing against the idea that there's flexibility in one direction (e.g. the ability to make an s3 bucket world-readable with the click of a button) and not the other (restricting almost any interesting option, across any service, organization-wide).

FamDav
Mar 29, 2008

Punkbob posted:

Yeah I’d do nginx ingress controller instead of traefik.

It’s a cool project but not super well integrated into kubernetes. To really run it you need to expose a key value store for it to hold state that nginx-ingress + kube LEGO does for you.

Edit also use kops, unless you really want to roll by hand.

And if you aren’t already in AWS I’d take a real hard look at GKE instead of managing it yourself.

EKS is right around the corner, and there's kops to cover the intervening period

FamDav
Mar 29, 2008
I don’t disagree with much of that but

Punkbob posted:

gcp is more secure then AWS by default

What examples led you to that belief?

FamDav
Mar 29, 2008

Punkbob posted:

Basically GCP encrypts everything (like storage and network) by default; you don’t have to add your own stuff on top. I used to have to do HIPAA compliance in the cloud and GCP basically starts as default secure while AWS requires a lot of work. If we didn’t have $100k in credit with them I would have pushed harder to dump them.

for storage on aws, you don't have to add any of your own stuff on top. you click the "encrypt this at rest" button and then either use aws encryption keys or your own, which you can do with just about any service that stores customer data.
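
the "button" is also one api call if you're setting it in bulk (bucket and key names made up):

code:
aws s3api put-bucket-encryption --bucket my-bucket \
  --server-side-encryption-configuration \
  '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms","KMSMasterKeyID":"alias/my-key"}}]}'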

as for networking, gcp doesn't encrypt all networking either (similar to aws vpc, traffic is authenticated but not encrypted within a ~region), so if you're not supplying some form of encryption between hosts in your kubernetes cluster, that traffic is flowing unencrypted. i've seen a couple of people make this mistake because gcp does do encryption in transit for some aspects of their infrastructure.

FamDav
Mar 29, 2008

Punkbob posted:

Mea culpa. I guess google just wasn’t worried about in transit and HIPAA peering. That’s a bummer that I misunderstood.


Still GKE is pretty dope, and you don’t have to wait for EKS. Also the rbac and google stuff is much easier then accomplishing the same with rear end/IAM.


But EKS hopefully should make kubernetes more turn key for more people.

yeah, sorry if i was rude. i get super concerned when anyone is making a compliance/security decision based on mistaken information, of which there is a ton around cloud services. combination of complexity and feature velocity :/.

FamDav
Mar 29, 2008

Stringent posted:

And he works for Amazon.

I do, and sorry if i sounded like i was shilling. if they had made the same statement about aws networking (which is also authenticated but not encrypted within the region) i would've corrected them, because making policy/security decisions based on incorrect information is harmful to just about everybody, including other cloud providers.

FamDav
Mar 29, 2008

Cancelbot posted:

We had one team use Spinnaker and its clunky and very slow, and we are a .NET/Windows company which doesn't fit as nicely into some of the products or practices available. Fortunately something magical happened last week: our infrastructure team obliterated the Spinnaker server because it had a "QA" tag and deleted all the backups as well. So right now that one team is being ported into ECS as the first goal in our "move all poo poo to containers" strategy.

Edit: We're probably going to restore Spinnaker but make it more ECS focused than huge Windows AMI focused.

So anecdotally, there are some things I worked on as part of the design for https://aws.amazon.com/blogs/devops/use-aws-codedeploy-to-implement-blue-green-deployments-for-aws-fargate-and-amazon-ecs/ that should eventually make integration with spinnaker a snap. Spoilers!

FamDav
Mar 29, 2008

Docjowles posted:

You can definitely do this with a layer 7 load balancer. It was a configurable option on the piece of poo poo A10s we recently got rid of. The client makes its persistent TCP connection with the LB. The LB maintains its own set of TCP connections with each backend server. It inspects the client headers and each individual HTTP request in that session is round-robined across the servers. There isn’t really any fuckery involved at the TCP/IP layer.

This is how ALBs work as well. You can optionally choose sticky sessions to explicitly avoid redistribution.
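
stickiness is just a target group attribute (arn elided, sketch):

code:
aws elbv2 modify-target-group-attributes \
  --target-group-arn <target-group-arn> \
  --attributes Key=stickiness.enabled,Value=true \
               Key=stickiness.type,Value=lb_cookie \
               Key=stickiness.lb_cookie.duration_seconds,Value=86400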

For l4, there are plenty of ways to have the load balancer keep flows highly available by distributing flow decisions across hosts, but you’d be hard-pressed to redistribute a tcp flow across backends without it being shadow traffic.

FamDav
Mar 29, 2008

necrobobsledder posted:

Last I could tell from the outside, half the CloudFormation blogposts and engineers appeared to be based out of India which makes me wonder if it's strategically important enough for AWS to put more higher profile engineers on them. Doubtful you'd see that happen to IAM, in contrast.

The CloudFormation team is based in Seattle and Vancouver. Are you basing this off names?

New Yorp New Yorp posted:

Containers / Docker Compose have nothing to do with Azure.

Don't prioritize integration tests, prioritize unit tests. Unit tests verify correct behavior of units of code (classes, methods, etc). Integration tests verify that the correctly-working units of code can communicate to other correctly-working units of code (service A can talk to service B). Both serve an important purpose, but the bulk of your test effort should go into unit tests.

Counterpoint: your customers don't use units, they use your application. Write unit tests so that you better define the constraints between parts of your codebase, but integration testing is what actually shows you what customers are experiencing. Make sure you make integration testing easy and consistent for your team so that they write a lot of them.

FamDav
Mar 29, 2008

freeasinbeer posted:

Also the dude behind envoy, who’s name escapes me.

matt klein

FamDav
Mar 29, 2008

Zorak of Michigan posted:

Why take work home with you?

to be fair, almost half of those people are only good follows if you want hot takes on tech. very few people are out there giving good advice on twitter, because it's hard to build a following that way.

FamDav
Mar 29, 2008

freeasinbeer posted:

So I’m fine with the design. But it’s seemingly at odds with the idea of microaccounts for IAM access. All of the pain of setting up IAM trusts and limited roles across multiple accounts, to say yolo to the network tier.

there are alternatives like leveraging privatelink within an organization to connect services across many tiny VPCs
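
roughly: the service-owning account puts an nlb in front of the service and exposes it as an endpoint service, and each consumer account drops an interface endpoint into its own tiny vpc (untested sketch, all ids are placeholders):

code:
# service owner's account
aws ec2 create-vpc-endpoint-service-configuration \
  --network-load-balancer-arns <nlb-arn> --acceptance-required

# each consumer account/vpc
aws ec2 create-vpc-endpoint --vpc-endpoint-type Interface \
  --vpc-id <vpc-id> --subnet-ids <subnet-id> \
  --service-name com.amazonaws.vpce.us-west-2.vpce-svc-<id>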

FamDav
Mar 29, 2008

Methanar posted:

Where do the public IPs on an NLB actually go.

Like if my DNS record resolves to 1.1.1.1 and 2.2.2.2 and I have three availability zones as backends of my NLB. What availability zones does that ingress traffic to enter the NLB go to. I can't find how this entry-to-NLB part works in the aws docs.

Asking because us-west-2 has had a big fire today impacting an AZ's ability to communicate to the internet, and cross-AZ. We've been considering moving some workloads out of us-west-2, but for application reasons, its complicated. And its unclear how much it would benefit us given that there is the possibility that at least some traffic is being dropped before it enters the NLB in the first place, should it depend on the broken AZ.

Do those IPs have affinity to a particular AZ? Are they anycasted? Is there any failover mechanism where if one AZ is dead, it stops being advertised and traffic shifts over to a working AZ? Do I have any insight to this whatsoever as a user?

If the normal OS dns resolution scheme gives me an IP that is just broken because it goes to a bad AZ, I suppose it would be up to the application to have the correct logic to know to try all the IPs returned in the dns record response until it finds an IP that works.

each ip corresponds to a specific az-local loadbalancer, so if you have an nlb deployed into 3 azs you should have 3 ips in the dns response for your lb. by default nlb doesn't perform cross-az routing, so the loadbalancer in zone 1 will only route to endpoints in zone 1, etc. you can optionally turn on cross-az routing, which will distribute the entire set of endpoints to all zonal loadbalancers.

by default nlb will pull a zonal loadbalancer from dns if there are no healthy endpoints behind it, either because there are no endpoints in that zone or all of them are unhealthy. however, there is no explicit mechanism to control this via the nlb api.

however, there is a workaround! if you have your nlb dns record, say nlb-1234.us-west-2.amazonaws.com, then you can leverage the dns record us-west-2a.nlb-1234.us-west-2.amazonaws.com to get just the ip for us-west-2a (and so on for all zones). you can use that along with weighted alias records in route 53 to implement your own weighting mechanism and weight an entire az out of dns.
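
concretely, with the example name above (ips made up):

code:
dig +short nlb-1234.us-west-2.amazonaws.com
# 1.1.1.1
# 2.2.2.2
# 3.3.3.3
dig +short us-west-2a.nlb-1234.us-west-2.amazonaws.com
# 1.1.1.1   <- just that zone's loadbalancer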

FamDav fucked around with this message at 22:11 on Aug 31, 2021

FamDav
Mar 29, 2008

Blinkz0rz posted:

Any viable replacements for Docker Desktop on Mac yet? We're beginning to replace infra container management with containerd but the dev experience still sucks rear end and after Docker's announcement about the change to their licensing, I'd think other products would be ready to pounce.

i'm a fan of using lima and nerdctl or docker. once installed you can basically do

code:
alias docker="limactl shell <vm> docker"
and it all just works
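
a rough quickstart if you're starting from zero (assuming homebrew; lima's default template ships containerd + nerdctl):

code:
brew install lima
limactl start                       # creates and boots the "default" vm
lima nerdctl run --rm hello-world   # nerdctl inside the vm via the lima shortcut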

https://github.com/lima-vm/lima

FamDav
Mar 29, 2008

freeasinbeer posted:

If your management are unsure, then uh, this can be a bad time going alone or screaming into the void. It can then turn on a dime when something like an auditor gets wind, and starts threatening to put your risk for tampering on your 10k.

There has also been a lot of bs/tech talks out of AWS for example talking about micro accounts, and giving full access to them, they are pushing that plus CDK as the answer to delegating access and segregation.

There are ways to enforce sanity, we’re using OPA + Atlantis to do it for terraform, and are now introducing the same guardrails to our cloud formation ecosystem.

internal to aws, admin access to production is a carefully monitored if not completely verboten action. everything should be going through ci/cd pipelines leveraging cfn/cdk. you should be using many individual accounts to limit the blast radius of even your cfn/cdk changes impacting multiple services, but that doesn't mean you should be giving anyone on the team full admin access, save for "break glass/life preserver" scenarios.

FamDav
Mar 29, 2008

Methanar posted:

I'm not a release expert, but I can't imagine a world where yolo deploying every commit, passing tests or not, is a good idea.

if your automation isn’t telling you what will be an issue before it reaches your customers, invest more in automation.

quote:

How do you handle any sort of schema update? Or dependency management at all. You probably don't want to have your app try to call APIs from some upstream that aren't implemented yet and have somebody typing git push be the gate keeper of that. And if you're constantly deploying everything across your engineering org how do you ever know what's really live anywhere.

first, you don’t let people push without a code review that includes release/revert instructions. api not in production yet? don’t ship the code that calls it, or put that code behind a feature flag.

and you know what’s live everywhere by tracking state and querying it. a variety of deployment systems exist to manage this task, many open source.

quote:

Also how does multiple people working on a feature/branch work. How do you enforce that everybody involved is always properly rebased in exactly the right manner with no possibility of ambiguity or accidental regression breaking somebody else.

well, avoid feature branches and embrace feature flags. and if someone is dumb enough to do a long-lived feature branch and screws things up on merge, then that’s what the tests, staging environments, etc. are for.

FamDav
Mar 29, 2008
that on the extreme end has the problem of people cutting costs in ways that do not help the business, e.g. via risky levels of efficiency or bureaucracy that reduces the amount of good work being done.

cost reduction is not something that operates on a single metric. even seemingly low-hanging fruit isn't necessarily a good optimization target, because that spend may be buying efficiency elsewhere.

FamDav
Mar 29, 2008
can i short your company

FamDav
Mar 29, 2008
consul is an objectively bad design for all of the problems it purports to solve, and its failure modes are egregious. it makes docker swarm look well reasoned.

if you're rolling out a new deployment of consul in 2022 then you really need to step back and ask yourself if you should be in this business.

FamDav
Mar 29, 2008

Methanar posted:

I was no joke half way through writing something positive when pagerduty paged me for the 4th time today.
(For the third time this month. Somebody on the security team pushed out broad changes and walked away without testing or validating poo poo and broke everything leaving me to get called in for it because Kubernetes is the most visible thing to fail. Wasn't even Kubernetes-specific this time - just broke everything that depends on running the base Chef role)



does this stuff not materially impact revenue, or does your leadership just do a poo poo job of incentivizing people to not cause significant outages all the time

FamDav fucked around with this message at 20:55 on Jan 30, 2023

FamDav
Mar 29, 2008

Warbird posted:

This isn’t strictly DevOps but close enough that I figure folks here might have an idea on the matter.

I grabbed one of these new M2 MacBooks with the intent, among other things, to use some of the extra beef to spin up some VMs in order to dink around with K8s finally. Lo and behold it seems that Virtualbox support on the processor is spotty right now and anything I emulate via any means will also be ARM based. That isn’t awful but most of the reason I didn’t already do this on a few RasPis was that ARM support of containers and most K8s guides/walkthroughs don’t usually line up.

What’s the play here? Wait for Virtualbox to get in a usable state? Pay for a Parallels sub (have standard, have to have Pro for Vagrant compatibility)? Try and convince the wife to let me spend some $$$ on a Proxmox instance with more than 16GB of RAM?

https://github.com/lima-vm/lima
