|
Dunno how they are for companies, but I've been using Porkbun and reasonably happy with them. They have an API (I haven't used it) so it seems like there's at least the possibility of running it at some level of scale.
|
# ¿ Jun 18, 2023 20:16 |
|
|
# ¿ May 3, 2024 23:07 |
|
Collateral Damage posted:If any of you use Moq in your testing you should probably yank it out, or at the very least pin it at version 4.18 This is a loving landmine and y'all should read it. Basically the maintainer gets mad he isn't getting paid for OSS (which like, fair, real problem) and decides the way to fix it is to bundle a closed-source compiled dll into his project that scrapes your email from your git repo, hashes it with SHA-256, and uploads it to his server so he can pause your loving build and nag you for not being a PERSONAL loving SPONSOR. This is literally about nagging actual individual devs, not trying to get companies to cough up.
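For reference, the reported mechanism is trivial to sketch. This is a hedged illustration of what "scrape the git email and SHA-256 it" looks like in Python - the function name and normalization are mine, not Moq's actual code:

```python
import hashlib

def hash_email(email: str) -> str:
    """Sketch of the reported SponsorLink behavior: normalize the git
    user.email and hash it with SHA-256 before uploading. (Hypothetical
    reconstruction, not the actual closed-source implementation.)"""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Worth noting that hashing an email like this isn't anonymization in any meaningful sense: anyone holding a known address can hash it and compare, which is a big part of why people treated this as PII exfiltration rather than telemetry.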
|
# ¿ Aug 10, 2023 07:50 |
|
Docjowles posted:Speaking of Hashicorp apparently they made terraform and some other poo poo closed source. My main takeaway from the announcement was that it mentioned Vagrant and it triggered the Obi-Wan “now there’s a name I haven’t heard” meme Sucks, but still leagues better than the poo poo the Moq dev pulled.
|
# ¿ Aug 17, 2023 04:58 |
|
Hughmoris posted:Sunday Career Chat Highly agree with what others said. I'm working at a FAANG company and as far as I can tell our term for this role isn't shared across the industry, so job titles are a mess, but my job is split between managing a bunch of infrastructure and working on dev projects to build tools to manage it better; I suspect a lot of teams like this wouldn't even have a project as big and semi-structured as ours. Most of the coding my team does is much smaller script stuff, though, so my project is kind of an outlier. Python is probably a major language for this sort of thing once you get past Bash/PowerShell (depending on your infrastructure), and it's what we primarily use for all of our internal tooling, although we have a few Java projects floating around too. Python is a great middle-ground language for this kind of work because it's robust enough to support all sorts of patterns while remaining easy enough to read and write that you can slap out code very quickly. (You can also write some terrible inscrutable code in it, but coding standards can help stave that off.) It doesn't stand up to Java or C# on performance by any means, but as it turns out, if what you need to do is 'modify this AWS account thing to do X instead of Y' it doesn't fuckin' matter if it takes 2 seconds instead of 0.02, since the human running it by hand and typing in the commands will take longer than the execution anyway. https://books.google.com/books?id=81UrjwEACAAJ is the book that, as far as I can tell, helped popularize the idea of SRE, and as Docjowles said it was more 'hey, what if a dev learned literally anything about operations' and less 'what if an ops guy could write...code??!??'. Different companies will have different approaches to that spectrum, and honestly some teams have people at every point along it; my team has a good scattering of folks at all points, and I think I'm probably somewhere solidly in the middle. 
Our coding bar for (what do you call the level between junior and senior? That level) interviews tends to be along the lines of 'can you do a reasonably straightforward LeetCode question (something medium difficulty)', but otherwise we want to know that you understand infrastructure. From my perspective, that means you can talk about load balancers and the like at a high level, and maybe dig into some specifics, but at our size you don't have to be an expert in any single thing. That said, my company's hiring bar does require coding knowledge, so I'd recommend playing around on LeetCode or Codewars or whatever and getting used to working through algorithm questions. It can trip you up if you're not familiar, but in my experience we don't tend to ask the bizarre ones like 'invert a binary tree', and generally give out questions more along the lines of 'here's an imaginary situation you could conceivably run into; solve it in a shared online editor and talk through your steps'.
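To make that interview style concrete, here's an invented example of the 'practical scenario' flavor (the log format and names are made up, not an actual question we ask): given raw log lines, report the most common error messages.

```python
from collections import Counter

def top_errors(log_lines, n=3):
    """Return the n most frequent ERROR messages from raw log lines.
    Assumes lines look like '<timestamp> LEVEL message'."""
    errors = Counter()
    for line in log_lines:
        if " ERROR " in line:
            # The message is everything after the ERROR marker.
            errors[line.split(" ERROR ", 1)[1].strip()] += 1
    return errors.most_common(n)
```

The point of questions like this isn't the algorithm; it's watching you decompose a fuzzy problem, handle edge cases, and talk through your reasoning out loud.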
|
# ¿ Aug 28, 2023 05:50 |
|
Docjowles posted:Yeah pager duty is the gold standard and unless it is cost prohibitive I would look there first. You can easily manually raise incidents by email or api or whatever you want. It has integrations with literally everything, including slack. It’s an excellent product and I would want a good reason before looking at anything else Absolutely, PagerDuty is a great product. I've used it once or twice and also used Big Tech in-house replacements, which at best tend to be...fine compared to PD. A single outage could cost you more than a few years of PD based on their pricing plans (depending on impact/etc), so I'd definitely recommend it. If there's no financial impact to these issues, or the cost is prohibitive otherwise, then Hadlock might be right: this might be something you can fix via process.
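On the 'raise incidents by API' part: triggering a PagerDuty incident programmatically goes through their Events API v2. A minimal sketch below, from memory of the public v2 docs, so verify field names against current documentation before relying on it:

```python
import json
import urllib.request

PD_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(routing_key, summary, source, severity="critical"):
    # Minimal Events API v2 'trigger' payload; severity must be one of
    # critical/error/warning/info per the docs.
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }

def send_event(event):
    # Actually sending requires a real routing key from a PD service
    # integration; this just POSTs the JSON payload.
    req = urllib.request.Request(
        PD_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The routing key identifies which service (and therefore which escalation policy) gets paged, which is the bit that makes it more useful than ad-hoc email alerting.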
|
# ¿ Oct 2, 2023 20:55 |
|
Some minor notes since I'm on my phone: Help desk is a business group that helps users with their computers, not a product category. I think you're thinking of a ticketing system instead. If you're already using Jira there might be integrations available for PD but I've never used Jira so no idea.
|
# ¿ Oct 3, 2023 03:47 |
|
Trapick posted:If it prevents me getting a 2am call yes absolutely thank you. You just get a call manually from your boss instead and he's going to be a lot grumpier than the pager would be.
|
# ¿ Oct 3, 2023 23:29 |
|
IMO Basically your entire company should, if at all possible, use the same ticketing system. I recognize that means that some uses are going to be suboptimal but that's not nearly as much of a pain in the rear end as the alternative.
|
# ¿ Oct 5, 2023 23:35 |
|
When I worked at Microsoft there was a period of time where my team had six different ticketing systems to use that had zero interoperability, and it was a loving nightmare. I also used Visual Studio Online / Azure DevOps as our incident management tooling, so that was an interesting experience. Before I left, Azure had done a reasonably good job of browbeating everyone into the same internal tool for ticketing, which was a huge relief, although it didn't include task management, so you still basically had ADO + InternalTicketingSystem. My new BigTechCompany has literally one system with different frontends for task/incident work, and while it's also not perfect it's such a better setup. (Side note: obviously customer-facing ticketing is probably not something you can easily integrate with your internal one for safety reasons, so ignore that.)
|
# ¿ Oct 6, 2023 18:33 |
|
Anyone have any familiarity with using Jupyter Notebooks for operational runbook work? Netflix has been talking about it for years and I'm about to pitch something similar on my team, curious if anyone has experience with it.
|
# ¿ Nov 25, 2023 20:08 |
|
12 rats tied together posted:I found that it doesn't actually solve any problems in a meaningful way for me and instead introduces a ton of new ones like managing the entire notebook ecosystem. Yeah, the trick is that this is for a larger team, so some standardization is probably appropriate. We will have to figure out the managing the notebook ecosystem problem for sure, but our team is large enough that's probably not impossible, and I think the benefits of tightly coupling the documentation to the actions is pretty meaningful. We'd also be using JupyterHub or something hosted, instead of just 'alright folks, go make some notebooks', so that helps with a lot of the weird python overhead of environments/etc. Textualize is neat, and I'm a fan, but it isn't really an analogue to Jupyter Notebooks other than 'quick graphical UI', and that's not really the important part for me about the proposed setup. You could totally do a pseudonotebook setup with cells of rich text/etc, it just seems like an awful lot of overhead. For some context, we have some runbooks that are an awful lot of 'Grab a bunch of data and review it, then do something based on that output'. Long term, we want to turn a lot of this into fully automated solutions using stuff like SQS and Step Functions/etc, but there's a huge gap between '100% paper docs' and 'fully automated solutions', and I see this as sort of a middle ground that allows for a lot of iterative improvement on the process.
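To make the 'grab data, review it, then act' shape concrete, one runbook cell roughly looks like the sketch below. `get_queue_depths` is a hypothetical stand-in for whatever internal tooling (or boto3 call) the real notebook would use:

```python
# One runbook "cell": gather data, display it, and let the operator
# decide before the next cell takes action.

def get_queue_depths(queues):
    """Pretend data-gathering step; returns {queue_name: depth}.
    A real cell would call internal tooling or an AWS API here."""
    return {q: 0 for q in queues}

def review_step(queues, threshold=1000):
    depths = get_queue_depths(queues)
    # Surface only the queues that need a human decision; the notebook
    # renders this output inline, right next to the documentation.
    return {q: d for q, d in depths.items() if d > threshold}
```

That inline rendering is the tight coupling of docs to actions mentioned above: the instructions, the code, and the output all live in one artifact instead of a wiki page pointing at a script.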
|
# ¿ Nov 25, 2023 21:57 |
|
12 rats tied together posted:Yeah, I get it, I just didn't find that editing python code in cells and running the cells really lent itself well to what I was looking for, which was "psychological safety for ansible playbooks" (analogous to step functions in this context). I expect that if we had a singular interface to our infrastructure like Ansible that probably would be more useful, but we don't (for a bunch of reasons beyond my immediate scope, unfortunately). The editing cell stuff is mostly fluff for us as well and I plan on locking down runbook cells to read-only for most sections, but it has a few benefits - namely, giving folks not as used to Python a bit more visibility, and some ability for incremental changes. You can probably (successfully) argue that if I'm not using the live-edit feature of JupyterHub I'm missing the point, but if there's an alternative that doesn't involve me rolling my own entire solution, I'd use that instead. So I guess that's a good question - is there anything like that? The other ancillary benefit is having a section (or second deployment) that is both editable and shareable, so you can collaborate on deeper-dive investigations or other issues that require a lot more going off script or otherwise mucking around.
|
# ¿ Nov 26, 2023 02:22 |
|
Sagacity posted:I worked for a startup that probably does exactly what you want, but I'm not sure how active development is right now. Looking into it, this does a lot that we already have internal solutions for, with agents/etc. I'm not looking for something that solves that much, because we have robust internal solutions for a lot of it that aren't public offerings, so we're not using the G Cloud / AWS stuff all this sort of software ties into. There's apparently a beta for coding providers, so we could maybe write a shim, but it's in Rust, and nobody on my team is going to learn Rust any time soon. For a smaller shop built directly on AWS/etc, I bet this would be pretty cool though.
|
# ¿ Nov 26, 2023 10:42 |
|
12 rats tied together posted:the problems with yaml, including every problem from that article, are best solved by using a version of the yaml spec that is newer than the one from 2006 True, but as the article points out, PyYAML, the most common Python YAML library (which I use because I had never heard of this problem before!), still implements the 1.1 spec and therefore has a lot of the weird problems. This is probably not uncommon in other languages either.
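For anyone who hasn't hit this: the classic symptom is the "Norway problem", where a bare `NO` parses as `False`, so `country: NO` becomes `{'country': False}`. You can see why from YAML 1.1's implicit boolean resolution, sketched here without PyYAML (the pattern mirrors PyYAML's 1.1 bool resolver as I understand it; the 1.2 core schema only treats true/false as booleans):

```python
import re

# Every one of these bare scalars resolves to a boolean under YAML 1.1.
YAML_11_BOOL = re.compile(
    r"^(?:yes|Yes|YES|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF)$"
)

def is_implicit_bool_11(scalar: str) -> bool:
    """True if a YAML 1.1 parser would silently coerce this unquoted
    scalar into a boolean instead of keeping it as a string."""
    return YAML_11_BOOL.match(scalar) is not None
```

The practical fix with PyYAML is just to quote anything suspicious (`country: "NO"`), but you have to know the footgun exists first.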
|
# ¿ Dec 17, 2023 22:23 |
|
Yeah I'd be way more likely to use JSON over xml if it's supposed to be human readable, just format it properly and it's at least fine.
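The 'format it properly' part is a one-liner in most languages; in Python, for instance:

```python
import json

data = {"service": "billing", "ports": [443, 8080], "tls": True}
# indent + sorted keys makes the output readable for humans and stable
# for diffs, which is most of what people actually want from config files
pretty = json.dumps(data, indent=2, sort_keys=True)
print(pretty)
```

It round-trips losslessly too, which is more than can be said for most hand-rolled XML handling.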
|
# ¿ Dec 17, 2023 23:07 |
|
Hadlock posted:My org is basically a newborn baby when it comes to operational maturity. We had an issue today and one guy went to logon to the server to look at the problem and couldn't get on because the other guy was rebooting the box to fix it that way and I haven't looked at the time stamps yet but not sure if the box was locked up or he couldn't get on the machine because it was rebooting. Too many cooks in the kitchen So I'll start by just saying up front I don't have any convenient docs I can link you, but I can at least probably confirm your suspicions about process; I managed high-impact outages for a major tech company for close to ten years so I've got a lot of experience in the area. I've written some blog posts around oncall that might be relevant so let me know if you're interested. Docjowles posted:You don’t have to go 0 to 60 in a day in terms of policy and procedure around incidents. But some baseline level of communication around who is responding and providing status out to the rest of the org is a good start. Cause yeah I’ve been there where you’re knee deep in troubleshooting / remediation and then some other This is a solid bit of starting advice, all of it - especially the 'don't try and jump straight to perfect'. Start here for sure, but I'll add in a bit more detail. Bhodi posted:without trying to be mean, a realistic look at your org’s willingness to adopt any of those policies and your personal resiliency at how hard and long you’re willing to tilt at a windmill. how you might take those docs and that training getting entirely ignored or actively resisted. This is also good advice and definitely ties into the 'don't jump straight to perfect' comment; crisis / incident management stuff tends to be bad news delivery and people are going to resent it if you go too hard, but you can generally convince people of basic poo poo, especially team members/etc. Pick your battles, basically. 
So, going into more detail: when something happens, you should have a clear idea of who's working on it. Generally, I'd say this should mean 'there's a ticket in our ticketing system and we track that' - the ticket should be assigned to a specific person, or at least a specific role, and when someone starts working on it they should put literally anything into the ticket to indicate they've begun investigating. "ack", "looking", etc. are all perfectly fine, because the point is that you have a way to say 'oh, X is looking at this.' The goal is to put that information where people are already looking, so ideally people aren't just hopping around trying to fix things without checking for tickets first. When something happens that your company would consider an 'incident' and it didn't generate a ticket, you have a monitoring gap - so go figure out how to make sure there's an automated ticket next time. Ideally, fix it that way instead of going down the path of 'you have to create a ticket by hand for every xyz bullshit', because nobody likes that. It's not bad advice, but nobody likes it, and at your current maturity state it's not worth fighting about. I don't actually know if you ever need a standards doc for this, but if you feel like writing one, keep it brief and then get buy-in from your coworkers/etc. The Atlassian Postmortem Template seems extremely reasonable, although honestly longer than necessary. I'd review the rest of their related docs because I assume it's all probably decent and might get you something you can throw at managers as 'sources'. Falcon2001 fucked around with this message at 02:42 on Jan 20, 2024 |
# ¿ Jan 20, 2024 02:28 |
|
Whenever trying to improve operational excellence, I'd start from 'how do I get people on board with this?' - in this case I'd try and lean on the fact that nobody likes wasting their time or playing a blame game; so trying to get people in the habit of either talking about what they're doing openly, or tracking it in a ticket is something you can help people see the benefits of. Certain operational things (security patching, etc) kind of fall into the 'nobody really likes doing it but we have to', but a lot of things can be argued for from a position of self-interest, or at least a position of 'here's how it helps people you actually give a poo poo about like your coworkers'. Edit: this reminded me of another option. If a ticketing/monitoring system is out of your team's reach or simply impossible to implement, getting people in the habit of declaring what they're doing when they're investigating in a slack channel or something like that could probably be a workable alternative; the point is letting folks know what's going on. Even this is an improvement over 'eight cooks blindly wandering around a kitchen'. Falcon2001 fucked around with this message at 21:34 on Jan 21, 2024 |
# ¿ Jan 21, 2024 21:26 |
|
Docjowles posted:Working on tooling to manage 10k servers because it’s the right choice for the business? Now we’re talking. This is me but it's more like 100k I think.
|
# ¿ Jan 25, 2024 08:14 |
|
Yeah, cloud providers let you scale up and down or experiment without committing capital. Beyond that, there's a lot of useful services and stuff; it's not just "rent a server in the cloud".
|
# ¿ Jan 26, 2024 06:11 |
|
It is worth noting that on-prem is a bit of a pain in its own right, because you're having to manage that hardware somehow. If you own the datacenter then you've got to hire staff to run it, maintain the equipment, pay for the power, run the HVAC, etc. Like yeah, an EC2 VM isn't cheap, but people definitely forget about all the things you don't have to worry about directly when you're on a cloud provider. I think there's still plenty of reasons to do on-prem, but make sure you're thinking through the whole cost of ownership. Similarly, I agree that the hosted services like S3 / SQS / DDB (and their Azure/GCloud equivalents) are all extremely useful too, and IMO are one of the biggest arguments for a cloud provider; they often help you avoid things like 'oh yeah, we need a server to host files sometimes' or 'I guess we need to build up a message bus, so that's a few more servers'. At certain scale levels it's a total slam dunk from a cost perspective. I was extremely skeptical of cloud offerings back when my only exposure was 'lift and shift onto the cloud', but once you start thinking about the service offerings and the reasonably easy interoperation between those services it gets a lot more interesting.
|
# ¿ Jan 27, 2024 01:41 |
|
drunk mutt posted:Pick, the, right, tool, for, the, job. It's weird, I could have sworn you said "More unnecessary EC2 VMs".
|
# ¿ Jan 27, 2024 09:51 |
|
Hadlock posted:Is there a good boilerplate policy for disaster recovery SLA with the c suite? I was casually talking about 24 hours for basic functionality, and 7 days to return to full functionality I've done some DR exercises before, and I think the biggest point is that there isn't really a flat policy. DR is basically always 'in scenario X, our time to recover is Y', and you have to pick how big a problem you think you can solve. For example, some of our DR SLAs in that role were 5 minutes, because it was a stateless frontend service with automatic error detection and failover, so for the scenario of 'the host pool in X region goes down' our DR was 5 minutes. On the other hand, our DR for 'a meteor strikes {hq_city}, wiping out the campus and all occupants' was our actual line in the sand: 'anything this problematic or worse is out of scope; it would be disastrous to our business, and it's close-up-shop time'. Most things fell somewhere in the middle and had scaling responses. Most of the DR stuff comes down to 'if feature/functionality/service X goes down, do you have a plan to recover from it, and how long will it take to execute the plan?' - so if you want a boilerplate, that's the most straightforward answer. You also need to test your DR - an untested DR plan is meaningless, even if you're just walking through the failover steps. I would also note that it's not necessarily a bad thing to say 'this would require a manual rebuild of XYZ and would take an estimated Z months', because if that really is the answer then it's worth leadership understanding it, and then figuring out how much the work to offset it would cost. Sometimes, especially if your budget is shoestring or the service isn't that important, that line ends up pretty low on the priority list. Edit: gently caress it, I'll keep going. I'm not familiar with standards for this stuff, so there might be one I'm not aware of. 
I would say if there isn't, book dedicated time to sit down and start looking at your system. Identify all the most likely failure points, and then document how you'd approach recovering from it. This should include stuff like 'what if our main datacenter goes offline due to an idiot with a backhoe / cooling failure / etc', and stuff like that. Make sure there's documentation for that approach, and that the documentation has some sort of mechanism to stay up to date over time, so you don't go to enact it to find out you changed your networking stack since it was written, and now you're having to ad-hoc gently caress with DNS while leadership is trying to crawl all the way up your rear end in a top hat in hopes of puppeting your body to a faster mitigation time. If the service you're investigating is some random feature that could go down for a while, be less worried about it. If it's the main service that keeps your company in the green financially, worry more about it. Falcon2001 fucked around with this message at 05:01 on Feb 11, 2024 |
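The 'in scenario X, our time to recover is Y' framing reduces to a table of scenarios, not a single number. An illustrative sketch (every scenario and RTO here is made up):

```python
# A DR plan is a mapping of failure scenarios to recovery expectations.
# Entries are invented for illustration; 'tested' is the field that
# actually matters, because an untested plan is meaningless.
DR_PLAN = {
    "single host fails":            {"rto": "5 minutes", "tested": True},
    "primary region offline":       {"rto": "4 hours",   "tested": True},
    "primary datacenter destroyed": {"rto": "2 weeks",   "tested": False},
    "meteor hits HQ":               {"rto": None,        "tested": False},  # declared out of scope
}

def untested(plan):
    """Scenarios whose recovery plan exists only on paper."""
    return [s for s, v in plan.items() if not v["tested"]]
```

Walking leadership through a table like this, including the explicit out-of-scope row, tends to be a much more productive conversation than arguing over one global SLA number.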
# ¿ Feb 11, 2024 04:57 |
|
Hadlock posted:I worked at a place that did real time trading. You've never heard of them but it was a thing they offered. Anyways as a result they were regulated by FINRA and had a full, manned , DR site in some basically empty nondescript 7 story office building near a major interstate, full of decade old desktops that were powered on just rotting running a fully patched copy of Windows 7 enterprise and dust covers on the keyboards, and big signs hanging from the ceiling saying "ACCOUNTING" and "CLEARING" and "TRADE DESK" etc. full on "meteor strikes hq building" backup. Before I left I raided their office supply cabinet (that had probably never been opened) for a very nice collection of wilcott flexible stainless steel rulers. Very "liminal spaces" type space Earlier in my career, I went from working on a service that was almost entirely a query-based stateless service with a sub-five-minute failover time, to working in the org with payments and billing and stuff. The first job didn't even bother doing DR drills for the most part, because we were constantly shifting traffic around every day anyway (and if we hadn't had a real outage to exercise something like a region going offline, they'd test it periodically; a lot of our DR systems were just baked into normal operation). At the new job, I walked in and got an email talking about how excited they were that they'd managed to fail over to a secondary region and it only took seventy-two straight hours, handing off from person to person the whole goddamn weekend. At least some of those people were overseas, but yeah. Crazy weekend, and they were SO loving EXCITED. I was horrified. Edit two: Oh yeah, any process that only one person performs is a huge red flag for weird hidden knowledge that never got documented. I'm handling something like that now and the amount of weird bullshit we've dug up is crazy. Falcon2001 fucked around with this message at 10:55 on Feb 11, 2024 |
# ¿ Feb 11, 2024 10:49 |
|
Hadlock posted:So I ran across a blog post the other day that had an interesting term, "reference architecture" specific to platform architecture/DevOps and that's sent me deep down a philosophical rabbit hole. I've really been struggling to find/define "best practices" or "state of the art" I think it's loosely defined as "containers using git ops and iac" I suspect the answer is "if it's simple enough to make into a medium.com article it's useless, but if it's complicated enough to be real-world usable, it's too complex to summarize that quickly." A lot of business processes are highly dependent on your company; we recently defined our fleet standards and it came out to...thousands of various bits and bobs at the end of the day. Many companies might never care about any of that, but the scale we operate at changes a lot of what we care about - including that some things a smaller company would care about, we simply ignore because they're not meaningful at scale. That doesn't make our approach right for you, and I wouldn't recommend copying it. Another way to think of this is how drug companies do manufacturing - they tend to have a 'platform', sort of a default state; you can go off-platform, but it means you're on the hook to define the deviations from that platform, and anyone you work with inside the company isn't guaranteed to know how you operate. The term 'platform' is pretty overloaded in IT, but I think it's a reasonable way of looking at things, so what you're basically trying to do is set up a default 'platform', if I'm understanding you correctly. I think even a naive or short-sighted approach to that problem will teach you and your company a lot, so I think you're heading in a good direction. (That's a lot of words to fail to give you what you're asking for, unfortunately.)
|
# ¿ Feb 24, 2024 01:28 |
|
The Fool posted:prepare for disappointment The chance that any LLM manages to figure this out from the lovely documentation every company has is essentially somehow a negative number. I assume at this point that the marketing team for copilot at MSFT is basically wholly disconnected from any actual engineering team and is just trying to get as much cocaine as possible before the hype dies.
|
# ¿ Feb 28, 2024 06:02 |
|
FISHMANPET posted:I still want to know when Omegastar will support ISO dates (there was a video from November 2021 where it was delayed again). I was working with a junior on my team and told him "Anytime someone tells you in software development that there is only ever one right answer to a problem, they're absolutely full of poo poo, with one exception. ISO-8601 is the only acceptable datetime string format, and anyone who says otherwise is a terrible person with brain worms. But yeah everything else has multiple answers." I'm right. The Iron Rose posted:the best cloud bullshit YouTuber out with yet another banger Dunno if this is common knowledge, but he works at one of the big FAANG companies, which explains an awful lot. I know a few people who know him IRL. This latest video felt like he was spying on me.
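For the record, the one right answer is a one-liner in Python's stdlib, and ISO-8601 has the nice property that lexical sort order matches chronological order:

```python
from datetime import datetime, timezone

ts = datetime(2024, 3, 2, 20, 15, tzinfo=timezone.utc)
# isoformat() emits ISO-8601; with a tz attached you also get the offset
stamp = ts.isoformat()
print(stamp)  # 2024-03-02T20:15:00+00:00
```

It also round-trips cleanly through `datetime.fromisoformat`, so you never need a strptime format string for your own timestamps.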
|
# ¿ Mar 2, 2024 20:15 |
|
necrobobsledder posted:I don't blame a lot of people for staying far behind modern software trends given how incredibly fashion-driven our silly industry is for supposedly such an "engineering" culture we're supposed to have. But I guess it's about resume-driven development for making sure we don't get stuck at companies that pay on the other end of the bimodal distribution of software, which is honestly the dominant part of the distribution of our industry. The book Kill It With Fire does a good job of talking about the benefits of monoliths without seeming like reactionary reverse-hipster stuff - basically that there's a lot of benefits to them, and if you try and preemptively scale your stuff into microservices you introduce a lot of inefficiencies to the development process that aren't great. It also doesn't pretend that monoliths don't have problems and talks about the right time to migrate away/etc. I really do love that book.
|
# ¿ Mar 11, 2024 20:04 |
|
|
My big tech company uses (as far as I know) a title for SRE/devops that's basically unique to us, so I'm vaguely screwed. It does say Developer in the title though so that's nice.
|
# ¿ Apr 5, 2024 07:06 |