|
Dunno how they are for companies, but I've been using Porkbun and reasonably happy with them. They have an API (I haven't used it) so it seems like there's at least the possibility of running it at some level of scale.
|
# ¿ Jun 18, 2023 20:16 |
|
|
# ¿ May 3, 2024 23:07 |
|
Collateral Damage posted:If any of you use Moq in your testing you should probably yank it out, or at the very least pin it at version 4.18 This is a loving landmine and y'all should read it. Basically the maintainer gets mad he isn't getting paid for OSS (which like, fair, real problem) and decides the way to fix it is to bundle a closed-source compiled dll into his project that scrapes your email from your git repo, hashes it with SHA-256, and uploads it to his server so he can pause your loving build and nag you for not being a PERSONAL loving SPONSOR. This is literally about nagging actual individual devs, not trying to get companies to cough up.
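For reference, the reported mechanism is trivial to sketch. This is a hedged illustration of what "scrape the git email and SHA-256 it" looks like in Python - the function name and normalization are mine, not Moq's actual code:

```python
import hashlib

def hash_email(email: str) -> str:
    """Sketch of the reported SponsorLink behavior: normalize the git
    user.email and hash it with SHA-256 before uploading. (Hypothetical
    reconstruction, not the actual closed-source implementation.)"""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Worth noting that hashing an email like this isn't anonymization in any meaningful sense: anyone holding a known address can hash it and compare, which is a big part of why people treated this as PII exfiltration rather than telemetry.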
|
# ¿ Aug 10, 2023 07:50 |
|
Docjowles posted:Speaking of Hashicorp apparently they made terraform and some other poo poo closed source. My main takeaway from the announcement was that it mentioned Vagrant and it triggered the Obi-Wan “now there’s a name I haven’t heard” meme Sucks, but still leagues better than the poo poo the Moq dev pulled.
|
# ¿ Aug 17, 2023 04:58 |
|
Hughmoris posted:Sunday Career Chat Highly agree with what others said. I'm working at a FAANG company and as far as I can tell our term for this role isn't shared across the industry, so job titles are a mess, but my job is split between managing a bunch of infrastructure and working on dev projects to build tools to manage it better; I suspect a lot of teams like this wouldn't even have a project as big and semi-structured as ours. Most of the coding my team does is much smaller script stuff, though, so my project is kind of an outlier. Python is probably a major language for this sort of thing once you get past Bash/PowerShell (depending on your infrastructure), and it's what we primarily use for all of our internal tooling, although we have a few Java projects floating around too. Python is a great middle-ground language for this kind of work because it's robust enough to support all sorts of patterns while remaining easy enough to read and write that you can slap out code very quickly. (You can also write some terrible inscrutable code in it, but coding standards can help stave that off.) It doesn't stand up to Java or C# on performance by any means, but as it turns out, if what you need to do is 'modify this AWS account thing to do X instead of Y' it doesn't fuckin' matter if it takes 2 seconds instead of 0.02, since the human running it by hand and typing in the commands will take longer than the execution anyway. https://books.google.com/books?id=81UrjwEACAAJ is the book that, as far as I can tell, helped popularize the idea of SRE, and as Docjowles said it was more 'hey, what if a dev learned literally anything about operations' and less 'what if an ops guy could write...code??!??'. Different companies will have different approaches to that spectrum, and honestly some teams have people at every point along it; my team has a good scattering of folks at all points, and I think I'm probably somewhere solidly in the middle. 
Our coding bar for (what do you call the level between junior and senior? That level) interviews tends to be along the lines of 'can you do a reasonably straightforward LeetCode question (something medium difficulty)', but otherwise we want to know that you understand infrastructure. From my perspective, that means you can talk about load balancers and the like at a high level, and maybe dig into some specifics, but at our size you don't have to be an expert in any single thing. That said, my company's hiring bar does require coding knowledge, so I'd recommend playing around on LeetCode or Codewars or whatever and getting used to working through algorithm questions. It can trip you up if you're not familiar, but in my experience we don't tend to ask the bizarre ones like 'invert a binary tree', and generally give out questions more along the lines of 'here's an imaginary situation you could conceivably run into; solve it in a shared online editor and talk through your steps'.
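To make that interview style concrete, here's an invented example of the 'practical scenario' flavor (the log format and names are made up, not an actual question we ask): given raw log lines, report the most common error messages.

```python
from collections import Counter

def top_errors(log_lines, n=3):
    """Return the n most frequent ERROR messages from raw log lines.
    Assumes lines look like '<timestamp> LEVEL message'."""
    errors = Counter()
    for line in log_lines:
        if " ERROR " in line:
            # The message is everything after the ERROR marker.
            errors[line.split(" ERROR ", 1)[1].strip()] += 1
    return errors.most_common(n)
```

The point of questions like this isn't the algorithm; it's watching you decompose a fuzzy problem, handle edge cases, and talk through your reasoning out loud.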
|
# ¿ Aug 28, 2023 05:50 |
|
Docjowles posted:Yeah pager duty is the gold standard and unless it is cost prohibitive I would look there first. You can easily manually raise incidents by email or api or whatever you want. It has integrations with literally everything, including slack. It’s an excellent product and I would want a good reason before looking at anything else Absolutely, PagerDuty is a great product. I've used it once or twice and also used Big Tech in-house replacements, which at best tend to be...fine compared to PD. A single outage could cost you more than a few years of PD based on their pricing plans (depending on impact/etc), so I'd definitely recommend it. If there's no financial impact to these issues, or the cost is prohibitive otherwise, then Hadlock might be right: this might be something you can fix via process.
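On the 'raise incidents by API' part: triggering a PagerDuty incident programmatically goes through their Events API v2. A minimal sketch below, from memory of the public v2 docs, so verify field names against current documentation before relying on it:

```python
import json
import urllib.request

PD_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(routing_key, summary, source, severity="critical"):
    # Minimal Events API v2 'trigger' payload; severity must be one of
    # critical/error/warning/info per the docs.
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }

def send_event(event):
    # Actually sending requires a real routing key from a PD service
    # integration; this just POSTs the JSON payload.
    req = urllib.request.Request(
        PD_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The routing key identifies which service (and therefore which escalation policy) gets paged, which is the bit that makes it more useful than ad-hoc email alerting.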
|
# ¿ Oct 2, 2023 20:55 |
|
Some minor notes since I'm on my phone: Help desk is a business group that helps users with their computers, not a product category. I think you're thinking of a ticketing system instead. If you're already using Jira there might be integrations available for PD but I've never used Jira so no idea.
|
# ¿ Oct 3, 2023 03:47 |
|
Trapick posted:If it prevents me getting a 2am call yes absolutely thank you. You just get a call manually from your boss instead and he's going to be a lot grumpier than the pager would be.
|
# ¿ Oct 3, 2023 23:29 |
|
IMO Basically your entire company should, if at all possible, use the same ticketing system. I recognize that means that some uses are going to be suboptimal but that's not nearly as much of a pain in the rear end as the alternative.
|
# ¿ Oct 5, 2023 23:35 |
|
When I worked at Microsoft there was a period of time where my team had six different ticketing systems to use that had zero interoperability, and it was a loving nightmare. I also used Visual Studio Online / Azure DevOps as our incident management tooling, so that was an interesting experience. Before I left, Azure had done a reasonably good job of browbeating everyone into the same internal tool for ticketing, which was a huge relief, although it didn't include task management, so you still basically had ADO + InternalTicketingSystem. My new BigTechCompany has literally one system with different frontends for task/incident work, and while it's also not perfect it's such a better setup. (Side note: obviously customer-facing ticketing is probably not something you can easily integrate with your internal one for safety reasons, so ignore that.)
|
# ¿ Oct 6, 2023 18:33 |
|
Anyone have any familiarity with using Jupyter Notebooks for operational runbook work? Netflix has been talking about it for years and I'm about to pitch something similar on my team, curious if anyone has experience with it.
|
# ¿ Nov 25, 2023 20:08 |
|
12 rats tied together posted:I found that it doesn't actually solve any problems in a meaningful way for me and instead introduces a ton of new ones like managing the entire notebook ecosystem. Yeah, the trick is that this is for a larger team, so some standardization is probably appropriate. We will have to figure out the managing the notebook ecosystem problem for sure, but our team is large enough that's probably not impossible, and I think the benefits of tightly coupling the documentation to the actions is pretty meaningful. We'd also be using JupyterHub or something hosted, instead of just 'alright folks, go make some notebooks', so that helps with a lot of the weird python overhead of environments/etc. Textualize is neat, and I'm a fan, but it isn't really an analogue to Jupyter Notebooks other than 'quick graphical UI', and that's not really the important part for me about the proposed setup. You could totally do a pseudonotebook setup with cells of rich text/etc, it just seems like an awful lot of overhead. For some context, we have some runbooks that are an awful lot of 'Grab a bunch of data and review it, then do something based on that output'. Long term, we want to turn a lot of this into fully automated solutions using stuff like SQS and Step Functions/etc, but there's a huge gap between '100% paper docs' and 'fully automated solutions', and I see this as sort of a middle ground that allows for a lot of iterative improvement on the process.
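To make the 'grab data, review it, then act' shape concrete, one runbook cell roughly looks like the sketch below. `get_queue_depths` is a hypothetical stand-in for whatever internal tooling (or boto3 call) the real notebook would use:

```python
# One runbook "cell": gather data, display it, and let the operator
# decide before the next cell takes action.

def get_queue_depths(queues):
    """Pretend data-gathering step; returns {queue_name: depth}.
    A real cell would call internal tooling or an AWS API here."""
    return {q: 0 for q in queues}

def review_step(queues, threshold=1000):
    depths = get_queue_depths(queues)
    # Surface only the queues that need a human decision; the notebook
    # renders this output inline, right next to the documentation.
    return {q: d for q, d in depths.items() if d > threshold}
```

That inline rendering is the tight coupling of docs to actions mentioned above: the instructions, the code, and the output all live in one artifact instead of a wiki page pointing at a script.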
|
# ¿ Nov 25, 2023 21:57 |
|
12 rats tied together posted:Yeah, I get it, I just didn't find that editing python code in cells and running the cells really lent itself well to what I was looking for, which was "psychological safety for ansible playbooks" (analogous to step functions in this context). I expect that if we had a singular interface to our infrastructure like Ansible that probably would be more useful, but we don't (for a bunch of reasons beyond my immediate scope, unfortunately). The editing cell stuff is mostly fluff for us as well and I plan on locking down runbook cells to read-only for most sections, but it has a few benefits - namely, giving folks not as used to Python a bit more visibility, and some ability for incremental changes. You can probably (successfully) argue that if I'm not using the live-edit feature of JupyterHub I'm missing the point, but if there's an alternative that doesn't involve me rolling my own entire solution, I'd use that instead. So I guess that's a good question - is there anything like that? The other ancillary benefit is having a section (or second deployment) that is both editable and shareable, so you can collaborate on deeper-dive investigations or other issues that require a lot more going off script or otherwise mucking around.
|
# ¿ Nov 26, 2023 02:22 |
|
Sagacity posted:I worked for a startup that probably does exactly what you want, but I'm not sure how active development is right now. Looking into it, this does a lot that we already have internal solutions for, with agents/etc. I'm not looking for something that solves that much, because we have robust internal solutions for a lot of it that aren't public offerings, so we're not using the G Cloud / AWS stuff all this sort of software ties into. There's apparently a beta for coding providers, so we could maybe write a shim, but it's in Rust, and nobody on my team is going to learn Rust any time soon. For a smaller shop built directly on AWS/etc, I bet this would be pretty cool though.
|
# ¿ Nov 26, 2023 10:42 |
|
12 rats tied together posted:the problems with yaml, including every problem from that article, are best solved by using a version of the yaml spec that is newer than the one from 2006 True, but as the article points out, PyYAML, the most common Python YAML library (which I use because I had never heard of this problem before!), still implements the 1.1 spec and therefore has a lot of the weird problems. This is probably not uncommon in other languages either.
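For anyone who hasn't hit this: the classic symptom is the "Norway problem", where a bare `NO` parses as `False`, so `country: NO` becomes `{'country': False}`. You can see why from YAML 1.1's implicit boolean resolution, sketched here without PyYAML (the pattern mirrors PyYAML's 1.1 bool resolver as I understand it; the 1.2 core schema only treats true/false as booleans):

```python
import re

# Every one of these bare scalars resolves to a boolean under YAML 1.1.
YAML_11_BOOL = re.compile(
    r"^(?:yes|Yes|YES|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF)$"
)

def is_implicit_bool_11(scalar: str) -> bool:
    """True if a YAML 1.1 parser would silently coerce this unquoted
    scalar into a boolean instead of keeping it as a string."""
    return YAML_11_BOOL.match(scalar) is not None
```

The practical fix with PyYAML is just to quote anything suspicious (`country: "NO"`), but you have to know the footgun exists first.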
|
# ¿ Dec 17, 2023 22:23 |
|
Yeah I'd be way more likely to use JSON over xml if it's supposed to be human readable, just format it properly and it's at least fine.
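The 'format it properly' part is a one-liner in most languages; in Python, for instance:

```python
import json

data = {"service": "billing", "ports": [443, 8080], "tls": True}
# indent + sorted keys makes the output readable for humans and stable
# for diffs, which is most of what people actually want from config files
pretty = json.dumps(data, indent=2, sort_keys=True)
print(pretty)
```

It round-trips losslessly too, which is more than can be said for most hand-rolled XML handling.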
|
# ¿ Dec 17, 2023 23:07 |
|
Hadlock posted:My org is basically a newborn baby when it comes to operational maturity. We had an issue today and one guy went to logon to the server to look at the problem and couldn't get on because the other guy was rebooting the box to fix it that way and I haven't looked at the time stamps yet but not sure if the box was locked up or he couldn't get on the machine because it was rebooting. Too many cooks in the kitchen So I'll start by just saying up front I don't have any convenient docs I can link you, but I can at least probably confirm your suspicions about process; I managed high-impact outages for a major tech company for close to ten years so I've got a lot of experience in the area. I've written some blog posts around oncall that might be relevant so let me know if you're interested. Docjowles posted:You don’t have to go 0 to 60 in a day in terms of policy and procedure around incidents. But some baseline level of communication around who is responding and providing status out to the rest of the org is a good start. Cause yeah I’ve been there where you’re knee deep in troubleshooting / remediation and then some other This is a solid bit of starting advice, all of it - especially the 'don't try and jump straight to perfect'. Start here for sure, but I'll add in a bit more detail. Bhodi posted:without trying to be mean, a realistic look at your org’s willingness to adopt any of those policies and your personal resiliency at how hard and long you’re willing to tilt at a windmill. how you might take those docs and that training getting entirely ignored or actively resisted. This is also good advice and definitely ties into the 'don't jump straight to perfect' comment; crisis / incident management stuff tends to be bad news delivery and people are going to resent it if you go too hard, but you can generally convince people of basic poo poo, especially team members/etc. Pick your battles, basically. 
So, going into more detail: when something happens, you should have a clear idea of who's working on it. Generally, I'd say this should mean 'there's a ticket in our ticketing system and we track that' - the ticket should be assigned to a specific person, or at least a specific role, and when someone starts working on it they should put literally anything into the ticket to indicate they've begun investigating. "ack", "looking", etc. are all perfectly fine, because the point is that you have a way to say 'oh, X is looking at this.' The goal is to put that information where people are already looking, so ideally people aren't just hopping around trying to fix things without checking for tickets first. When something happens that your company would consider an 'incident' and it didn't generate a ticket, you have a monitoring gap - so go figure out how to make sure there's an automated ticket next time. Ideally, fix it that way instead of going down the path of 'you have to create a ticket by hand for every xyz bullshit', because nobody likes that. It's not bad advice, but nobody likes it, and at your current maturity state it's not worth fighting about. I don't actually know if you ever need a standards doc for this, but if you feel like writing one, keep it brief and then get buy-in from your coworkers/etc. The Atlassian Postmortem Template seems extremely reasonable, although honestly longer than necessary. I'd review the rest of their related docs because I assume it's all probably decent and might get you something you can throw at managers as 'sources'. Falcon2001 fucked around with this message at 02:42 on Jan 20, 2024 |
# ¿ Jan 20, 2024 02:28 |
|
Whenever trying to improve operational excellence, I'd start from 'how do I get people on board with this?' - in this case I'd try and lean on the fact that nobody likes wasting their time or playing a blame game; so trying to get people in the habit of either talking about what they're doing openly, or tracking it in a ticket is something you can help people see the benefits of. Certain operational things (security patching, etc) kind of fall into the 'nobody really likes doing it but we have to', but a lot of things can be argued for from a position of self-interest, or at least a position of 'here's how it helps people you actually give a poo poo about like your coworkers'. Edit: this reminded me of another option. If a ticketing/monitoring system is out of your team's reach or simply impossible to implement, getting people in the habit of declaring what they're doing when they're investigating in a slack channel or something like that could probably be a workable alternative; the point is letting folks know what's going on. Even this is an improvement over 'eight cooks blindly wandering around a kitchen'. Falcon2001 fucked around with this message at 21:34 on Jan 21, 2024 |
# ¿ Jan 21, 2024 21:26 |
|
Docjowles posted:Working on tooling to manage 10k servers because it’s the right choice for the business? Now we’re talking. This is me but it's more like 100k I think.
|
# ¿ Jan 25, 2024 08:14 |
|
Yeah, cloud providers let you scale up and down or experiment without committing capital. Beyond that, there's a lot of useful services and stuff; it's not just "rent a server in the cloud".
|
# ¿ Jan 26, 2024 06:11 |
|
It is worth noting that on-prem is a bit of a pain in its own right, because you're having to manage that hardware somehow. If you own the datacenter then you've got to hire staff to run it, maintain the equipment, pay for the power, run the HVAC, etc. Like yeah, an EC2 VM isn't cheap, but people definitely forget about all the things you don't have to worry about directly when you're on a cloud provider. I think there's still plenty of reasons to do on-prem, but make sure you're thinking through the whole cost of ownership. Similarly, I agree that the hosted services like S3 / SQS / DDB (and their Azure/GCloud equivalents) are all extremely useful too, and IMO are one of the biggest arguments for a cloud provider; they often help you avoid things like 'oh yeah, we need a server to host files sometimes' or 'I guess we need to build up a message bus, so that's a few more servers'. At certain scale levels it's a total slam dunk from a cost perspective. I was extremely skeptical of cloud offerings back when my only exposure was 'lift and shift onto the cloud', but once you start thinking about the service offerings and the reasonably easy interoperation between those services it gets a lot more interesting.
|
# ¿ Jan 27, 2024 01:41 |
|
drunk mutt posted:Pick, the, right, tool, for, the, job. It's weird, I could have sworn you said "More unnecessary EC2 VMs".
|
# ¿ Jan 27, 2024 09:51 |
|
Hadlock posted:Is there a good boilerplate policy for disaster recovery SLA with the c suite? I was casually talking about 24 hours for basic functionality, and 7 days to return to full functionality I've done some DR exercises before, and I think the biggest point is that there isn't really a flat policy. DR is basically always 'in scenario X, our time to recover is Y', and you have to pick how big a problem you think you can solve. For example, some of our DR SLAs in that role were 5 minutes, because it was a stateless frontend service with automatic error detection and failover, so for the scenario of 'the host pool in X region goes down' our DR was 5 minutes. On the other hand, our DR for 'a meteor strikes {hq_city}, wiping out the campus and all occupants' was our actual line in the sand: 'anything this problematic or worse is out of scope; it would be disastrous to our business, and it's close-up-shop time'. Most things fell somewhere in the middle and had scaling responses. Most of the DR stuff comes down to 'if feature/functionality/service X goes down, do you have a plan to recover from it, and how long will it take to execute the plan?' - so if you want a boilerplate, that's the most straightforward answer. You also need to test your DR - an untested DR plan is meaningless, even if you're just walking through the failover steps. I would also note that it's not necessarily a bad thing to say 'this would require a manual rebuild of XYZ and would take an estimated Z months', because if that really is the answer then it's worth leadership understanding it, and then figuring out how much the work to offset it would cost. Sometimes, especially if your budget is shoestring or the service isn't that important, that line ends up pretty low on the priority list. Edit: gently caress it, I'll keep going. I'm not familiar with standards for this stuff, so there might be one I'm not aware of. 
I would say if there isn't, book dedicated time to sit down and start looking at your system. Identify all the most likely failure points, and then document how you'd approach recovering from it. This should include stuff like 'what if our main datacenter goes offline due to an idiot with a backhoe / cooling failure / etc', and stuff like that. Make sure there's documentation for that approach, and that the documentation has some sort of mechanism to stay up to date over time, so you don't go to enact it to find out you changed your networking stack since it was written, and now you're having to ad-hoc gently caress with DNS while leadership is trying to crawl all the way up your rear end in a top hat in hopes of puppeting your body to a faster mitigation time. If the service you're investigating is some random feature that could go down for a while, be less worried about it. If it's the main service that keeps your company in the green financially, worry more about it. Falcon2001 fucked around with this message at 05:01 on Feb 11, 2024 |
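The 'in scenario X, our time to recover is Y' framing reduces to a table of scenarios, not a single number. An illustrative sketch (every scenario and RTO here is made up):

```python
# A DR plan is a mapping of failure scenarios to recovery expectations.
# Entries are invented for illustration; 'tested' is the field that
# actually matters, because an untested plan is meaningless.
DR_PLAN = {
    "single host fails":            {"rto": "5 minutes", "tested": True},
    "primary region offline":       {"rto": "4 hours",   "tested": True},
    "primary datacenter destroyed": {"rto": "2 weeks",   "tested": False},
    "meteor hits HQ":               {"rto": None,        "tested": False},  # declared out of scope
}

def untested(plan):
    """Scenarios whose recovery plan exists only on paper."""
    return [s for s, v in plan.items() if not v["tested"]]
```

Walking leadership through a table like this, including the explicit out-of-scope row, tends to be a much more productive conversation than arguing over one global SLA number.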
# ¿ Feb 11, 2024 04:57 |
|
Hadlock posted:I worked at a place that did real time trading. You've never heard of them but it was a thing they offered. Anyways as a result they were regulated by FINRA and had a full, manned , DR site in some basically empty nondescript 7 story office building near a major interstate, full of decade old desktops that were powered on just rotting running a fully patched copy of Windows 7 enterprise and dust covers on the keyboards, and big signs hanging from the ceiling saying "ACCOUNTING" and "CLEARING" and "TRADE DESK" etc. full on "meteor strikes hq building" backup. Before I left I raided their office supply cabinet (that had probably never been opened) for a very nice collection of wilcott flexible stainless steel rulers. Very "liminal spaces" type space Earlier in my career, I went from working on a service that was almost entirely a query-based stateless service with a sub-five-minute failover time, to working in the org with payments and billing and stuff. The first job didn't even bother doing DR drills for the most part, because we were constantly shifting traffic around every day anyway (and if we hadn't had a real outage to exercise something like a region going offline, they'd test it periodically; a lot of our DR systems were just baked into normal operation). At the new job, I walked in and got an email talking about how excited they were that they'd managed to fail over to a secondary region and it only took seventy-two straight hours, handing off from person to person the whole goddamn weekend. At least some of those people were overseas, but yeah. Crazy weekend, and they were SO loving EXCITED. I was horrified. Edit two: Oh yeah, any process that only one person performs is a huge red flag for weird hidden knowledge that never got documented. I'm handling something like that now and the amount of weird bullshit we've dug up is crazy. Falcon2001 fucked around with this message at 10:55 on Feb 11, 2024 |
# ¿ Feb 11, 2024 10:49 |
|
Hadlock posted:So I ran across a blog post the other day that had an interesting term, "reference architecture" specific to platform architecture/DevOps and that's sent me deep down a philosophical rabbit hole. I've really been struggling to find/define "best practices" or "state of the art" I think it's loosely defined as "containers using git ops and iac" I suspect the answer is "if it's simple enough to make into a medium.com article it's useless, but if it's complicated enough to be real-world usable, it's too complex to summarize that quickly." A lot of business processes are highly dependent on your company; we recently defined our fleet standards and it came out to...thousands of various bits and bobs at the end of the day. Many companies might never care about any of that, but the scale we operate at changes a lot of what we care about - including that some things a smaller company would care about, we simply ignore because they're not meaningful at scale. That doesn't make our approach right for you, and I wouldn't recommend copying it. Another way to think of this is how drug companies do manufacturing - they tend to have a 'platform', sort of a default state; you can go off-platform, but it means you're on the hook to define the deviations from that platform, and anyone you work with inside the company isn't guaranteed to know how you operate. The term 'platform' is pretty overloaded in IT, but I think it's a reasonable way of looking at things, so what you're basically trying to do is set up a default 'platform', if I'm understanding you correctly. I think even a naive or short-sighted approach to that problem will teach you and your company a lot, so I think you're heading in a good direction. (That's a lot of words to fail to give you what you're asking for, unfortunately.)
|
# ¿ Feb 24, 2024 01:28 |
|
The Fool posted:prepare for disappointment The chance that any LLM manages to figure this out from the lovely documentation every company has is essentially somehow a negative number. I assume at this point that the marketing team for copilot at MSFT is basically wholly disconnected from any actual engineering team and is just trying to get as much cocaine as possible before the hype dies.
|
# ¿ Feb 28, 2024 06:02 |
|
FISHMANPET posted:I still want to know when Omegastar will support ISO dates (there was a video from November 2021 where it was delayed again). I was working with a junior on my team and told him "Anytime someone tells you in software development that there is only ever one right answer to a problem, they're absolutely full of poo poo, with one exception. ISO-8601 is the only acceptable datetime string format, and anyone who says otherwise is a terrible person with brain worms. But yeah everything else has multiple answers." I'm right. The Iron Rose posted:the best cloud bullshit YouTuber out with yet another banger Dunno if this is common knowledge, but he works at one of the big FAANG companies, which explains an awful lot. I know a few people who know him IRL. This latest video felt like he was spying on me.
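For the record, the one right answer is a one-liner in Python's stdlib, and ISO-8601 has the nice property that lexical sort order matches chronological order:

```python
from datetime import datetime, timezone

ts = datetime(2024, 3, 2, 20, 15, tzinfo=timezone.utc)
# isoformat() emits ISO-8601; with a tz attached you also get the offset
stamp = ts.isoformat()
print(stamp)  # 2024-03-02T20:15:00+00:00
```

It also round-trips cleanly through `datetime.fromisoformat`, so you never need a strptime format string for your own timestamps.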
|
# ¿ Mar 2, 2024 20:15 |
|
necrobobsledder posted:I don't blame a lot of people for staying far behind modern software trends given how incredibly fashion-driven our silly industry is for supposedly such an "engineering" culture we're supposed to have. But I guess it's about resume-driven development for making sure we don't get stuck at companies that pay on the other end of the bimodal distribution of software, which is honestly the dominant part of the distribution of our industry. The book Kill It With Fire does a good job of talking about the benefits of monoliths without seeming like reactionary reverse-hipster stuff - basically that there's a lot of benefits to them, and if you try and preemptively scale your stuff into microservices you introduce a lot of inefficiencies to the development process that aren't great. It also doesn't pretend that monoliths don't have problems and talks about the right time to migrate away/etc. I really do love that book.
|
# ¿ Mar 11, 2024 20:04 |
|
|
My big tech company uses (as far as I know) a title for SRE/devops that's basically unique to us, so I'm vaguely screwed. It does say Developer in the title though so that's nice.
|
# ¿ Apr 5, 2024 07:06 |