|
luminalflux posted:We use SQLAlchemy too, it's fairly easy with sqlalchemy.events.ConnectionEvents.before_cursor_execute to hook the cursor event and modify the query. We've done this with Datadog, where we actually kick open a new span inside this hook, add some tags we glean from various context handlers, and write the span and trace ID inside the comment as well as the callsite. This makes it easy to look at the process list, find the trace and span, but also see which goddamn file it's coming from.
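For anyone who wants to try this, here is a minimal sketch of that kind of hook, not the poster's actual code. The names `build_sql_comment`, `find_callsite`, and `install_comment_hook` are made up for illustration, the trace and span IDs come from a caller-supplied `get_trace_ids` function rather than from ddtrace, and the SQLAlchemy import is deferred so the two helpers can be run on their own.

```python
import traceback


def build_sql_comment(trace_id, span_id, callsite):
    """Format trace metadata as a SQL comment (hypothetical helper)."""
    fields = {"trace_id": trace_id, "span_id": span_id, "callsite": callsite}
    parts = []
    for key, value in fields.items():
        # Strip comment delimiters so a value can never close the comment early.
        clean = str(value).replace("*/", "").replace("/*", "")
        parts.append(f"{key}={clean}")
    return "/* " + ",".join(parts) + " */"


def find_callsite(stack, library_markers=("sqlalchemy",)):
    """Return 'file:line' for the application frame that called into the library.

    `stack` is a list of (filename, lineno) tuples, outermost frame first.
    """
    seen_library = False
    for filename, lineno in reversed(stack):
        if any(marker in filename for marker in library_markers):
            seen_library = True
        elif seen_library:
            # First non-library frame above the library frames: the caller.
            return f"{filename}:{lineno}"
    return "unknown"


def install_comment_hook(engine, get_trace_ids):
    """Attach the hook to an engine; get_trace_ids() returns (trace_id, span_id)."""
    from sqlalchemy import event  # deferred: only needed when wiring up an engine

    @event.listens_for(engine, "before_cursor_execute", retval=True)
    def _add_comment(conn, cursor, statement, parameters, context, executemany):
        trace_id, span_id = get_trace_ids()
        stack = [(f.filename, f.lineno) for f in traceback.extract_stack()]
        comment = build_sql_comment(trace_id, span_id, find_callsite(stack))
        # With retval=True the listener must return (statement, parameters).
        return f"{statement} {comment}", parameters
```

Appending the comment (rather than prepending it) is an arbitrary choice here. Walking the stack on every query has a cost, so it may be worth sampling or caching in a hot path.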
|
# ? Sep 12, 2023 03:48 |
|
Vulture Culture posted:Do you find this more helpful than using opentelemetry-instrumentation-sqlalchemy and just letting OTel sort it out? I haven't started looking into OTEL until fairly recently and then only in Go. We've been instrumenting using ddtrace for over 6 years now. I added the hook to comment about 3 years ago when OTEL was a lot rougher. But it's cool that someone added that to OTEL. edit: does opentelemetry-instrumentation-sqlalchemy add callsites? since we get a huge amount of value out of that (prob even more than the traces) since you can just go to the file
|
# ? Sep 12, 2023 04:05 |
|
Does anyone have some super baseline monitoring / alerting advice for a handful of lightweight ECS clusters? I have four clusters across two AWS accounts with no more than 8 services running on each. This is a fairly simple product without a lot of moving parts and all of the services are outputting logs into CloudWatch. I figure AWS-native tools will be just fine and I don't need to get anywhere near Datadog. In re: discussion about how much programming one should be able to do as a DevOps engineer, I've written very little code for the last several years (outside of a bunch of Terraform). I think I've been able to succeed by a combination of being a generalist and having a decent amount of soft skills. I do see it as an impediment to my future, but it's hard to find time to take like, a Go course if I don't have time for it after work since there's a baby to be responsible for now.
|
# ? Sep 13, 2023 15:34 |
|
golang would be a waste of time, but python is a must
|
# ? Sep 13, 2023 15:38 |
|
But even then you're not building full applications. The most complicated stuff I make is still essentially gluing APIs together in a giant Rube Goldberg machine
|
# ? Sep 13, 2023 15:39 |
|
If you don't own your whole environment and you need to interoperate with things other people are building, that you don't control, you need to have sufficient data literacy and analysis skills to understand small details of your environment at scale. This is the key to migrations, and migrations are the key to actually bringing operational cost down by consolidating and automating platforms. You can lean on talented SWEs in your organization, but this makes ad-hoc exploratory analysis harder, and to make it work, you need to have a great idea up front of what hypotheses you're trying to invalidate. I've solved a lot of interesting problems by being a generalist expert in a dozen completely separate things, and being able to keep pace with staff SWEs in different domains. To a point, it's gotten me higher up the ladder than people with narrower skillsets. But you'll always hit a ceiling as a generalist somewhere. Build the skills you need to get the job done well, but before you build something, figure out if the company should hire another person to do the thing you're thinking of. If you don't foresee a role being in charge of that thing, and that being an intelligent hire, you've gone too deep down a rabbit hole.
|
# ? Sep 13, 2023 17:44 |
|
luminalflux posted:I haven't started looking into OTEL until fairly recently and then only in Go. We've been instrumenting using ddtrace for over 6 years now. I added the hook to comment about 3 years ago when OTEL was a lot rougher. But it's cool that someone added that to OTEL.
|
# ? Sep 13, 2023 17:57 |
|
Terraform fork officially joins the Linux Foundation. Also for some reason it's now "OpenTofu" which is a really stupid sounding name but whatever https://www.linuxfoundation.org/press/announcing-opentofu
|
# ? Sep 20, 2023 17:50 |
|
I have a best practices question. Simplifying somewhat, I have Service A that creates a workflow in Argo Workflows, which will launch Service B. We include the trace context as part of the Argo message, Service B retrieves that context and uses it for all future spans. Service B may then invoke Service C and Service D, and we use the Propagation API to extract and inject that context for downstream services. There's a piece of metadata that we want to include that comes from the Argo message, the correlation_id. Ideally, we'd include that as a span attribute for all downstream services. What is an effective way to do so? We could use baggage, which is included in Context Propagation, but baggage does not automatically create attributes in child spans, and it must explicitly be retrieved instead. Put generally, is there a way to automatically set attributes for all future child spans when passing context between services? Or am I misunderstanding the baggage API? I'm looking at https://docs.honeycomb.io/getting-data-in/opentelemetry/python-distro/#multi-span-attributes and the https://opentelemetry.io/docs/specs/otel/overview/#baggage-signal and finding myself a bit confused.
|
# ? Sep 20, 2023 18:25 |
|
The Iron Rose posted:I have a best practices question. Simplifying somewhat, I have Service A that creates a workflow in Argo Workflows, which will launch Service B. We include the trace context as part of the argo message, Service B retrieves that context and uses it for all future spans. Service B may then invoke Service C and Service D, and we use the Propagation API to extract and inject that context for downstream services. I had this exact problem with OTEL/baggage earlier this year. No idea why it doesn't work this way out of the box as you'll find lots of people asking the same question, but you have to implement it yourself using a custom BatchSpanProcessor (which isn't that hard it turns out): code:
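The snippet attached to this post did not survive in this copy of the thread. What follows is a minimal sketch of the general idea, not the poster's code: a `BatchSpanProcessor` subclass whose `on_start` copies selected baggage entries onto each new span. `select_baggage` and `build_baggage_processor` are hypothetical names, the allow-list of keys is an assumption (baggage can carry values you may not want on every span), and the OpenTelemetry imports are deferred so the helper can be exercised without the SDK installed.

```python
def select_baggage(entries, allowed_keys=("correlation_id",)):
    """Return the allow-listed baggage entries as string-valued span attributes."""
    return {key: str(entries[key]) for key in allowed_keys if key in entries}


def build_baggage_processor(exporter, allowed_keys=("correlation_id",)):
    """Create a BatchSpanProcessor that stamps baggage onto every span it starts."""
    # Deferred imports: only required when the processor is actually built.
    from opentelemetry import baggage
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    class BaggageBatchSpanProcessor(BatchSpanProcessor):
        def on_start(self, span, parent_context=None):
            # Baggage travels in the parent context, which the propagator has
            # already filled in from the incoming message or request.
            entries = baggage.get_all(parent_context)
            for key, value in select_baggage(entries, allowed_keys).items():
                span.set_attribute(key, value)
            super().on_start(span, parent_context)

    return BaggageBatchSpanProcessor(exporter)
```

Each service would register the result with `TracerProvider.add_span_processor(...)`. This assumes the service that reads the Argo message puts the value into baggage (for example with `baggage.set_baggage("correlation_id", value)` and attaching the returned context) and that the baggage propagator is enabled alongside trace context.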
|
# ? Sep 20, 2023 19:34 |
|
OpenTofu shortens to recognizable variations like OTF or OpenTF while still sidestepping the trademark issues you'd get from just calling it "TF", since the HashiCorp logo for Terraform is defensibly a stylized TF. Also "opentofu" is shorter to write than "terraform" by one character and "tofu" is many chars shorter. Plus now we have a much better theme for OpenTofu-related projects, with a billion soy-related naming conventions people can play off. I'm fine with it.
|
# ? Sep 20, 2023 20:14 |
|
can't wait for the opentofu / chef collaboration
|
# ? Sep 21, 2023 10:04 |
|
vanity slug posted:can't wait for the opentofu / chef collaboration
|
# ? Sep 21, 2023 12:28 |
|
Getting ready to s/terragrunt/soyboy/g all my repositories
|
# ? Sep 25, 2023 04:00 |
|
I'm already cybersquatting that domain get on my level
|
# ? Sep 25, 2023 10:44 |
|
Hi, I'm looking for any suggestions for tools like Pagerduty, zenduty, etc. We are a small team (the on-call rotation is 4) standardizing incident management for the first time. Notably, we have a somewhat atypical use case: besides incidents triggered by Prometheus alerts, we also have manually raised incidents. To simplify quite a bit, our product is a glorified ETL system, and we have internal operators setting up the inputs and sometimes something unexpected happens. Despite building lots of internal tooling to give feedback to the operators so they can correct it (e.g. by adding a step in the pipeline), sometimes they need help and raise an incident to engineering. These are the incidents we are most interested in tracking, to keep metrics about the frequency, attribution (e.g. bad input, operator error, bug), time to resolve etc. We are now in a trial of zenduty and it seems fine, we liked it because you mostly manage the incidents in Slack which is how we do it now and is more appealing than a totally new external platform. That said, it's not super polished. But the big issue is that there isn't first-class support for manually raising incidents (and giving feedback to the one who raised it) without adding them into the platform. We would have to build a Slackbot for that. We will consider that approach, but in the meantime I thought I would ask if pagerduty or any other alternative (google brought up "ilert" for example) might be better suited for manually raised incidents. Or are we thinking of the wrong type of tool and we want a helpdesk app instead? Any recommendations welcome. And to preempt, of course most of that operator work can be automated; the team handling on-call is exactly the team responsible for that. But there is some data entry required in each pipeline so there will always be an internal operator involved -- we only try to reduce their work and increase the success rate.
|
# ? Oct 2, 2023 13:10 |
|
This sounds like a process issue, not a monitoring and alerting issue. I would push your ops manager to design a change control process to handle this. Something like three outcomes:
1. ETL does not fail. No ticket.
2. a. ETL fails
2. b. Operator adds extra step
2. c. ETL works again. File ticket in JIRA to document change for change control purposes -> autoclose
3. a. ETL fails
3. b. Operator adds extra step
3. c. ETL fails again. Operator files ticket in JIRA to escalate. This sends an email to ops manager and eng manager + director
3. d. JIRA escalation ticket is triaged by X party (ops and eng manager?) and assigned
3. e. Engineer fixes ETL, sends to operator
3. f. ETL works again. Operator closes ticket
Why did you not choose pager duty? Price?
|
# ? Oct 2, 2023 19:22 |
|
Yeah pager duty is the gold standard and unless it is cost prohibitive I would look there first. You can easily manually raise incidents by email or api or whatever you want. It has integrations with literally everything, including slack. It’s an excellent product and I would want a good reason before looking at anything else E: removed some stuff after rereading OP
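To make the "by API" option concrete, here is a minimal sketch of raising an incident by hand, assuming PagerDuty's Events API v2 and its `enqueue` endpoint. `build_trigger_event` and `send_event` are hypothetical helper names, and the routing key, source, and `custom_details` fields shown in the usage are placeholders rather than anything from this thread.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="error", details=None):
    """Build the JSON body for a manually raised 'trigger' event."""
    if severity not in SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(SEVERITIES)}")
    return {
        "routing_key": routing_key,  # integration key of the target service
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
            # Free-form metadata, e.g. who raised it and a first guess at attribution.
            "custom_details": details or {},
        },
    }


def send_event(event, url=EVENTS_URL, timeout=10):
    """POST the event and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

An operator-facing form or Slack command could call something like `send_event(build_trigger_event(KEY, "Pipeline 12 failed after added step", "ops-console", details={"raised_by": "operator", "attribution": "bad input"}))`, where every value is illustrative.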
|
# ? Oct 2, 2023 20:07 |
|
Docjowles posted:Yeah pager duty is the gold standard and unless it is cost prohibitive I would look there first. You can easily manually raise incidents by email or api or whatever you want. It has integrations with literally everything, including slack. It’s an excellent product and I would want a good reason before looking at anything else Absolutely PagerDuty is a great product. I've used it once or twice and also used Big Tech replacements, which tend to be...fine, compared to PD at best. A single outage could cost you more than a few years of PD based on their pricing plans (depending on impact/etc), so I'd definitely recommend it. If there's no financial impact to these issues or the cost is prohibitive otherwise then Hadlock might be right; this might be something you can fix via process.
|
# ? Oct 2, 2023 20:55 |
|
Yeah it costs $100k/hr hard cost in revenue loss for us to be down with a MTTR of 45 min; plus the soft costs of ripping engineers away from whatever the hell they were doing that day. $35/mo/engineer is not even a rounding error. If our paging service was down for 15 minutes that just cost us $25k
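The arithmetic behind those numbers, as a quick sketch. Only the $100k/hr, 45 minute, 15 minute, and $35/mo figures come from the post; `downtime_cost` is a made-up helper and the model is plain linear revenue loss with no soft costs.

```python
def downtime_cost(revenue_loss_per_hour, minutes_down):
    """Hard revenue loss for an outage, assuming loss accrues linearly."""
    return revenue_loss_per_hour * minutes_down / 60


paging_outage = downtime_cost(100_000, 15)     # 15 minutes without paging
typical_incident = downtime_cost(100_000, 45)  # one incident at a 45 minute MTTR

# How many seat-months at $35/mo the 15 minute outage would have paid for.
seat_months = paging_outage / 35

print(f"15 min paging outage: ${paging_outage:,.0f}")
print(f"45 min incident:      ${typical_incident:,.0f}")
print(f"Equivalent seats:     {seat_months:,.0f} seat-months at $35")
```

So the 15 minute case is $25k, a full 45 minute incident is $75k, and the former alone covers roughly 714 seat-months of the paging tool.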
|
# ? Oct 2, 2023 23:44 |
|
Since everyone mentioned it, yeah we'll do a Pagerduty trial too, I don't recall why we chose to try Zenduty first but those are the two on our list to try. But now that we're trying it I'm wondering if we have the right type of tool for our use case in the first place. For 4 users pricing isn't really an issue as long as we only pay for responders.
Hadlock posted:This sounds like a process issue not monitoring and alerting issue. I would push your ops manager to design a change control process to handle this
Yeah, our ops process could definitely be better systematized, although this is pretty much what we're doing, just replace the JIRA ticket with a message tagging a Slack group with rotating membership, and we discuss in the thread. But you're right, it's not necessarily a monitoring and alerting issue, more about the process and the communication. The key things we need are call routing and the ability to apply metadata (tags or categories) to incidents in order to track metrics of the process -- nothing crazy. We thought about a ticket-based system but those all seem too async for the number of back-and-forths we usually take and the speed with which we are typically resolving the problems. Although having typed that out, I'm wondering if we missed a whole category of product -- helpdesk with live chat. We use Gitlab issues for engineering work items which would be too async for incidents, but for sure there are chat-based support platforms that might make sense for us. Then we may still need pagerduty/zenduty for the call routing, so it starts to get complicated but anyway any one of these pieces will be an incremental improvement. If we stick with Pagerduty/Zenduty then it looks like we need to implement a Slack bot against the incident management API. Which isn't a bad idea as we can also start to incorporate some of the manual steps into Slack commands/buttons (against our own API), something we've been planning.
But my main question coming here was to look for any option where we don't have to build a Slackbot in order to support manually triggered incidents and provide ongoing feedback back to the one who triggered it. Which yeah... that's a helpdesk not an incident management platform, right?
|
# ? Oct 3, 2023 00:52 |
|
Some minor notes since I'm on my phone: Help desk is a business group that helps users with their computers, not a product category. I think you're thinking of a ticketing system instead. If you're already using Jira there might be integrations available for PD but I've never used Jira so no idea.
|
# ? Oct 3, 2023 03:47 |
|
SurgicalOntologist posted:These are the incidents we are most interested in tracking, to keep metrics about the frequency, attribution (e.g. bad input, operator error, bug), time to resolve etc. Jira has an entire suite of analytics/graph stuff, so you can do all kinds of analytics on this. Pager duty can shoot off an API request to both JIRA and slack, but IMO it's better to have pager duty generate a ticket, and then have JIRA send a notification to slack. Slack is super cool but IMO it's a black hole for information. If you want to track metrics on process you need to document it in a database -> that is JIRA. People have a tendency to try and cram everything into slack. As they say, you can't manage what you can't measure, and slack is an amorphous blob of random garbage.
|
# ? Oct 3, 2023 04:37 |
|
Also if you subscribe to Jira cloud you get opsgenie, Atlassian's answer to pagerduty, for free.
|
# ? Oct 3, 2023 13:26 |
|
Mustache Ride posted:Also if you subscribe to Jira cloud you get opsgenie, Atlassian's pagerduty for free. Yeah, that's great if you're okay with your incident management system being unavailable for over a week.
|
# ? Oct 3, 2023 13:48 |
|
vanity slug posted:Yeah, that's great if you're okay with your incident management system being unavailable for over a week. If it prevents me getting a 2am call yes absolutely thank you.
|
# ? Oct 3, 2023 17:31 |
|
Trapick posted:If it prevents me getting a 2am call yes absolutely thank you. You just get a call manually from your boss instead and he's going to be a lot grumpier than the pager would be.
|
# ? Oct 3, 2023 23:29 |
|
I didn't hate ops genie when I had to use it like 6 years ago
|
# ? Oct 4, 2023 02:01 |
|
Hadlock posted:I didn't hate ops genie when I had to use it like 6 years ago Yea it’s fine
|
# ? Oct 4, 2023 02:02 |
|
Thanks all. Yeah, looks like what we really need is a ticketing system, then yes maybe pagerduty for managing the schedule and call routing / notifications to the right person but for now the ticketing is the critical thing. We use Gitlab issues for dev ticketing (although we kind of hate it and the product team wants to move to linear.app), but either way we're not bringing the operations team into Gitlab. There's a service desk feature, looks kind of lovely but maybe worth trying...or maybe we'll try something like Zendesk. Anyways, getting kind of afield from the thread topic, thanks for pointing us in the right direction.
|
# ? Oct 4, 2023 20:26 |
|
Morbidly curious, but how does your product team collaborate with engineering on new products and features without a ticketing system
|
# ? Oct 4, 2023 22:06 |
|
SurgicalOntologist posted:Thanks all. Yeah, looks like what we really need is a ticketing system, then yes maybe pagerduty for managing the schedule and call routing / notifications to the right person but for now the ticketing is the critical thing. Yeah Gitlab really wants to be your one stop shop for everything at all related to software development and operations. But it's not realistic for most companies that have been around for a while. If nothing else, freaking everyone has Jira for ticketing and sprint planning and Confluence or some other wiki for documentation and runbooks. But Gitlab is pulling a "well we have all these features, and you gotta pay for them whether you want em or not" business model. Rather than breaking them into separate products and letting you buy what you want a la carte. The pricing has gotten outlandish enough that we're looking at moving to GitHub despite it being a huge undertaking.
|
# ? Oct 4, 2023 23:02 |
|
I still can't make MR X depend on MR Y, which depends on MR Z. Can't choose whether to expand multi-line scripts in GitLab CI either. But fortunately, there's now a webhook for emoji reactions.
|
# ? Oct 4, 2023 23:11 |
|
Docjowles posted:But Gitlab is pulling a "well we have all these features, and you gotta pay for them whether you want em or not" business model. Rather than breaking them into separate products and letting you buy what you want a la carte. The pricing has gotten outlandish enough that we're looking at moving to GitHub despite it being a huge undertaking. TBF GitHub has been doing this since the Microsoft acquisition. GitHub Actions has really made their service very sticky. They have wikis and static site hosting and some other stuff too.
|
# ? Oct 4, 2023 23:18 |
|
We moved off BitBucket/CircleCI earlier this year to GitHub; it was a major event. I cannot tell you how much happier the entire organization is afterwards. GHA has its oddities, but they're constantly improving it and there are some pretty nice and simple patterns for reusability for workflows.
|
# ? Oct 5, 2023 01:48 |
vanity slug posted:I still can't make MR X depend on MR Y, which depends on MR Z. Can't choose whether to expand multi-line scripts in GitLab CI either. But fortunately, there's now a webhook for emoji reactions. I want that first one, as well as being able to handle the "merge this into the oldest maintained release branch and all branches in between" e.g. My critical security fix is targeting release/1.x and also needs to be merged into release/2.x and develop Currently we just do the MR for the oldest branch, then the merge to the other branches happens without an MR. Would be nice to just have it all contained within a single MR though (with an option for empty merge for changes we don't want to merge forward)
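The manual merge-forward routine described here can be sketched as a small command generator. This is an illustration under assumptions, not the poster's tooling: `merge_forward_plan` is a hypothetical name, the branch list is supplied oldest first, and "empty merge" is interpreted as git's `-s ours` strategy, which records the merge without taking the source's changes. It only builds the command strings; nothing is executed.

```python
def merge_forward_plan(branches, empty_merge_into=()):
    """Return git commands that carry a fix from the oldest branch to the newest.

    branches: maintained branches, oldest first (the fix already landed on branches[0]).
    empty_merge_into: targets where the merge is recorded but its changes are discarded.
    """
    commands = []
    # Pair each branch with the next newer one: (1.x -> 2.x), (2.x -> develop), ...
    for source, target in zip(branches, branches[1:]):
        commands.append(f"git checkout {target}")
        if target in empty_merge_into:
            # '-s ours' creates the merge commit but keeps the target's tree as-is.
            commands.append(f"git merge -s ours {source}")
        else:
            commands.append(f"git merge --no-ff {source}")
    return commands


for line in merge_forward_plan(["release/1.x", "release/2.x", "develop"]):
    print(line)
```

Passing `empty_merge_into=("develop",)` would swap the final step for an empty merge, which keeps the branch history connected for later merge-forwards.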
|
|
# ? Oct 5, 2023 02:21 |
|
Hadlock posted:Morbidly curious, but how does your product team collaborate with engineering on new products and features without a ticketing system I think we had a misunderstanding somewhere along the line. We use GitLab issues/epics/milestones etc. If that doesn't count as a ticketing system, I'm confused. It doesn't suit our needs for supporting our operations team because they're not in GitLab, plus the communication is too slow/async. If we end up building our own "support frontend" that hits the API of one or more services then sure we could tailor it to our needs, but I'd rather find something better suited out of the box (and then yes connect it to GL issues). Docjowles posted:Yeah Gitlab really wants to be your one stop shop for everything at all related to software development and operations. But it's not realistic for most companies that have been around for a while. If nothing else, freaking everyone has Jira for ticketing and sprint planning and Confluence or some other wiki for documentation and runbooks. But Gitlab is pulling a "well we have all these features, and you gotta pay for them whether you want em or not" business model. Rather than breaking them into separate products and letting you buy what you want a la carte. The pricing has gotten outlandish enough that we're looking at moving to GitHub despite it being a huge undertaking. Yeah same. Trying to be less and less dependent on them. Not only the pricing changes, but they've been glacial at improving obvious gaps in the product. I'm probably following a dozen issues on their tracker that at some point looked promising but most have been silent for years. They try to be everything and it's all half-baked, there's almost always a better choice than adopting one of their features even considering we already pay for it.
|
# ? Oct 5, 2023 08:59 |
|
SurgicalOntologist posted:I think we had a misunderstanding somewhere along the line. We use GitLab issues/epics/milestones etc. If that doesn't count as a ticketing system, I'm confused. Apparently your system works for you. I just deleted two paragraphs, but the gist is I'm getting the impression that your product and engineering orgs have a long ways to go to reach true operational maturity. Good luck. Maybe gitlab has a true JIRA analog these days that is totally decoupled from repos for the product team? That would go a long ways probably
|
# ? Oct 5, 2023 09:52 |
|
SurgicalOntologist posted:I think we had a misunderstanding somewhere along the line. We use GitLab issues/epics/milestones etc. If that doesn't count as a ticketing system, I'm confused. Heh yeah it’s become a meme with us that any time we request a feature or bug fix our account rep just posts a “+1 enterprise customer with XYZ seats requests this” on an issue from 2018 and nothing is ever heard from them again. It’s totally performative. I respect the honesty of companies like Amazon that respond to attempts at hardball with “lol come back when you have 3 more zeroes on your monthly bill tough guy”
|
# ? Oct 5, 2023 10:16 |
|
Hadlock posted:Apparently your system works for you I mean it "works" for us to an extent, but like I said we're not happy and trying to move away from Gitlab, actively for product management if only passively for everything else. And yes absolutely we're not at operational maturity. It lacks a lot of things to be a true JIRA analog, they're clearly pulling in too many directions. There's some basic structure you can do with Epics, Milestones, and Iterations but most of what you want you need to set up with labels and figure out your own system for turning that into a workflow and using the labels to structure your boards and metrics. But with any moderately involved system it just becomes label soup. Just check the "child issues" here and you'll see what I mean -- and this is from GitLab's own dogfooding, which doesn't exactly inspire confidence: https://gitlab.com/groups/gitlab-org/-/epics/328 For us, running a basic kanban process with each team having a board for issues and a global board for epics, using milestones for longer cycles, it's been OK but we're outgrowing it. It's probably relevant context that our "product org" is... 2 people? And we have 12 engineers, that's if you count 3 students we support in their PhDs that are doing longer-term R&D. But I mean, yeah, to answer your previous question, we do communicate through tickets (issues) even if we don't have some intricate JIRA setup.
|
# ? Oct 5, 2023 12:42 |