Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

luminalflux posted:

We use SQLAlchemy too, it's fairly easy with sqlalchemy.events.ConnectionEvents.before_cursor_execute to hook the cursor event and modify the query. We've done this with Datadog, where we actually kick open a new span inside this hook, add some tags we glean from various context handlers, and write the span and trace ID inside the comment as well as the callsite. This makes it easy to look at the process list, find the trace and span, but also see which goddamn file it's coming from.
Do you find this more helpful than using opentelemetry-instrumentation-sqlalchemy and just letting OTel sort it out?


luminalflux
May 27, 2005



Vulture Culture posted:

Do you find this more helpful than using opentelemetry-instrumentation-sqlalchemy and just letting OTel sort it out?

I haven't started looking into OTEL until fairly recently and then only in Go. We've been instrumenting using ddtrace for over 6 years now. I added the hook to comment about 3 years ago when OTEL was a lot rougher. But it's cool that someone added that to OTEL.

edit: does opentelemetry-instrumentation-sqlalchemy add callsites? since we get a huge amount of value out of that (prob even more than the traces) since you can just go to the file
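
A rough sketch of the comment hook described above, for anyone who wants to try it. This is not luminalflux's actual code: `find_callsite`, `build_comment`, `install`, and the `get_ids` callable are made-up names, and the trace and span IDs are handed in rather than pulled from ddtrace. The only SQLAlchemy API assumed is the `before_cursor_execute` event registered with `retval=True`, which lets the listener return a modified statement.

```python
import os
import traceback


def find_callsite(skip_paths=("sqlalchemy",), skip_funcs=("find_callsite", "_hook")):
    """Return 'file:line' for the innermost frame that isn't library or hook code."""
    for frame in reversed(traceback.extract_stack()):
        if frame.name in skip_funcs:
            continue
        if any(part in frame.filename for part in skip_paths):
            continue
        return f"{os.path.basename(frame.filename)}:{frame.lineno}"
    return "unknown"


def build_comment(statement, trace_id, span_id, callsite):
    """Prefix the SQL with a comment so it shows up first in the process list."""
    fields = f"trace_id={trace_id} span_id={span_id} callsite={callsite}"
    # Keep a stray '*/' in any value from closing the comment early.
    fields = fields.replace("*/", "")
    return f"/* {fields} */ {statement}"


def install(engine, get_ids):
    """Attach the hook; get_ids() is a hypothetical callable returning (trace_id, span_id)."""
    from sqlalchemy import event  # imported lazily so the helpers work without SQLAlchemy

    def _hook(conn, cursor, statement, parameters, context, executemany):
        trace_id, span_id = get_ids()
        return build_comment(statement, trace_id, span_id, find_callsite()), parameters

    event.listen(engine, "before_cursor_execute", _hook, retval=True)
```

The path filter in `find_callsite` is deliberately crude; a real version would probably skip frames by package root instead of by substring.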

Necronomicon
Jan 18, 2004

Does anyone have some super baseline monitoring / alerting advice for a handful of lightweight ECS clusters? I have four clusters across two AWS accounts with no more than 8 services running on each. This is a fairly simple product without a lot of moving parts and all of the services are outputting logs into CloudWatch. I figure AWS-native tools will be just fine and I don't need to get anywhere near DataDog.

In re: discussion about how much programming one should be able to do as a DevOps engineer, I've written very little code for the last several years (outside of a bunch of Terraform). I think I've been able to succeed by combination of being a generalist and having a decent amount of soft skills. I do see it as an impediment to my future, but it's hard to find time to take like, a Go course if I don't have time for it after work since there's a baby to be responsible for now.

The Fool
Oct 16, 2003


golang would be a waste of time, but python is a must

The Fool
Oct 16, 2003


But even then you're not building full applications. The most complicated stuff I make is still essentially gluing apis together in a giant rube Goldberg machine

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
If you don't own your whole environment and you need to interoperate with things other people are building, that you don't control, you need to have sufficient data literacy and analysis skills to understand small details of your environment at scale. This is the key to migrations, and migrations are the key to actually bringing operational cost down by consolidating and automating platforms. You can lean on talented SWEs in your organization, but this makes ad-hoc exploratory analysis harder, and to make it work, you need to have a great idea up front of what hypotheses you're trying to invalidate.

I've solved a lot of interesting problems by being a generalist expert in a dozen completely separate things, and being able to keep pace with staff SWEs in different domains. To a point, it's gotten me higher up the ladder than people with narrower skillsets. But you'll always hit a ceiling as a generalist somewhere. Build the skills you need to get the job done well, but before you build something, figure out if the company should hire another person to do the thing you're thinking of. If you don't foresee a role being in charge of that thing, and that being an intelligent hire, you've gone too deep down a rabbit hole.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

luminalflux posted:

I haven't started looking into OTEL until fairly recently and then only in Go. We've been instrumenting using ddtrace for over 6 years now. I added the hook to comment about 3 years ago when OTEL was a lot rougher. But it's cool that someone added that to OTEL.

edit: does opentelemetry-instrumentation-sqlalchemy add callsites? since we get a huge amount of value out of that (prob even more than the traces) since you can just go to the file
I haven't gotten to testing that part yet, so I'm not positive. It is in OTel-instrumented logs when using opentelemetry-instrumentation-logging.

Docjowles
Apr 9, 2009

Terraform fork officially joins the Linux Foundation. Also for some reason it's now "OpenTofu" which is a really stupid sounding name but whatever

https://www.linuxfoundation.org/press/announcing-opentofu

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
I have a best practices question. Simplifying somewhat, I have Service A that creates a workflow in Argo Workflows, which will launch Service B. We include the trace context as part of the argo message, Service B retrieves that context and uses it for all future spans. Service B may then invoke Service C and Service D, and we use the Propagation API to extract and inject that context for downstream services.

There's a piece of metadata that we want to include that comes from the argo message, the correlation_id . Ideally, we'd include that as a span attribute for all downstream services. What is an effective way to do so? We could use baggage, which is included in Context Propagation, but baggage does not automatically create attributes in child spans, and it must explicitly be retrieved instead.

Put generally, is there a way to automatically set attributes for all future child spans when passing context between services? Or am I misunderstanding the baggage API?

I'm looking at https://docs.honeycomb.io/getting-data-in/opentelemetry/python-distro/#multi-span-attributes and the https://opentelemetry.io/docs/specs/otel/overview/#baggage-signal and finding myself a bit confused.

TheJanitor
Apr 17, 2007
Ask me about being the strongest janitor since Roger Wilco

The Iron Rose posted:

I have a best practices question. Simplifying somewhat, I have Service A that creates a workflow in Argo Workflows, which will launch Service B. We include the trace context as part of the argo message, Service B retrieves that context and uses it for all future spans. Service B may then invoke Service C and Service D, and we use the Propagation API to extract and inject that context for downstream services.

There's a piece of metadata that we want to include that comes from the argo message, the correlation_id . Ideally, we'd include that as a span attribute for all downstream services. What is an effective way to do so? We could use baggage, which is included in Context Propagation, but baggage does not automatically create attributes in child spans, and it must explicitly be retrieved instead.

Put generally, is there a way to automatically set attributes for all future child spans when passing context between services? Or am I misunderstanding the baggage API?

I'm looking at https://docs.honeycomb.io/getting-data-in/opentelemetry/python-distro/#multi-span-attributes and the https://opentelemetry.io/docs/specs/otel/overview/#baggage-signal and finding myself a bit confused.

I had this exact problem with OTEL/baggage earlier this year. No idea why it doesn't work this way out of the box as you'll find lots of people asking the same question, but you have to implement it yourself using a custom BatchSpanProcessor (which isn't that hard it turns out):

code:
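import typing

from opentelemetry import baggage
from opentelemetry.context import Context
# Assumption: the gRPC OTLP exporter; swap in the HTTP one if that's what you use.
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import Span
from opentelemetry.sdk.trace.export import BatchSpanProcessor


class BatchBaggageSpanProcessor(BatchSpanProcessor):
    """Copies every baggage entry onto each span as it starts."""

    def on_start(
        self, span: Span, parent_context: typing.Optional[Context] = None
    ) -> None:
        super().on_start(span, parent_context)
        # With parent_context=None, baggage is read from the current context.
        for name, value in baggage.get_all(context=parent_context).items():
            span.set_attribute(name, value)

...

# Wherever you set up your OTEL tracer stuff (provider is your TracerProvider)
processor = BatchBaggageSpanProcessor(OTLPSpanExporter())
provider.add_span_processor(processor)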

Hadlock
Nov 9, 2004

OpenTofu shortens to recognizable variations like OTF or OpenTF while still sidestepping the trademark issues you'd get by just calling it "TF"; the HashiCorp logo for Terraform is defensibly a stylized TF. Also "opentofu" is one character shorter to write than "terraform", and "tofu" is many chars shorter. Plus now we have a much better theme for OpenTofu-related projects, with a billion soy-related naming conventions people can play off. I'm fine with it.

vanity slug
Jul 20, 2010

can't wait for the opentofu / chef collaboration

Sagacity
May 2, 2003
Hopefully my epitaph will be funnier than my custom title.

vanity slug posted:

can't wait for the opentofu / chef collaboration
i'm sure we can fit an "omakase" reference in somewhere

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Getting ready to s/terragrunt/soyboy/g all my repositories

Hadlock
Nov 9, 2004

I'm already cybersquatting that domain get on my level :synpa:

SurgicalOntologist
Jun 17, 2004

Hi, I'm looking for any suggestions for tools like Pagerduty, zenduty, etc. We are a small team (the on-call rotation is 4) standardizing incident management for the first time. Notably, we have a somewhat atypical use case: besides incidents triggered by Prometheus alerts, we also have manually raised incidents. To simplify quite a bit, our product is a glorified ETL system, and we have internal operators setting up the inputs and sometimes something unexpected happens. Despite building lots of internal tooling to give feedback to the operators so they can correct it (e.g. by adding a step in the pipeline), sometimes they need help and raise an incident to engineering. These are the incidents we are most interested in tracking, to keep metrics about the frequency, attribution (e.g. bad input, operator error, bug), time to resolve etc.

We are now in a trial of zenduty and it seems fine, we liked it because you mostly manage the incidents in Slack which is how we do it now and is more appealing than a totally new external platform. That said, it's not super polished. But the big issue is that there isn't first-class support for manually raising incidents (and giving feedback to the one who raised it) without adding them into the platform. We would have to build a Slackbot for that. We will consider that approach, but in the meantime I thought I would ask if pagerduty or any other alternative (google brought up "ilert" for example) might be better suited for manually raised incidents. Or are we thinking of the wrong type of tool and we want a helpdesk app instead? Any recommendations welcome.

And to preempt, of course most of those operators work can be automated; the team handling on-call is exactly the team responsible for that. But there is some data entry required in each pipeline so there will always be an internal operator involved -- we only try to reduce their work and increase the success rate.

Hadlock
Nov 9, 2004

This sounds like a process issue not monitoring and alerting issue. I would push your ops manager to design a change control process to handle this

Something like three outcomes

1. ETL does not fail. No ticket

2. a. ETL fails
2. b. operator adds extra step
2. c. ETL works again. File ticket in JIRA to document change for change control purposes -> autoclose

3. a. ETL fails
3. b. operator adds extra step
3. c. ETL fails again. Operator files ticket in JIRA to escalate. This sends an email to ops manager and eng manager + director
3. d. JIRA escalation ticket is triaged by X party (ops and eng manager?) and assigned
3. e. Engineer fixes ETL, sends to operator
3. f. ETL works again. operator closes ticket
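
The three outcomes listed above boil down to a small decision function. A minimal sketch, where the `Outcome` names and the `triage` function are invented for illustration and the actual ticket filing is left out:

```python
from enum import Enum


class Outcome(Enum):
    NO_TICKET = "no ticket"
    CHANGE_TICKET = "file change-control ticket, autoclose"
    ESCALATE = "file escalation ticket, notify managers"


def triage(first_run_ok: bool, retry_ok_after_extra_step: bool = False) -> Outcome:
    """Map an ETL run onto the three outcomes listed above."""
    if first_run_ok:
        return Outcome.NO_TICKET          # outcome 1
    if retry_ok_after_extra_step:
        return Outcome.CHANGE_TICKET      # outcome 2: operator fixed it
    return Outcome.ESCALATE               # outcome 3: engineering gets involved
```

Only the `ESCALATE` branch needs a human on the engineering side; the other two could be recorded automatically.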

why did you not choose pager duty? price?

Docjowles
Apr 9, 2009

Yeah pager duty is the gold standard and unless it is cost prohibitive I would look there first. You can easily manually raise incidents by email or api or whatever you want. It has integrations with literally everything, including slack. It’s an excellent product and I would want a good reason before looking at anything else

E: removed some stuff after rereading OP
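
For the "by api" route, a manually raised incident could look roughly like the sketch below. It assumes PagerDuty's Events API v2 (the `enqueue` endpoint with a `routing_key`, `event_action`, and `payload`); the function name, the `custom_details` keys, and every value passed in are placeholders, and the request is only built, not sent.

```python
import json
import urllib.request

# PagerDuty's public Events API v2 endpoint.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_request(routing_key, summary, source, raised_by, attribution):
    """Build (but don't send) a request that opens an incident by hand."""
    body = {
        "routing_key": routing_key,        # integration key of the target service
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": "error",
            # Free-form metadata; these two keys are made up for this example.
            "custom_details": {"raised_by": raised_by, "attribution": attribution},
        },
    }
    return urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_trigger_request(
    routing_key="YOUR_INTEGRATION_KEY",   # placeholder
    summary="Pipeline stuck after operator retry",
    source="etl-operator-console",
    raised_by="operator@example.com",
    attribution="bad input",
)
# urllib.request.urlopen(req) would actually send it.
```

Putting the attribution in `custom_details` is one way to carry the "bad input / operator error / bug" tag along with the incident.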

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Docjowles posted:

Yeah pager duty is the gold standard and unless it is cost prohibitive I would look there first. You can easily manually raise incidents by email or api or whatever you want. It has integrations with literally everything, including slack. It’s an excellent product and I would want a good reason before looking at anything else

E: removed some stuff after rereading OP

Absolutely PagerDuty is a great product. I've used it once or twice and also used Big Tech replacements, which tend to be...fine, compared to PD at best. A single outage could cost you more than a few years of PD based on their pricing plans (depending on impact/etc), so I'd definitely recommend it.

If there's no financial impact to these issues or the cost is prohibitive otherwise then Hadlock might be right; this might be something you can fix via process.

Hadlock
Nov 9, 2004

Yeah it costs $100k/hr hard cost in revenue loss for us to be down with a MTTR of 45 min; plus the soft costs of ripping engineers away from whatever the hell they were doing that day. $35/mo/engineer is not even a rounding error. If our paging service was down for 15 minutes that just cost us $25k
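
The arithmetic behind those numbers, as a quick sketch. The two constants come straight from the post; the function name is made up, and the cost is assumed to scale linearly with downtime.

```python
def downtime_cost(cost_per_hour: float, minutes_down: float) -> float:
    """Revenue lost for an outage of the given length."""
    return cost_per_hour * minutes_down / 60


COST_PER_HOUR = 100_000   # hard revenue loss per hour, from the post
SEAT_PER_MONTH = 35       # paging seat price per engineer per month, from the post

print(downtime_cost(COST_PER_HOUR, 15))                    # paging service down 15 min
print(downtime_cost(COST_PER_HOUR, 45))                    # one incident at a 45 min MTTR
print(downtime_cost(COST_PER_HOUR, 15) / SEAT_PER_MONTH)   # seat-months paid for by 15 min of downtime
```

That works out to $25k for 15 minutes, $75k for a 45 minute incident, and roughly 714 seat-months of paging covered by a single 15 minute gap.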

SurgicalOntologist
Jun 17, 2004

Since everyone mentioned it, yeah we'll do a Pagerduty trial too. I don't recall why we chose to try Zenduty first, but those are the two on our list to try. But now that we're trying it I'm wondering if we have the right type of tool for our use case in the first place. For 4 users pricing isn't really an issue as long as we only pay for responders.

Hadlock posted:

This sounds like a process issue not monitoring and alerting issue. I would push your ops manager to design a change control process to handle this

Something like three outcomes

1. ETL does not fail. No ticket

2. a. ETL fails
2. b. operator adds extra step
2. c. ETL works again. File ticket in JIRA to document change for change control purposes -> autoclose

3. a. ETL fails
3. b. operator adds extra step
3. c. ETL fails again. Operator files ticket in JIRA to escalate. This sends an email to ops manager and eng manager + director
3. d. JIRA escalation ticket is triaged by X party (ops and eng manager?) and assigned
3. e. Engineer fixes ETL, sends to operator
3. f. ETL works again. operator closes ticket

Yeah, our ops process could definitely be better systematized, although this is pretty much what we're doing, just replace the JIRA ticket with a message tagging a Slack group with rotating membership, and we discuss in the thread.

But you're right, it's not necessarily a monitoring and alerting issue, more about the process and the communication. The key things we need are call routing and the ability to apply metadata (tags or categories) to incidents in order to track metrics of the process -- nothing crazy.

We thought about a ticket-based system but those all seem too async for the number of back-and-forths we usually take and the speed with which we are typically resolving the problems. Although having typed that out, I'm wondering if we missed a whole category of product -- helpdesk with live chat. We use Gitlab issues for engineering work items which would be too async for incidents, but for sure there are chat-based support platforms that might make sense for us. Then we may still need pagerduty/zenduty for the call routing, so it starts to get complicated but anyway any one of these pieces will be an incremental improvement.

If we stick with Pagerduty/Zenduty then it looks like we need to implement a Slack bot against the incident management API. Which isn't a bad idea as we can also start to incorporate some of the manual steps into Slack commands/buttons (against our own API), something we've been planning. But my main question coming here was to look for any option where we don't have to build a Slackbot in order to support manually triggered incidents and provide ongoing feedback back to the one who triggered it. Which yeah... that's a helpdesk not an incident management platform, right?

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
Some minor notes since I'm on my phone:

Help desk is a business group that helps users with their computers, not a product category. I think you're thinking of a ticketing system instead. If you're already using Jira there might be integrations available for PD but I've never used Jira so no idea.

Hadlock
Nov 9, 2004

SurgicalOntologist posted:

. These are the incidents we are most interested in tracking, to keep metrics about the frequency, attribution (e.g. bad input, operator error, bug), time to resolve etc.

Jira has an entire suite of analytics/graph stuff, so you can do all kinds of analytics on this. Pager duty can shoot off an API request to both JIRA and slack, but IMO it's better to have pager duty generate a ticket, and then have JIRA send a notification to slack.

Slack is super cool but IMO it's a black hole for information. If you want to track metrics on process you need to document it in a database -> that is JIRA. People have a tendency to try and cram everything into slack. As they say, you can't manage what you can't measure, and slack is an amorphous blob of random garbage.
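
To make the "document it in a database" point concrete, here is a toy version of the metrics asked about earlier (frequency, attribution, time to resolve) once incidents live in a queryable table. It uses an in-memory SQLite database as a stand-in for whatever the ticketing system exposes; the schema and the sample rows are invented for illustration.

```python
import sqlite3

# Hypothetical sample rows: (attribution, opened_at, resolved_at).
ROWS = [
    ("bad input", "2023-09-01T10:00", "2023-09-01T10:30"),
    ("operator error", "2023-09-02T09:00", "2023-09-02T09:10"),
    ("bad input", "2023-09-03T14:00", "2023-09-03T15:30"),
    ("bug", "2023-09-04T11:00", "2023-09-04T13:00"),
]


def summarize(rows):
    """Return {attribution: (incident_count, avg_minutes_to_resolve)}."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE incidents (attribution TEXT, opened_at TEXT, resolved_at TEXT)"
    )
    conn.executemany("INSERT INTO incidents VALUES (?, ?, ?)", rows)
    query = """
        SELECT attribution,
               COUNT(*),
               AVG((julianday(resolved_at) - julianday(opened_at)) * 24 * 60)
        FROM incidents
        GROUP BY attribution
        ORDER BY attribution
    """
    # Round to whole minutes; julianday() arithmetic is floating point.
    result = {attr: (count, round(avg)) for attr, count, avg in conn.execute(query)}
    conn.close()
    return result


for attribution, (count, avg_minutes) in summarize(ROWS).items():
    print(f"{attribution}: {count} incidents, avg {avg_minutes} min to resolve")
```

None of this is possible while the only record is a Slack thread, which is the argument for getting each incident into a structured store first.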

Mustache Ride
Sep 11, 2001



Also if you subscribe to Jira cloud you get opsgenie, Atlassian's pagerduty for free.

vanity slug
Jul 20, 2010

Mustache Ride posted:

Also if you subscribe to Jira cloud you get opsgenie, Atlassian's pagerduty for free.

Yeah, that's great if you're okay with your incident management system being unavailable for over a week.

Trapick
Apr 17, 2006

vanity slug posted:

Yeah, that's great if you're okay with your incident management system being unavailable for over a week.
If it prevents me getting a 2am call yes absolutely thank you.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Trapick posted:

If it prevents me getting a 2am call yes absolutely thank you.

You just get a call manually from your boss instead and he's going to be a lot grumpier than the pager would be.

Hadlock
Nov 9, 2004

I didn't hate ops genie when I had to use it like 6 years ago

Junkiebev
Jan 18, 2002


Feel the progress.

Hadlock posted:

I didn't hate ops genie when I had to use it like 6 years ago

Yea it’s fine

SurgicalOntologist
Jun 17, 2004

Thanks all. Yeah, looks like what we really need is a ticketing system, then yes maybe pagerduty for managing the schedule and call routing / notifications to the right person but for now the ticketing is the critical thing.

We use Gitlab issues for dev ticketing (although we kind of hate it and the product team wants to move to linear.app), but either way we're not bringing the operations team into Gitlab. There's a service desk feature, looks kind of lovely but maybe worth trying...or maybe we'll try something like Zendesk.

Anyways, getting kind of afield from the thread topic, thanks for pointing us in the right direction.

Hadlock
Nov 9, 2004

Morbidly curious, but how does your product team collaborate with engineering on new products and features without a ticketing system

Docjowles
Apr 9, 2009

SurgicalOntologist posted:

Thanks all. Yeah, looks like what we really need is a ticketing system, then yes maybe pagerduty for managing the schedule and call routing / notifications to the right person but for now the ticketing is the critical thing.

We use Gitlab issues for dev ticketing (although we kind of hate it and the product team wants to move to linear.app), but either way we're not bringing the operations team into Gitlab. There's a service desk feature, looks kind of lovely but maybe worth trying...or maybe we'll try something like Zendesk.

Anyways, getting kind of afield from the thread topic, thanks for pointing us in the right direction.

Yeah Gitlab really wants to be your one stop shop for everything at all related to software development and operations. But it's not realistic for most companies that have been around for a while. If nothing else, freaking everyone has Jira for ticketing and sprint planning and Confluence or some other wiki for documentation and runbooks. But Gitlab is pulling a "well we have all these features, and you gotta pay for them whether you want em or not" business model. Rather than breaking them into separate products and letting you buy what you want a la carte. The pricing has gotten outlandish enough that we're looking at moving to GitHub despite it being a huge undertaking.

vanity slug
Jul 20, 2010

I still can't make MR X depend on MR Y, which depends on MR Z. Can't choose whether to expand multi-line scripts in GitLab CI either. But fortunately, there's now a webhook for emoji reactions.

Hadlock
Nov 9, 2004

Docjowles posted:

But Gitlab is pulling a "well we have all these features, and you gotta pay for them whether you want em or not" business model. Rather than breaking them into separate products and letting you buy what you want a la carte. The pricing has gotten outlandish enough that we're looking at moving to GitHub despite it being a huge undertaking.

TBF Github has been doing this since the Microsoft acquisition. Github Actions has really made their service very sticky. They have wikis and static site hosting and some other stuff too.

drunk mutt
Jul 5, 2011

I just think they're neat
We moved off BitBucket/CircleCI earlier this year to GitHub; it was a major event.

I cannot tell you how much happier the entire organization is afterwards. GHA has its oddities, but they're constantly improving it and there are some pretty nice and simple patterns for reusability for workflows.

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb

vanity slug posted:

I still can't make MR X depend on MR Y, which depends on MR Z. Can't choose whether to expand multi-line scripts in GitLab CI either. But fortunately, there's now a webhook for emoji reactions.

I want that first one, as well as being able to handle the "merge this into the oldest maintained release branch and all branches in between"

e.g. My critical security fix is targeting release/1.x and also needs to be merged into release/2.x and develop

Currently we just do the MR for the oldest branch, then the merge to the other branches happens without an MR. Would be nice to just have it all contained within a single MR though (with an option for empty merge for changes we don't want to merge forward)
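
Until something like that exists, the merge-forward chain can at least be computed and scripted. A small sketch, assuming the maintained branches are kept in a list ordered oldest to newest; the function name is made up and the git commands are printed rather than run.

```python
def merge_forward_plan(branches, start):
    """Return (source, target) pairs from `start` up through the newest branch.

    `branches` must be ordered oldest to newest.
    """
    idx = branches.index(start)  # raises ValueError if start isn't maintained
    chain = branches[idx:]
    return list(zip(chain, chain[1:]))


BRANCHES = ["release/1.x", "release/2.x", "develop"]

for source, target in merge_forward_plan(BRANCHES, "release/1.x"):
    # Printed rather than executed; --no-ff keeps an explicit merge commit.
    print(f"git checkout {target} && git merge --no-ff {source}")
```

For the example above this gives release/1.x into release/2.x, then release/2.x into develop.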

SurgicalOntologist
Jun 17, 2004

Hadlock posted:

Morbidly curious, but how does your product team collaborate with engineering on new products and features without a ticketing system

I think we had a misunderstanding somewhere along the line. We use GitLab issues/epics/milestones etc. If that doesn't count as a ticketing system, I'm confused.

It doesn't suit our needs for supporting our operations team because they're not in GitLab, plus the communication is too slow/async. If we end up building our own "support frontend" that hits the API of one or more services then sure we could tailor it to our needs, but I'd rather find something better suited out of the box (and then yes connect it to GL issues).


Docjowles posted:

Yeah Gitlab really wants to be your one stop shop for everything at all related to software development and operations. But it's not realistic for most companies that have been around for a while. If nothing else, freaking everyone has Jira for ticketing and sprint planning and Confluence or some other wiki for documentation and runbooks. But Gitlab is pulling a "well we have all these features, and you gotta pay for them whether you want em or not" business model. Rather than breaking them into separate products and letting you buy what you want a la carte. The pricing has gotten outlandish enough that we're looking at moving to GitHub despite it being a huge undertaking.

Yeah same. Trying to be less and less dependent on them. Not only the pricing changes, but they've been glacial at improving obvious gaps in the product. I'm probably following a dozen issues on their tracker that at some point looked promising but most have been silent for years. They try to be everything and it's all half-baked, there's almost always a better choice than adopting one of their features even considering we already pay for it.

Hadlock
Nov 9, 2004

SurgicalOntologist posted:

I think we had a misunderstanding somewhere along the line. We use GitLab issues/epics/milestones etc. If that doesn't count as a ticketing system, I'm confused.

Apparently your system works for you

I just deleted two paragraphs but the gist is I'm getting the impression that your product and engineering orgs have a long ways to go to reach true operational maturity. Good luck

Maybe gitlab has a true JIRA analog these days that is totally decoupled from repos for the product team? That would go a long ways probably

Docjowles
Apr 9, 2009

SurgicalOntologist posted:

I think we had a misunderstanding somewhere along the line. We use GitLab issues/epics/milestones etc. If that doesn't count as a ticketing system, I'm confused.

It doesn't suit our needs for supporting our operations team because they're not in GitLab, plus the communication is too slow/async. If we end up building our own "support frontend" that hits the API of one or more services then sure we could tailor it to our needs, but I'd rather find something better suited out of the box (and then yes connect it to GL issues).

Yeah same. Trying to be less and less dependent on them. Not only the pricing changes, but they've been glacial at improving obvious gaps in the product. I'm probably following a dozen issues on their tracker that at some point looked promising but most have been silent for years. They try to be everything and it's all half-baked, there's almost always a better choice than adopting one of their features even considering we already pay for it.

Heh yeah it’s become a meme with us that any time we request a feature or bug fix our account rep just posts a “+1 enterprise customer with XYZ seats requests this” on an issue from 2018 and nothing is ever heard from them again. It’s totally performative

I respect the honesty of companies like Amazon that respond to attempts at hardball with “lol come back when you have 3 more zeroes on your monthly bill tough guy”


SurgicalOntologist
Jun 17, 2004

Hadlock posted:

Apparently your system works for you

I just deleted two paragraphs but the gist is I'm getting the impression that your product and engineering orgs have a long ways to go to reach true operational maturity. Good luck

Maybe gitlab has a true JIRA analog these days that is totally decoupled from repos for the product team? That would go a long ways probably

I mean it "works" for us to an extent, but like I said we're not happy and trying to move away from Gitlab, actively for product management if only passively for everything else. And yes absolutely we're not at operational maturity.

It lacks a lot of things to be a true JIRA analog, they're clearly pulling in too many directions. There's some basic structure you can do with Epics, Milestones, and Iterations but most of what you want you need to setup with labels and figure out your own system for turning that into a workflow and using the labels to structure your boards and metrics. But with any moderately involved system it just becomes label soup. Just check the "child issues" here and you'll see what I mean -- and this is from GitLab's own dogfooding, which doesn't exactly inspire confidence: https://gitlab.com/groups/gitlab-org/-/epics/328

For us, running a basic kanban process with each team having a board for issues and a global board for epics, using milestones for longer cycles, it's been OK but we're outgrowing it. It's probably relevant context that our "product org" is... 2 people? And we have 12 engineers, that's if you count 3 students we support in their PhDs that are doing longer-term R&D.

But I mean, yeah, to answer your previous question, we do communicate through tickets (issues) even if we don't have some intricate JIRA setup.
