Continuous Integration/build engineering/devops thread

Vulture Culture: Jul 14, 2003; I was never enjoying it. I only eat it for the nutrients.

luminalflux posted:

We use SQLAlchemy too, it's fairly easy with sqlalchemy.events.ConnectionEvents.before_cursor_execute to hook the cursor event and modify the query. We've done this with Datadog, where we actually kick open a new span inside this hook, add some tags we glean from various context handlers, and write the span and trace ID inside the comment as well as the callsite. This makes it easy to look at the process list, find the trace and span, but also see which goddamn file it's coming from.

Do you find this more helpful than using opentelemetry-instrumentation-sqlalchemy and just letting OTel sort it out?

# ? Sep 12, 2023 03:48

luminalflux: May 27, 2005

Vulture Culture posted:

Do you find this more helpful than using opentelemetry-instrumentation-sqlalchemy and just letting OTel sort it out?

I haven't started looking into OTEL until fairly recently and then only in Go. We've been instrumenting using ddtrace for over 6 years now. I added the hook to comment about 3 years ago when OTEL was a lot rougher. But it's cool that someone added that to OTEL.

edit: does opentelemetry-instrumentation-sqlalchemy add callsites? since we get a huge amount of value out of that (prob even more than the traces) since you can just go to the file
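A rough sketch of the kind of hook described above, not the actual production code. It assumes SQLAlchemy's `before_cursor_execute` event with `retval=True`; `annotate_sql`, `pick_callsite`, `install_comment_hook`, and the `get_trace_ids` callable are hypothetical names, and filtering frames on "site-packages" is just one guess at how to find the application callsite. The comment is appended here; prepending would work the same way.

```python
import traceback


def annotate_sql(statement, **tags):
    """Append a /* key=value ... */ comment to a SQL statement."""
    parts = []
    for key, value in sorted(tags.items()):
        # Drop asterisks so a tag value can never open or close a comment.
        clean = str(value).replace("*", "")
        parts.append(f"{key}={clean}")
    return f"{statement} /* {' '.join(parts)} */"


def pick_callsite(frames, skip=("site-packages",)):
    """Return 'file:line' for the innermost frame outside library code.

    frames is a list of (filename, lineno) pairs, outermost first.
    """
    for filename, lineno in reversed(frames):
        if not any(marker in filename for marker in skip):
            return f"{filename}:{lineno}"
    return "unknown"


def install_comment_hook(engine, get_trace_ids):
    """Attach the hook to a SQLAlchemy engine.

    get_trace_ids is a hypothetical callable returning (trace_id, span_id),
    for example read from whatever tracer is active.
    """
    from sqlalchemy import event  # imported lazily; requires SQLAlchemy

    def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
        # Drop the last frame (this function) before searching for app code.
        stack = [(f.filename, f.lineno) for f in traceback.extract_stack()[:-1]]
        trace_id, span_id = get_trace_ids()
        statement = annotate_sql(
            statement,
            trace_id=trace_id,
            span_id=span_id,
            callsite=pick_callsite(stack),
        )
        return statement, parameters

    # retval=True tells SQLAlchemy to use the returned statement and parameters.
    event.listen(engine, "before_cursor_execute", before_cursor_execute, retval=True)
```

With this in place, a query in the process list would carry something like `/* callsite=/app/views.py:12 span_id=... trace_id=... */` at the end.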

# ? Sep 12, 2023 04:05

Necronomicon: Jan 18, 2004

Does anyone have some super baseline monitoring / alerting advice for a handful of lightweight ECS clusters? I have four clusters across two AWS accounts with no more than 8 services running on each. This is a fairly simple product without a lot of moving parts and all of the services are outputting logs into CloudWatch. I figure AWS-native tools will be just fine and I don't need to get anywhere near DataDog.

In re: discussion about how much programming one should be able to do as a DevOps engineer, I've written very little code for the last several years (outside of a bunch of Terraform). I think I've been able to succeed by a combination of being a generalist and having a decent amount of soft skills. I do see it as an impediment to my future, but it's hard to find time to take like, a Go course if I don't have time for it after work since there's a baby to be responsible for now.

# ? Sep 13, 2023 15:34

The Fool: Oct 16, 2003

golang would be a waste of time, but python is a must

# ? Sep 13, 2023 15:38

The Fool: Oct 16, 2003

But even then you're not building full applications. The most complicated stuff I make is still essentially gluing apis together in a giant rube Goldberg machine

# ? Sep 13, 2023 15:39

Vulture Culture: Jul 14, 2003; I was never enjoying it. I only eat it for the nutrients.

If you don't own your whole environment and you need to interoperate with things other people are building, that you don't control, you need to have sufficient data literacy and analysis skills to understand small details of your environment at scale. This is the key to migrations, and migrations are the key to actually bringing operational cost down by consolidating and automating platforms. You can lean on talented SWEs in your organization, but this makes ad-hoc exploratory analysis harder, and to make it work, you need to have a great idea up front of what hypotheses you're trying to invalidate.

I've solved a lot of interesting problems by being a generalist expert in a dozen completely separate things, and being able to keep pace with staff SWEs in different domains. To a point, it's gotten me higher up the ladder than people with narrower skillsets. But you'll always hit a ceiling as a generalist somewhere. Build the skills you need to get the job done well, but before you build something, figure out if the company should hire another person to do the thing you're thinking of. If you don't foresee a role being in charge of that thing, and that being an intelligent hire, you've gone too deep down a rabbit hole.

# ? Sep 13, 2023 17:44

Vulture Culture: Jul 14, 2003; I was never enjoying it. I only eat it for the nutrients.

luminalflux posted:

I haven't started looking into OTEL until fairly recently and then only in Go. We've been instrumenting using ddtrace for over 6 years now. I added the hook to comment about 3 years ago when OTEL was a lot rougher. But it's cool that someone added that to OTEL.

edit: does opentelemetry-instrumentation-sqlalchemy add callsites? since we get a huge amount of value out of that (prob even more than the traces) since you can just go to the file

I haven't gotten to testing that part yet, so I'm not positive. It is in OTel-instrumented logs when using opentelemetry-instrumentation-logging.

# ? Sep 13, 2023 17:57

Docjowles: Apr 9, 2009

Terraform fork officially joins the Linux Foundation. Also for some reason it's now "OpenTofu" which is a really stupid sounding name but whatever

https://www.linuxfoundation.org/press/announcing-opentofu

# ? Sep 20, 2023 17:50

The Iron Rose: May 12, 2012; Cat Army

I have a best practices question. Simplifying somewhat, I have Service A that creates a workflow in Argo Workflows, which will launch Service B. We include the trace context as part of the argo message, Service B retrieves that context and uses it for all future spans. Service B may then invoke Service C and Service D, and we use the Propagation API to extract and inject that context for downstream services.

There's a piece of metadata that we want to include that comes from the argo message, the correlation_id . Ideally, we'd include that as a span attribute for all downstream services. What is an effective way to do so? We could use baggage, which is included in Context Propagation, but baggage does not automatically create attributes in child spans, and it must explicitly be retrieved instead.

Put generally, is there a way to automatically set attributes for all future child spans when passing context between services? Or am I misunderstanding the baggage API?

I'm looking at https://docs.honeycomb.io/getting-data-in/opentelemetry/python-distro/#multi-span-attributes and the https://opentelemetry.io/docs/specs/otel/overview/#baggage-signal and finding myself a bit confused.

# ? Sep 20, 2023 18:25

TheJanitor: Apr 17, 2007; Ask me about being the strongest janitor since Roger Wilco

The Iron Rose posted:

I have a best practices question. Simplifying somewhat, I have Service A that creates a workflow in Argo Workflows, which will launch Service B. We include the trace context as part of the argo message, Service B retrieves that context and uses it for all future spans. Service B may then invoke Service C and Service D, and we use the Propagation API to extract and inject that context for downstream services.

There's a piece of metadata that we want to include that comes from the argo message, the correlation_id . Ideally, we'd include that as a span attribute for all downstream services. What is an effective way to do so? We could use baggage, which is included in Context Propagation, but baggage does not automatically create attributes in child spans, and it must explicitly be retrieved instead.

Put generally, is there a way to automatically set attributes for all future child spans when passing context between services? Or am I misunderstanding the baggage API?

I'm looking at https://docs.honeycomb.io/getting-data-in/opentelemetry/python-distro/#multi-span-attributes and the https://opentelemetry.io/docs/specs/otel/overview/#baggage-signal and finding myself a bit confused.

I had this exact problem with OTEL/baggage earlier this year. No idea why it doesn't work this way out of the box as you'll find lots of people asking the same question, but you have to implement it yourself using a custom BatchSpanProcessor (which isn't that hard it turns out):

code:

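import typing

from opentelemetry import baggage
from opentelemetry.context import Context
# gRPC exporter assumed here; an HTTP variant lives under ...proto.http.trace_exporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import Span
from opentelemetry.sdk.trace.export import BatchSpanProcessor


class BatchBaggageSpanProcessor(BatchSpanProcessor):
    """Copy every baggage entry onto each span as it starts."""
    def on_start(
        self, span: Span, parent_context: typing.Optional[Context] = None
    ) -> None:
        super().on_start(span, parent_context)
        # With parent_context=None, get_all falls back to the current context.
        entries = baggage.get_all(context=parent_context)
        for name, value in entries.items():
            span.set_attribute(name, value)

...

# Wherever you set up your OTEL tracer stuff (provider is your TracerProvider)
processor = BatchBaggageSpanProcessor(OTLPSpanExporter())
provider.add_span_processor(processor)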

# ? Sep 20, 2023 19:34

Hadlock: Nov 9, 2004

OpenTofu shortens to recognizable variations like OTF or OpenTF while still sidestepping the trademark issues you'd get from just calling it "TF"; the HashiCorp logo for Terraform is defensibly a stylized TF. Also "opentofu" is shorter to write than "terraform" by one character, and "tofu" is many chars shorter. Plus now we have a much better theme for OpenTofu-related projects, with a billion soy-related naming conventions people can play off. I'm fine with it.

# ? Sep 20, 2023 20:14

vanity slug: Jul 20, 2010

can't wait for the opentofu / chef collaboration

# ? Sep 21, 2023 10:04

Sagacity: May 2, 2003; Hopefully my epitaph will be funnier than my custom title.

vanity slug posted:

can't wait for the opentofu / chef collaboration

i'm sure we can fit an "omakase" reference in somewhere

# ? Sep 21, 2023 12:28

Vulture Culture: Jul 14, 2003; I was never enjoying it. I only eat it for the nutrients.

Getting ready to s/terragrunt/soyboy/g all my repositories

# ? Sep 25, 2023 04:00

Hadlock: Nov 9, 2004

I'm already cybersquatting that domain get on my level :synpa:

# ? Sep 25, 2023 10:44

SurgicalOntologist: Jun 17, 2004

Hi, I'm looking for any suggestions for tools like Pagerduty, zenduty, etc. We are a small team (the on-call rotation is 4) standardizing incident management for the first time. Notably, we have a somewhat atypical use case: besides incidents triggered by Prometheus alerts, we also have manually raised incidents. To simplify quite a bit, our product is a glorified ETL system, and we have internal operators setting up the inputs and sometimes something unexpected happens. Despite building lots of internal tooling to give feedback to the operators so they can correct it (e.g. by adding a step in the pipeline), sometimes they need help and raise an incident to engineering. These are the incidents we are most interested in tracking, to keep metrics about the frequency, attribution (e.g. bad input, operator error, bug), time to resolve etc.

We are now in a trial of zenduty and it seems fine, we liked it because you mostly manage the incidents in Slack which is how we do it now and is more appealing than a totally new external platform. That said, it's not super polished. But the big issue is that there isn't first-class support for manually raising incidents (and giving feedback to the one who raised it) without adding them into the platform. We would have to build a Slackbot for that. We will consider that approach, but in the meantime I thought I would ask if pagerduty or any other alternative (google brought up "ilert" for example) might be better suited for manually raised incidents. Or are we thinking of the wrong type of tool and we want a helpdesk app instead? Any recommendations welcome.

And to preempt, of course most of those operators work can be automated; the team handling on-call is exactly the team responsible for that. But there is some data entry required in each pipeline so there will always be an internal operator involved -- we only try to reduce their work and increase the success rate.

# ? Oct 2, 2023 13:10

Hadlock: Nov 9, 2004

This sounds like a process issue, not a monitoring and alerting issue. I would push your ops manager to design a change control process to handle this

Something like three outcomes

1. ETL does not fail. No ticket

2. a. ETL fails
2. b. operator adds extra step
2. c. ETL works again. File ticket in JIRA to document change for change control purposes -> autoclose

3. a. ETL fails
3. b. operator adds extra step
3. c. ETL fails again. Operator files ticket in JIRA to escalate. This sends an email to ops manager and eng manager + director
3. d. JIRA escalation ticket is triaged by X party (ops and eng manager?) and assigned
3. e. Engineer fixes ETL, sends to operator
3. f. ETL works again. operator closes ticket
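The three outcomes above boil down to a small decision function. A minimal sketch, assuming each run can be reduced to a pass/fail flag; `Action` and `next_action` are made-up names, and the actual ticket filing is left out.

```python
from enum import Enum


class Action(Enum):
    NO_TICKET = "no ticket"
    CHANGE_TICKET = "file change-control ticket, auto-close"
    ESCALATE = "file escalation ticket, notify managers"


def next_action(first_run_ok, retry_ok=None):
    """Map ETL run results to one of the three outcomes.

    retry_ok is the result after the operator adds an extra step;
    it is ignored when the first run succeeded.
    """
    if first_run_ok:
        return Action.NO_TICKET      # outcome 1
    if retry_ok:
        return Action.CHANGE_TICKET  # outcome 2
    return Action.ESCALATE           # outcome 3
```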

why did you not choose pager duty? price?

# ? Oct 2, 2023 19:22

Docjowles: Apr 9, 2009

Yeah pager duty is the gold standard and unless it is cost prohibitive I would look there first. You can easily manually raise incidents by email or api or whatever you want. It has integrations with literally everything, including slack. It's an excellent product and I would want a good reason before looking at anything else
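For the "by api" route, a manually raised incident is a single JSON document. A minimal sketch of building one for the PagerDuty Events API v2, assuming you already have an integration routing key; `build_trigger_event` is a made-up helper, and the summary, source, and details you pass in are your own. The body would be POSTed to https://events.pagerduty.com/v2/enqueue; sending is left out here.

```python
import json


def build_trigger_event(routing_key, summary, source, severity="error", details=None):
    """Build the JSON body for a PagerDuty Events API v2 trigger event."""
    if severity not in {"critical", "error", "warning", "info"}:
        raise ValueError(f"unsupported severity: {severity}")
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if details:
        # Free-form metadata, e.g. who raised it and the suspected cause.
        event["payload"]["custom_details"] = details
    return json.dumps(event)
```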

E: removed some stuff after rereading OP

# ? Oct 2, 2023 20:07

Falcon2001: Oct 10, 2004; Eat your hamburgers, Apollo.; Pillbug

Docjowles posted:

Yeah pager duty is the gold standard and unless it is cost prohibitive I would look there first. You can easily manually raise incidents by email or api or whatever you want. It has integrations with literally everything, including slack. It's an excellent product and I would want a good reason before looking at anything else

E: removed some stuff after rereading OP

Absolutely PagerDuty is a great product. I've used it once or twice and also used Big Tech replacements, which tend to be...fine, compared to PD at best. A single outage could cost you more than a few years of PD based on their pricing plans (depending on impact/etc), so I'd definitely recommend it.

If there's no financial impact to these issues or the cost is prohibitive otherwise then Hadlock might be right; this might be something you can fix via process.

# ? Oct 2, 2023 20:55

Hadlock: Nov 9, 2004

Yeah it costs $100k/hr hard cost in revenue loss for us to be down with a MTTR of 45 min; plus the soft costs of ripping engineers away from whatever the hell they were doing that day. $35/mo/engineer is not even a rounding error. If our paging service was down for 15 minutes that just cost us $25k
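The arithmetic behind those numbers, as a quick sketch. The hourly loss and seat price come from the post; the team size of 10 is a made-up figure for comparison.

```python
REVENUE_LOSS_PER_HOUR = 100_000  # dollars, from the post
SEAT_COST_PER_MONTH = 35         # dollars per engineer, from the post


def outage_cost(minutes):
    """Hard revenue loss for an outage of the given length."""
    return REVENUE_LOSS_PER_HOUR * minutes / 60


def annual_seat_cost(engineers):
    """Yearly paging-tool spend for a team of the given size."""
    return SEAT_COST_PER_MONTH * 12 * engineers


print(outage_cost(15))       # 25000.0
print(outage_cost(45))       # 75000.0
print(annual_seat_cost(10))  # 4200 (team size of 10 is a made-up example)
```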

# ? Oct 2, 2023 23:44

SurgicalOntologist: Jun 17, 2004

Since everyone mentioned it, yeah we'll do a Pagerduty trial too, I don't recall why we chose to try Zenduty first but those are the two on our list to try. But now that we're trying it I'm wondering if we have the right type of tool for our use case in the first place. For 4 users pricing isn't really an issue as long as we only pay for responders.

Hadlock posted:

This sounds like a process issue, not a monitoring and alerting issue. I would push your ops manager to design a change control process to handle this

Something like three outcomes

1. ETL does not fail. No ticket

2. a. ETL fails
2. b. operator adds extra step
2. c. ETL works again. File ticket in JIRA to document change for change control purposes -> autoclose

3. a. ETL fails
3. b. operator adds extra step
3. c. ETL fails again. Operator files ticket in JIRA to escalate. This sends an email to ops manager and eng manager + director
3. d. JIRA escalation ticket is triaged by X party (ops and eng manager?) and assigned
3. e. Engineer fixes ETL, sends to operator
3. f. ETL works again. operator closes ticket

Yeah, our ops process could definitely be better systematized, although this is pretty much what we're doing, just replace the JIRA ticket with a message tagging a Slack group with rotating membership, and we discuss in the thread.

But you're right, it's not necessarily a monitoring and alerting issue, more about the process and the communication. The key things we need are call routing and the ability to apply metadata (tags or categories) to incidents in order to track metrics of the process -- nothing crazy.
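Whatever tool ends up holding the incidents, the metrics themselves are simple once each incident has an attribution tag and open/resolve timestamps. A minimal sketch under that assumption; `Incident` and `summarize` are hypothetical names and not tied to any vendor's export format.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    opened: datetime
    resolved: datetime
    attribution: str  # e.g. "bad input", "operator error", "bug"


def summarize(incidents):
    """Return {attribution: (count, mean minutes to resolve)}."""
    minutes = defaultdict(list)
    for inc in incidents:
        elapsed = (inc.resolved - inc.opened).total_seconds() / 60
        minutes[inc.attribution].append(elapsed)
    return {tag: (len(vals), sum(vals) / len(vals)) for tag, vals in minutes.items()}
```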

We thought about a ticket-based system but those all seem too async for the number of back-and-forths we usually take and the speed with which we are typically resolving the problems. Although having typed that out, I'm wondering if we missed a whole category of product -- helpdesk with live chat. We use Gitlab issues for engineering work items which would be too async for incidents, but for sure there are chat-based support platforms that might make sense for us. Then we may still need pagerduty/zenduty for the call routing, so it starts to get complicated but anyway any one of these pieces will be an incremental improvement.

If we stick with Pagerduty/Zenduty then it looks like we need to implement a Slack bot against the incident management API. Which isn't a bad idea as we can also start to incorporate some of the manual steps into Slack commands/buttons (against our own API), something we've been planning. But my main question coming here was to look for any option where we don't have to build a Slackbot in order to support manually triggered incidents and provide ongoing feedback back to the one who triggered it. Which yeah... that's a helpdesk not an incident management platform, right?

# ? Oct 3, 2023 00:52

Falcon2001: Oct 10, 2004; Eat your hamburgers, Apollo.; Pillbug

Some minor notes since I'm on my phone:

Help desk is a business group that helps users with their computers, not a product category. I think you're thinking of a ticketing system instead. If you're already using Jira there might be integrations available for PD but I've never used Jira so no idea.

# ? Oct 3, 2023 03:47

Hadlock: Nov 9, 2004

SurgicalOntologist posted:

These are the incidents we are most interested in tracking, to keep metrics about the frequency, attribution (e.g. bad input, operator error, bug), time to resolve etc.

Jira has an entire suite of analytics/graphing features, so you can do all kinds of analysis on this stuff. Pager duty can shoot off an API request to both JIRA and slack, but IMO it's better to have pager duty generate a ticket, and then have JIRA send a notification to slack.

Slack is super cool but IMO it's a black hole for information. If you want to track metrics on process you need to document it in a database -> that is JIRA. People have a tendency to try and cram everything into slack. As they say, you can't manage what you can't measure, and slack is an amorphous blob of random garbage.

# ? Oct 3, 2023 04:37

Mustache Ride: Sep 11, 2001

Also if you subscribe to Jira cloud you get opsgenie, Atlassian's pagerduty for free.

# ? Oct 3, 2023 13:26

vanity slug: Jul 20, 2010

Mustache Ride posted:

Also if you subscribe to Jira cloud you get opsgenie, Atlassian's pagerduty for free.

Yeah, that's great if you're okay with your incident management system being unavailable for over a week.

# ? Oct 3, 2023 13:48

Trapick: Apr 17, 2006