whats for dinner
Sep 25, 2006

IT TURN OUT METAL FOR DINNER!

The Iron Rose posted:

I have some questions about measuring reliability across multiple microservices and organizing trace parents.

I have an environment that runs on a cron, fetches messages from customer environments, and posts them to a message queue. I have a consumer that reads the messages and posts them to a workflow system, which executes a bunch of DAGs depending on what each message is. Our flow is: fetch (many) messages -> post (many) to queue -> consume (many) and submit each message (one) to workflow engine -> ~things~ happen to each message (one), things == a DAG -> output each message (one). This entire system is treated as a major business service. We are about to begin instrumenting all its microservices with OpenTelemetry.

I’m not sure how I want to do trace parents for it. If I mostly want to look at metrics from a per-message perspective, does it make sense to:

- start one trace parent for the fetch -> queue -> consume-and-submit bulk process, and start a new trace parent at the workflow engine when we're operating on one message at a time?
  - easier to get metrics on a per-message-lifetime basis w/o generating and persisting a messageID k/v pair through baggage
  - the workflow system microservices and the "fetch and submit to workflow" microservices are technically and logically distinct
OR
- start a trace parent on fetch, persist trace data through the queue, and generate a UUID per message and stick it in the results?
  - more expensive to persist trace data across a message queue and the workflow submission
  - disaggregating reliability when going from operating on many things to one thing seems like it'll make some stats weird
  - I do get the full end-to-end trace parent though


Ultimately, if I want to sample (100-X)% of successful traces (and 100% of failed traces), does it even matter that I can never say that for a specific message M we had R% reliability?

I can always say that if services A, B, … Z have AR%, BR%, … ZR% reliability, then reliability for the average message M is (AR + BR + … + ZR) / N. But then how do I handle the fact that service C is only invoked 70% of the time and D is invoked 15% of the time and so on, without tracking success metrics per message, while also dropping some percentage of success traces? If I have a sample rate of 30 (keeping 30% of successful traces, dropping 70%), I guess I can just work back from that to get the correct success count by scaling the observed successes: true S = observed S * (100 / sample rate). That should give me the correct value for S before I feed it into my metric calculations. But then how would I account for the rate at which each microservice or subcomponent is used, and feed that into my overall count without generating a metric for every message/microservice combo? At which point I feel like I'm not handling the sampling correctly. It's less data than a full trace, I guess.
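
Concretely, that correction sketches out like this (a toy example; it assumes failed traces are never dropped and successes are kept at sample_rate percent, and the numbers are made up):
code:
# Sketch: back out the true success count from tail-sampled traces.
# Assumes 100% of failed traces are kept and sample_rate% of successes.
def estimated_success_rate(s_observed, f_observed, sample_rate):
    s_true = s_observed * (100.0 / sample_rate)  # 30% kept => scale by 100/30
    return s_true / (s_true + f_observed)

# 300 sampled successes at a 30% sample rate ~= 1000 true successes,
# so with 50 failures the estimated success rate is 1000/1050 ~= 0.952
print(estimated_success_rate(300, 50, 30))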


Anyways, I feel like I'm overthinking this. Advice is welcome and appreciated!

For asynchronous and batch processes I think the way to handle it is with links rather than a parent/child relationship: https://opentelemetry.io/docs/reference/specification/overview/#links-between-spans. That way each distinct step of the process has its own trace but is linked back to a specific invocation of the cron or batch of messages submitted to the workflow engine. It means persisting a span ID through baggage, but it doesn't result in weird behaviour like a heap of concurrent spans under the cron parent span. It's very similar to your idea with the UUID, but a lot of trace aggregators/visualisers will include links in the UI and make them a bit easier to visualise overall.
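
Roughly, a sketch of that with the Python SDK - the carried IDs and the handler here are made up, but the links mechanism is the real API:
code:
# Consumer starts its own trace, linked back to the producer's span
# instead of parented under it. carried_* and handle() are hypothetical.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process(message, carried_trace_id: int, carried_span_id: int):
    # span context persisted over the queue (e.g. via baggage/headers)
    producer_ctx = trace.SpanContext(
        trace_id=carried_trace_id,
        span_id=carried_span_id,
        is_remote=True,
    )
    with tracer.start_as_current_span(
        "process-message",
        links=[trace.Link(producer_ctx)],  # link, not parent/child
    ):
        handle(message)  # hypothetical message handler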

If any given message can take a different path through your system, I'm tempted to say it's better to care only about the reliability of each service. It should be pretty easy to make that happen dynamically, right? Have a reliability metric labelled with the service name, and then have an alert on that metric that just includes the label. It would be much quicker to identify where the problem is happening and start troubleshooting, unless I'm missing something?
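
Something like this with the OTel metrics API, as a sketch (the metric and label names are made up):
code:
# One counter, labelled by service and outcome. Per-service reliability
# is then success/(success+failure) grouped by service.name, and an
# alert can just template the label into its message.
from opentelemetry import metrics

meter = metrics.get_meter(__name__)
processed = meter.create_counter(
    "messages.processed",
    description="Messages handled, by service and outcome",
)

def record_outcome(service: str, ok: bool) -> None:
    processed.add(
        1, {"service.name": service, "outcome": "success" if ok else "failure"}
    )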

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

whats for dinner posted:

For asynchronous and batch processes I think the way to handle it is with links rather than a parent/child relationship: https://opentelemetry.io/docs/reference/specification/overview/#links-between-spans. That way each distinct step of the process has its own trace but is linked back to a specific invocation of the cron or batch of messages submitted to the workflow engine. It means persisting a span ID through baggage, but it doesn't result in weird behaviour like a heap of concurrent spans under the cron parent span. It's very similar to your idea with the UUID, but a lot of trace aggregators/visualisers will include links in the UI and make them a bit easier to visualise overall.

If any given message can take a different path through your system, I'm tempted to say it's better to care only about the reliability of each service. It should be pretty easy to make that happen dynamically, right? Have a reliability metric labelled with the service name, and then have an alert on that metric that just includes the label. It would be much quicker to identify where the problem is happening and start troubleshooting, unless I'm missing something?

I edited my post a bit (most notably to correct bad math w/ my sampling formula - don’t do algebra at 3am, kids). Put simply, reliability for any message M != reliability of all services through which M *might* pass. I can easily measure reliability per service. But the more *useful* metric to provide to users (and customers!) is the success rate for the *average* message M, and possibly the reliability for each *class* of M (which also have varying paths). It also helps us better guide engineering decisions by focusing on the more heavily traversed nodes in our graph.


Links sound very useful here also, thanks.

The Iron Rose fucked around with this message at 08:38 on Apr 24, 2023

whats for dinner
Sep 25, 2006

IT TURN OUT METAL FOR DINNER!

The Iron Rose posted:

I edited my post a bit (most notably to correct bad math w/ my sampling formula - don’t do algebra at 3am, kids). Put simply, reliability for any message M != reliability of all services through which M *might* pass. I can easily measure reliability per service. But the more *useful* metric to provide to users (and customers!) is the success rate for the *average* message M, and possibly the reliability for each *class* of M.


Links sound very useful here also, thanks.

Right, got'cha. Before thinking about sampling, my first instinct would be to have an attribute that represents the final success or failure of a message, set either when it finishes the workflow successfully or fails for a terminal reason. If the class of the message is known in advance and already exists as an attribute on the trace, then the reliability for a given class of M is easy to calculate and expose to users. For workflow engine traces I'd try to split it up by message class before filtering successes and failures (I'm pretty sure that's doable with the otel collector).
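
As a sketch, with made-up attribute names and a hypothetical terminal-failure exception:
code:
# Stamp the workflow's root span with the message class and terminal
# outcome so traces can be split by class before computing reliability.
# Attribute names, run_dag, and TerminalFailure are made up.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def execute(message, message_class: str):
    with tracer.start_as_current_span("workflow-run") as span:
        span.set_attribute("message.class", message_class)  # known up front
        try:
            run_dag(message)  # hypothetical workflow execution
            span.set_attribute("message.outcome", "success")
        except TerminalFailure:
            span.set_attribute("message.outcome", "failure")
            span.set_status(Status(StatusCode.ERROR))
            raise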

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.

The Iron Rose posted:

... Put simply, reliability for any message M != reliability of all services through which M *might* pass. I can easily measure reliability per service. But the more *useful* metric to provide to users (and customers!) is the success rate for the :techno: ...

Presenting a metric for customers is a completely different goal than internal engineering/troubleshooting, and you shouldn't handle them the same way. For example you might have an initial customer request create a bunch of failed traces, but if the automatic failure handling was good enough then maybe the customer didn't notice -- should that transaction count against your metric? Probably not.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
We did a 2-day exercise in my company about how to measure reliability of services, and one key takeaway was this: the client and the user aren't the same thing, and you need to be careful that you understand which one you're aiming at when you talk about the reliability of an interaction using occurrence-based measurements. The second a client adds retries to a call, your lagging indicator of that interaction's absolute success rate quietly becomes a leading indicator of the risk that the client interaction will eventually fail, without you even realizing it.

jaegerx
Sep 10, 2012

Maybe this post will get me on your ignore list!


https://github.com/k8sgpt-ai/k8sgpt

i am a moron
Nov 12, 2020

"I think if there’s one thing we can all agree on it’s that Penn State and Michigan both suck and are garbage and it’s hilarious Michigan fans are freaking out thinking this is their natty window when they can’t even beat a B12 team in the playoffs lmao"
Just what I want right on the command line: an LLM that hallucinates answers to questions

madmatt112
Jul 11, 2016

Is that a cat in your pants, or are you just a lonely excuse for an adult?


Perfect, now I have a watertight excuse for my sloppy work.

Warbird
May 23, 2012

America's Favorite Dumbass

Oh boy, the client forgot to pay Atlassian so all our documentation is going *poof* come Sunday night and boy the support team sure isn't responding to requests to grant export access to our spaces.

https://www.youtube.com/watch?v=HdKqAVpUOwI

Saukkis
May 16, 2003

Unless I'm on the inside curve pointing straight at oncoming traffic the high beams stay on and I laugh at your puny protest flashes.
I am Most Important Man. Most Important Man in the World.

Warbird posted:

Oh boy, the client forgot to pay Atlassian so all our documentation is going *poof* come Sunday night and boy the support team sure isn't responding to requests to grant export access to our spaces.

https://www.youtube.com/watch?v=HdKqAVpUOwI

Could you use the CLIget extension and then wget the whole site?

jaegerx
Sep 10, 2012

Maybe this post will get me on your ignore list!


Saukkis posted:

Could you use the CLIget extension and then wget the whole site?

No. He can’t. We need content.

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb
I've been using docker environment files like .env.local and .env.prod to handle things like an ALLOWED_IP that ends up in an allow statement in an nginx config file via envsubst. It works fine, but it seems cumbersome with lists of things, like if I want ALLOWED_IPS=192.168.5.1,172.16.3.1 to translate into multiple allow lines in nginx:
code:
allow 192.168.5.1;
allow 172.16.3.1;
I then ended up just having variables like ALLOWED_IP_0 and ALLOWED_IP_1, but this is a pain in the rear end if there's a variable number of IPs across the different environments. I was thinking a more powerful templating engine like Jinja would be helpful, but I wasn't quite sure how to pass the JSON of variables in the docker compose command. Or maybe there's a better way to handle this?

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

fletcher posted:

I've been using docker environment files like .env.local and .env.prod to handle things like an ALLOWED_IP that ends up in an allow statement in an nginx config file via envsubst. It works fine, but it seems cumbersome with lists of things, like if I want ALLOWED_IPS=192.168.5.1,172.16.3.1 to translate into multiple allow lines in nginx:
code:
allow 192.168.5.1;
allow 172.16.3.1;
I then ended up just having variables like ALLOWED_IP_0 and ALLOWED_IP_1, but this is a pain in the rear end if there's a variable number of IPs across the different environments. I was thinking a more powerful templating engine like Jinja would be helpful, but I wasn't quite sure how to pass the JSON of variables in the docker compose command. Or maybe there's a better way to handle this?
There's no standard way, just use whatever method of rendering out a config file at startup feels natural to you. sed is fine for some folks, others like consul-template, many folks will write a Jinja2 thing. Just be aware that Jinja2 doesn't have a from_json filter (that's an Ansible extension), so whatever wrapper script is calling it will have to do the work of decoding envvars where necessary.
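
e.g., a minimal sketch of that wrapper, assuming a comma-separated ALLOWED_IPS and a made-up output path:
code:
# Wrapper-script sketch: decode the env var in Python (plain Jinja2 has
# no from_json filter) and render the allow lines. Paths are made up.
import os
from jinja2 import Template

allowed_ips = [ip for ip in os.environ.get("ALLOWED_IPS", "").split(",") if ip]

template = Template(
    "{% for ip in allowed_ips %}allow {{ ip }};\n{% endfor %}deny all;\n"
)

with open("/etc/nginx/conf.d/allowlist.conf", "w") as f:
    f.write(template.render(allowed_ips=allowed_ips))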

trem_two
Oct 22, 2002

it is better if you keep saying I'm fat, as I will continue to score goals
Fun Shoe

fletcher posted:

I've been using docker environment files like .env.local and .env.prod to handle things like an ALLOWED_IP that ends up in an allow statement in an nginx config file via envsubst. It works fine, but it seems cumbersome with lists of things, like if I want ALLOWED_IPS=192.168.5.1,172.16.3.1 to translate into multiple allow lines in nginx:
code:
allow 192.168.5.1;
allow 172.16.3.1;
I then ended up just having variables like ALLOWED_IP_0 and ALLOWED_IP_1, but this is a pain in the rear end if there's a variable number of IPs across the different environments. I was thinking a more powerful templating engine like Jinja would be helpful, but I wasn't quite sure how to pass the JSON of variables in the docker compose command. Or maybe there's a better way to handle this?

To get around the "ALLOWED_IP_0/ALLOWED_IP_1/etc." problem, have your main nginx conf pull in those allowlists via file "include" statements, and set a unique IP allowlist conf file per environment. You could then generate the per-env allowlist conf file via env vars and a template system (I've done it with golang templates in kubernetes init containers when that was most appropriate), but another approach would be to keep all the various allowlist conf files in source control and use an env var to declare which one is used for the env.
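
A minimal sketch of that second approach, assuming a hypothetical APP_ENV variable picks the checked-in file:
code:
# Entrypoint-step sketch for the "checked-in conf per env" variant.
# APP_ENV and the paths are made up; nginx.conf just contains
#   include /etc/nginx/allowlist.conf;
import os
import shutil

env = os.environ.get("APP_ENV", "local")  # e.g. "local" or "prod"
shutil.copy(f"/config/allowlist.{env}.conf", "/etc/nginx/allowlist.conf")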

Hadlock
Nov 9, 2004

fletcher posted:

I've been using docker environment files like .env.local and .env.prod to handle things like an ALLOWED_IP that ends up in an allow statement in an nginx config file via envsubst. It works fine, but it seems cumbersome with lists of things, like if I want ALLOWED_IPS=192.168.5.1,172.16.3.1 to translate into multiple allow lines in nginx:
code:
allow 192.168.5.1;
allow 172.16.3.1;
I then ended up just having variables like ALLOWED_IP_0 and ALLOWED_IP_1, but this is a pain in the rear end if there's a variable number of IPs across the different environments. I was thinking a more powerful templating engine like Jinja would be helpful, but I wasn't quite sure how to pass the JSON of variables in the docker compose command. Or maybe there's a better way to handle this?

For local, I would have the container import from a .csv or maybe a flat file (dealer's choice), and then if prod=0, feed the csv into an nginx Jinja template and export to wherever.

For prod, if prod=1, read from an S3 bucket or your favorite KV store like Redis; either hard-code the URL in the container or pass it in as a second variable.

Yeah, there's no standard way of doing this. It's very common to have a bash script with a couple of switch statements to handle local vs prod.

The problem is that developers trying to extend the container get confused by the switch statements, don't like asking for help, and then it's "docker is bad because the necessary evil of complex configuration made me confused".

Probably there should be an annual project to unfuck the way config is loaded into the container, as it'll inevitably build up cruft. In theory Helm solves this with SOPS, but that's a different can of worms.
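
As a sketch of the local-vs-prod split, with made-up env var names for the S3 location:
code:
# Loader sketch: flat file when prod=0, S3 when prod=1.
# PROD, ALLOWLIST_BUCKET, and ALLOWLIST_KEY are assumed env vars.
import os

def load_allowed_ips():
    if os.environ.get("PROD") == "1":
        import boto3
        body = boto3.client("s3").get_object(
            Bucket=os.environ["ALLOWLIST_BUCKET"],
            Key=os.environ["ALLOWLIST_KEY"],
        )["Body"].read().decode()
    else:
        with open("allowed_ips.csv") as f:
            body = f.read()
    return [ip.strip() for ip in body.replace("\n", ",").split(",") if ip.strip()]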

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
If I'm not building for reusability by some downstream consumer, I'd probably just have some kind of config management generating whole Nginx config files for the container to pull in

Zapf Dingbat
Jan 9, 2001


What can I do to create a daily record of our AWS infrastructure, so I can look back at previous states and correct mistakes? I'm talking complete stare-and-compare, nothing about automated restore.

This would probably be trivial if we had some kind of Terraform workflow but we don't. Having this will finally get my boss to trust me with more permissions.

The Fool
Oct 16, 2003


run cloudformer every time you make a change?

Docjowles
Apr 9, 2009

I think you can get this poo poo out of AWS Config although it's not going to be in a super digestible format out of the box

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb

Zapf Dingbat posted:

What can I do to create a daily record of our AWS infrastructure, so I can look back at previous states and correct mistakes? I'm talking complete stare-and-compare, nothing about automated restore.

This would probably be trivial if we had some kind of Terraform workflow but we don't. Having this will finally get my boss to trust me with more permissions.

Do you have an example of a mistake you want this process to be able to address?

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Docjowles posted:

I think you can get this poo poo out of AWS Config although it's not going to be in a super digestible format out of the box
The format is fine, it's basically a resource-agnostic schema with the resource attributes packed into a JSON-string field

Zapf Dingbat
Jan 9, 2001


fletcher posted:

Do you have an example of a mistake you want this process to be able to address?

I have some routes and security groups that I changed or deleted, but I need to bring them back to what they looked like yesterday.

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb

Zapf Dingbat posted:

I have some routes and security groups that I changed or deleted, but I need to bring them back to what they looked like yesterday.

Sounds like you definitely need to get some IaC going with Terraform!

Zapf Dingbat
Jan 9, 2001


:argh:

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb
We have a cron job that uses the AWS CLI to export resources & config to CSV files every day, I believe for some sort of compliance purpose. Something like that might be useful for your purposes. CloudTrail can also give you insight into changes over time, though it's more of an audit trail, so I think it would be quite cumbersome to try to revert something to the way it was a few days ago.
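
Something like this, as a sketch (boto3, read-only describe calls; extend with whatever resources you care about):
code:
# Daily snapshot sketch: dump security groups and route tables to dated
# JSON files so yesterday's state can be diffed by eye.
import datetime
import json
import boto3

ec2 = boto3.client("ec2")
today = datetime.date.today().isoformat()

snapshot = {
    "security_groups": ec2.describe_security_groups()["SecurityGroups"],
    "route_tables": ec2.describe_route_tables()["RouteTables"],
}

with open(f"aws-snapshot-{today}.json", "w") as f:
    json.dump(snapshot, f, indent=2, default=str)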

Hadlock
Nov 9, 2004

Recruiter emails have gotten few and far between. ChatGPT and Stable Diffusion seem to be the current hot technology, and 4/5 companies lately have some kind of "AI" component to them, just like how everyone was using blockchain in some capacity three years ago.

Anyways, any suggestions on how you'd tune your devops/infrastructure to better meet "AI" needs? The latest company I talked to is doing low-latency text-to-speech. Do you just treat these nodes as heavy-duty CPU/memory machines, or is there some additional magic sprinkled in here?

The Fool
Oct 16, 2003


I just click a couple buttons to deploy an openai instance in azure

if architecture approves it maybe we'll do a terraform module when the resource is added to the provider

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Hadlock posted:

Recruiters emails have gotten few and far between. ChatGPT and Stable Diffusion seem to be the current hot technology and 4/5 companies lately have some kind of "AI" component to them just like how everyone was using blockchain in some capacity three years ago.

Anyways, any suggestions on how you'd tune your devops/infrastructure to better meet "AI" needs? The latest company I talked to is doing low latency text to speech. Do you just treat these nodes as heavy duty cpu/memory machines or is there some additional magic sprinkled here
At the end of the day there are services and batch/training, just like everyone had before. The major gray area is around model testing and data quality, and to what extent you can preempt a long-running job when necessary

Hadlock
Nov 9, 2004

Thanks. It didn't occur to me that Azure would have a $ endpoint service for GPT so quickly. I probably should sit down and play around with that stuff; I haven't touched Azure beyond a few minutes to tweak one thing a million years ago.

New question - how would you auto-provision a Jupyter notebook, with cloud backup of the... I guess storage backend for those? I was talking about auto-provisioning helm-charted monolith SaaS stuff during an interview, and one of the interviewers started speculating on how to auto-provision this for new analyst types.

The Fool
Oct 16, 2003


Hadlock posted:

Thanks. It didn't occur to me that Azure would have a $ endpoint service for GPT so quickly. I probably should sit down and play around with that stuff, haven't touched Azure beyond a few minutes to tweak one thing a million years ago

ms is a significant investor in openai, I would have been more surprised if they didn't turn it into an azure service

Hadlock
Nov 9, 2004

Oh yeah, I'm aware of the investment, I just didn't realize how fast they went from "this is a late stage research project" to "production ready, gently caress yeah here we go".

I guess since it's got a big asterisk (*product is known to hallucinate, regularly), there's not a lot of QA that needs to go into the product to make it "production ready", vs something like Fargate, which is long-lived and runs customer code.

i am a moron
Nov 12, 2020

"I think if there’s one thing we can all agree on it’s that Penn State and Michigan both suck and are garbage and it’s hilarious Michigan fans are freaking out thinking this is their natty window when they can’t even beat a B12 team in the playoffs lmao"
There are a lot of gates around using it; not sure it's really production ready, but I have a client who is about to find out how well it can sell pants.

The Fool
Oct 16, 2003


I think it's in some sort of private preview still? We had to get our account rep involved to turn it on for the subscriptions we wanted it on at least

i am a moron
Nov 12, 2020

"I think if there’s one thing we can all agree on it’s that Penn State and Michigan both suck and are garbage and it’s hilarious Michigan fans are freaking out thinking this is their natty window when they can’t even beat a B12 team in the playoffs lmao"
I’m not directly involved, but as I understand it they’re performing ethics reviews and other AI-related things to ensure applicability and non-skynetness.

Warbird
May 23, 2012

America's Favorite Dumbass

gently caress Jenkins

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Warbird posted:

gently caress Jenkins
Jenkins is the best product in the world at being pathologically Jenkins

Docjowles
Apr 9, 2009

Warbird posted:

gently caress Jenkins

:haibrow:

deedee megadoodoo
Sep 28, 2000
Two roads diverged in a wood, and I, I took the one to Flavortown, and that has made all the difference.


Jenkins is terrible. But still somehow also the best option.

Hadlock
Nov 9, 2004

From a "get it done right loving now, we'll worry about tech debt in 2-5 years" perspective, Jenkins is not terrible. It beats the hell out of a bunch of undiscoverable, undocumented scripts running from some undocumented unix user's personal cron, with credential files/env vars stored haphazardly on the disk. If you point it at pipelines in a private git repo, it's almost not embarrassing.

Jenkins is bad, but tending a garden of undocumented cron jobs with a broken hand-built "job failure alert" system is much, much worse.

12 rats tied together
Sep 7, 2006

jenkins is completely irredeemable unless the context is "you've hired me, jenkins is what i know, and we can't justify spending the time for me to learn something new"

if that's you, that's fine, and it's a reasonable decision for an organization to make. let's stop defending jenkins in the abstract though.
