|
The Iron Rose posted:I have some questions about measuring reliability across multiple microservices and organizing trace parents.

For asynchronous and batch processes I think the way to handle it is with links rather than a parent/child relationship: https://opentelemetry.io/docs/reference/specification/overview/#links-between-spans. That way each distinct step of the process has its own trace but is linked back to a specific invocation of the cron or batch of messages submitted to the workflow engine. It means persisting a span ID through baggage, but it doesn't result in weird behaviour like having a heap of concurrent spans under the cron parent span. It's very similar to your idea with the UUID, but a lot of the trace aggregators/visualisers will include links in the UI and make them a bit easier to visualise overall.

If any given message has a different possible path through your system, I'm tempted to say that it's better to care only about the reliability of each service. It should be pretty easy to make that happen dynamically, though, right? Have a reliability metric labelled with the service name and then have an alert on that metric that just includes the label. Would be much quicker to identify where the problem is happening and start troubleshooting, unless I'm missing something?
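The shape of the links approach can be sketched with a stdlib-only toy model (this is not the real OpenTelemetry API, and the span names are made up for illustration; in the real API it corresponds to passing `links=[Link(cron_span_context)]` when starting each step's root span):

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    """Toy stand-in for an OTel span: its own trace, plus optional links."""
    name: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    links: list = field(default_factory=list)  # span_ids of linked spans


# One span for the cron invocation itself
cron = Span("cron.invocation")

# Each batch step gets its OWN trace (fresh trace_id), but carries a link
# back to the cron span -- the span ID you'd persist through baggage.
steps = [Span(f"step.{i}", links=[cron.span_id]) for i in range(3)]

# Every step is a separate trace, not a concurrent child under the cron parent...
assert all(s.trace_id != cron.trace_id for s in steps)
# ...yet each one can still be followed back to the specific invocation.
assert all(cron.span_id in s.links for s in steps)
```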
|
# ? Apr 24, 2023 08:21 |
|
|
|
whats for dinner posted:For asynchronous and batch processes I think the way to handle it is with links rather than a parent/child relationship ... a lot of the trace aggregator/visualisers will include links in the UI and make them a bit easier to visualise overall.

I edited my post a bit (most notably to correct bad math w/ my sampling formula - don’t do algebra at 3am kids). Put simply, reliability for any message M != reliability of all services through which M *might* traverse. I can easily measure reliability per service. But the more *useful* metric to provide to users (and customers!) is the success rate for the *average* message M, and possibly reliability for each *class* of M (which also have varying paths). It also helps us better guide engineering decisions by focusing on the more traversed nodes in our graph.

Links sound very useful here also, thanks.

The Iron Rose fucked around with this message at 08:38 on Apr 24, 2023
# ? Apr 24, 2023 08:30 |
|
The Iron Rose posted:... Put simply, reliability for any message M != reliability of all services through which M *might* traverse. I can easily measure reliability per service. But the more *useful* metric to provide to users (and customers!) is the success rate for the *average* message M, and possibly reliability for each *class* of M.

Right, got'cha. Before thinking about sampling, my first instinct would be to have an attribute that represents the final success or failure of a message, set either when it finishes the workflow successfully or fails for a terminal reason. Then if the class of the message is known in advance and already exists as an attribute on the trace, the reliability for a given class of M is easy to calculate and expose to users. For workflow engine traces I'd try to split it up by message class before filtering successes and failures (I'm pretty sure that's doable with the otel collector).
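That aggregation is simple once the terminal outcome and class exist as attributes; a stdlib-only sketch (the attribute names `message.class` and `workflow.outcome` and the sample data are made up for illustration):

```python
from collections import defaultdict

# Finished workflow traces, each stamped with its class and a terminal outcome
traces = [
    {"message.class": "order",   "workflow.outcome": "success"},
    {"message.class": "order",   "workflow.outcome": "success"},
    {"message.class": "order",   "workflow.outcome": "failure"},
    {"message.class": "invoice", "workflow.outcome": "success"},
]


def reliability_by_class(traces):
    """Success rate per message class, from terminal-outcome attributes."""
    totals, successes = defaultdict(int), defaultdict(int)
    for t in traces:
        cls = t["message.class"]
        totals[cls] += 1
        successes[cls] += t["workflow.outcome"] == "success"
    return {cls: successes[cls] / totals[cls] for cls in totals}


print(reliability_by_class(traces))  # order ≈ 0.67, invoice = 1.0
```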
|
# ? Apr 24, 2023 08:42 |
|
The Iron Rose posted:... Put simply, reliability for any message M != reliability of all services through which M *might* traverse. I can easily measure reliability per service. But the more *useful* metric to provide to users (and customers!) is the success rate for the ... Presenting a metric for customers is a completely different goal than internal engineering/troubleshooting, and you shouldn't handle them the same way. For example you might have an initial customer request create a bunch of failed traces, but if the automatic failure handling was good enough then maybe the customer didn't notice -- should that transaction count against your metric? Probably not.
|
# ? Apr 25, 2023 23:55 |
|
We did a 2-day exercise in my company about how to measure reliability of services, and one key takeaway was this: the client and the user aren't the same thing, and you need to be careful that you understand which one you're aiming at when you talk about the reliability of an interaction using occurrence-based measurements. Like, the second a client adds retries to a call, a lagging indicator of absolute success rates becomes a leading indicator of risk that the client interaction will become unsuccessful, without you even realizing it.
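The retry effect is easy to see numerically; a small sketch (the success rates are made-up numbers, and attempts are assumed independent):

```python
def client_success(per_call: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent calls succeeds."""
    return 1 - (1 - per_call) ** attempts


# Occurrence-based measurement sees the per-call rate; the client sees retries.
per_call = 0.90
print(client_success(per_call, 1))  # what your success-rate metric shows (0.90)
print(client_success(per_call, 3))  # what a client with 3 attempts experiences (~0.999)

# The per-call rate can degrade sharply while the client barely notices:
degraded = 0.70
assert client_success(degraded, 3) > 0.97
# ...so the absolute success rate stops being a lagging indicator of user pain
# and becomes a leading indicator of risk: retries quietly absorb the failures.
```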
|
# ? Apr 26, 2023 15:48 |
|
https://github.com/k8sgpt-ai/k8sgpt
|
# ? Apr 28, 2023 03:34 |
Just what I want right on the command line: an LLM that hallucinates answers to questions
|
|
# ? Apr 28, 2023 03:42 |
Perfect, now I have a watertight excuse for my sloppy work.
|
|
# ? Apr 28, 2023 03:49 |
|
Oh boy, the client forgot to pay Atlassian so all our documentation is going *poof* come Sunday night and boy the support team sure isn't responding to requests to grant export access to our spaces. https://www.youtube.com/watch?v=HdKqAVpUOwI
|
# ? Apr 28, 2023 03:49 |
|
Warbird posted:Oh boy, the client forgot to pay Atlassian so all our documentation is going *poof* come Sunday night and boy the support team sure isn't responding to requests to grant export access to our spaces. Could you use the CLIget extension and then wget the whole site?
|
# ? Apr 28, 2023 05:23 |
|
Saukkis posted:Could you use the CLIget extension and then wget the whole site? No. He can’t. We need content.
|
# ? Apr 28, 2023 05:25 |
I've been using docker environment files like .env.local and .env.prod to handle things like an ALLOWED_IP that ends up in an allow statement in an nginx config file via envsubst. It works fine, but it seems cumbersome with lists of things, like if I want to have ALLOWED_IPS=192.168.5.1,172.16.3.1 that translates into multiple allow lines in nginx:
code:
allow 192.168.5.1;
allow 172.16.3.1;
|
|
# ? Apr 29, 2023 20:11 |
|
fletcher posted:I've been using docker environment files like .env.local and .env.prod to handle things like an ALLOWED_IP that ends up into an allow statement in an nginx config file via envsubst. It works fine but it seems cumbersome with lists of things, like if I want to have ALLOWED_IPS=192.168.5.1,172.16.3.1 that translates into multiple allow lines in nginx:
|
# ? Apr 29, 2023 23:59 |
|
fletcher posted:... it seems cumbersome with lists of things, like if I want to have ALLOWED_IPS=192.168.5.1,172.16.3.1 that translates into multiple allow lines in nginx

To get around the "ALLOWED_IP_0/ALLOWED_IP_1/etc." problem, have your main nginx conf pull in those allowlists via file "include" statements, and set a unique IP allowlist conf file per environment.

You could then generate the per-environment allowlist conf file via env vars and a template system (I've done it with golang templates in kubernetes init containers when that was most appropriate), but another approach would be to have all the various allowlist conf files in source code and use an env var to declare which one is used for the env.
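A minimal sketch of the generate-then-include approach, assuming a comma-separated `ALLOWED_IPS` variable (plain Python string formatting standing in for a real template system; the trailing `deny all;` is an assumption about the desired policy):

```python
def render_allowlist(env: dict) -> str:
    """Expand ALLOWED_IPS=ip1,ip2,... into one nginx `allow` line per IP."""
    raw = env.get("ALLOWED_IPS", "")
    ips = [ip.strip() for ip in raw.split(",") if ip.strip()]
    lines = [f"allow {ip};" for ip in ips]
    lines.append("deny all;")  # assumed default-deny policy
    return "\n".join(lines) + "\n"


# e.g. run at container start, write the result to a file, and have the
# main nginx conf do `include /etc/nginx/allowlist.conf;`
conf = render_allowlist({"ALLOWED_IPS": "192.168.5.1,172.16.3.1"})
print(conf)
# allow 192.168.5.1;
# allow 172.16.3.1;
# deny all;
```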
|
# ? Apr 30, 2023 02:49 |
|
fletcher posted:... it seems cumbersome with lists of things, like if I want to have ALLOWED_IPS=192.168.5.1,172.16.3.1 that translates into multiple allow lines in nginx

For local I would have the container import from a .csv or maybe a flat file (dealer's choice), and then if prod=0, feed the csv into an nginx Jinja template and export to wherever. For prod, if prod=1, read from an S3 bucket or your favorite KV store like Redis; either hard-code the url in the container or pass it in as a second variable.

Yeah, there's no standard way of doing this. It's very common to have a bash script with a couple of switch statements to handle local vs prod. The problem is that developers trying to extend the container get confused by the switch statements and don't like asking for help, and now "docker is bad because the necessary evil of complex configuration made me confused". Probably there should be an annual project to unfuck the way config is loaded into the container, as it'll inevitably build up cruft. In theory helm solves this with sops, but that's a different can of worms.
|
# ? May 1, 2023 08:30 |
|
If I'm not building for reusability by some downstream consumer, I'd probably just have some kind of config management generating whole Nginx config files for the container to pull in
|
# ? May 1, 2023 15:10 |
|
What can I do to create a daily record of our AWS infrastructure, so I can look back at previous states and correct mistakes? I'm talking complete stare-and-compare, nothing about automated restore. This would probably be trivial if we had some kind of Terraform workflow but we don't. Having this will finally get my boss to trust me with more permissions.
|
# ? May 4, 2023 19:37 |
|
run cloudformer every time you make a change?
|
# ? May 4, 2023 19:43 |
|
I think you can get this poo poo out of AWS Config although it's not going to be in a super digestible format out of the box
|
# ? May 4, 2023 19:56 |
Zapf Dingbat posted:What can I do to create a daily record of our AWS infrastructure, so I can look back at previous states and correct mistakes? I'm talking complete stare-and-compare, nothing about automated restore. Do you have an example of a mistake you want this process to be able to address?
|
|
# ? May 4, 2023 20:02 |
|
Docjowles posted:I think you can get this poo poo out of AWS Config although it's not going to be in a super digestible format out of the box
|
# ? May 4, 2023 20:02 |
|
fletcher posted:Do you have an example of a mistake you want this process to be able to address? I have some routes and security groups that I changed or deleted, but I need to bring them back to what they looked like yesterday.
|
# ? May 4, 2023 20:10 |
Zapf Dingbat posted:I have some routes and security groups that I changed or deleted, but I need to bring them back to what they looked like yesterday. Sounds like you definitely need to get some IaC going with terraform!
|
|
# ? May 4, 2023 20:28 |
|
|
# ? May 5, 2023 15:01 |
We have a cron job that uses the AWS CLI to export resources & config to CSV files every day, I believe for some sort of compliance purpose. Something like that might be useful for your purposes. CloudTrail can also give you insight into changes over time, though it's more of an audit trail, so I think it would be quite cumbersome to try to revert something to the way it was a few days ago.
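Once daily snapshots exist (from a CLI export, AWS Config, or describe calls), the stare-and-compare part is just a dict diff; a sketch with made-up snapshot data, keyed by resource ID:

```python
def diff_snapshots(old: dict, new: dict) -> dict:
    """Compare two {resource_id: config} snapshots taken on different days."""
    return {
        "added":   sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


yesterday = {
    "sg-01": {"ingress": ["10.0.0.0/8:443"]},
    "sg-02": {"ingress": ["0.0.0.0/0:22"]},
}
today = {
    "sg-01": {"ingress": ["10.0.0.0/8:443", "0.0.0.0/0:443"]},  # rule added
    "sg-03": {"ingress": []},                                   # sg-02 replaced
}

print(diff_snapshots(yesterday, today))
# {'added': ['sg-03'], 'removed': ['sg-02'], 'changed': ['sg-01']}
```

The "changed" entries tell you which resources to stare at; the old snapshot's config for those keys is exactly what you'd re-enter to restore yesterday's state by hand.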
|
|
# ? May 5, 2023 17:27 |
|
Recruiter emails have gotten few and far between. ChatGPT and Stable Diffusion seem to be the current hot technology, and 4/5 companies lately have some kind of "AI" component to them, just like how everyone was using blockchain in some capacity three years ago.

Anyways, any suggestions on how you'd tune your devops/infrastructure to better meet "AI" needs? The latest company I talked to is doing low-latency text to speech. Do you just treat these nodes as heavy duty cpu/memory machines, or is there some additional magic sprinkled in here?
|
# ? May 16, 2023 00:45 |
|
I just click a couple buttons to deploy an openai instance in azure if architecture approves it. Maybe we'll do a terraform module when the resource is added to the provider.
|
# ? May 16, 2023 00:59 |
|
Hadlock posted:Recruiters emails have gotten few and far between. ChatGPT and Stable Diffusion seem to be the current hot technology and 4/5 companies lately have some kind of "AI" component to them just like how everyone was using blockchain in some capacity three years ago.
|
# ? May 16, 2023 01:14 |
|
Thanks. It didn't occur to me that Azure would have a $ endpoint service for GPT so quickly. I probably should sit down and play around with that stuff; haven't touched Azure beyond a few minutes to tweak one thing a million years ago.

New question - how would you auto provision a Jupyter notebook, with cloud backup of the... I guess storage backend for those. I was talking about autoprovisioning helm charted monolith SAAS stuff during an interview, and one of the interviewers started speculating on how to autoprovision this for new analyst types.
|
# ? May 18, 2023 19:30 |
|
Hadlock posted:Thanks. It didn't occur to me that Azure would have a $ endpoint service for GPT so quickly. ...

ms is a significant investor in openai, I would have been more surprised if they didn't turn it into an azure service
|
# ? May 18, 2023 20:02 |
|
Oh yeah, I'm aware of the investment, I just didn't realize how fast they went from "this is a late stage research project" to "production ready, gently caress yeah here we go". I guess since it's got a big asterisk (*product is known to hallucinate, regularly), there's not a lot of QA that needs to go into the product to make it "production ready", vs something like fargate which is long-lived and runs customer code.
|
# ? May 18, 2023 20:19 |
There are a lot of gates around using it, not sure it’s really production ready but I have a client who is about to find out how well it can sell pants
|
|
# ? May 18, 2023 20:30 |
|
I think it's in some sort of private preview still? We had to get our account rep involved to turn it on for the subscriptions we wanted it on at least
|
# ? May 18, 2023 20:56 |
I’m not directly involved but as I understand it they’re performing ethics reviews and other AI-related things to ensure applicability and non skynetness
|
|
# ? May 18, 2023 20:59 |
|
gently caress Jenkins
|
# ? May 20, 2023 15:21 |
|
Warbird posted:gently caress Jenkins
|
# ? May 20, 2023 15:57 |
|
Warbird posted:gently caress Jenkins
|
# ? May 20, 2023 16:05 |
|
Jenkins is terrible. But still somehow also the best option.
|
# ? May 22, 2023 16:16 |
|
From a "get it done right loving now, we'll worry about tech debt in 2-5 years" standpoint, jenkins is not terrible. It beats the hell out of a bunch of undiscoverable, undocumented scripts running from some undocumented unix user's personal cron, with credential files/env vars stored haphazardly on the disk. If you point it at pipelines on a private git repo it's almost not embarrassing.

Jenkins is bad, but tending a garden of undocumented cron jobs with a broken hand-built "job failure alert" system is much, much worse.
|
# ? May 22, 2023 17:52 |
|
|
|
jenkins is completely irredeemable unless the context is "you've hired me, jenkins is what i know, and we can't justify spending the time for me to learn something new"

if that's you, that's fine, and it's a reasonable decision for an organization to make. let's stop defending jenkins in the abstract though.
|
# ? May 22, 2023 18:23 |