12 rats tied together
Sep 7, 2006

Are you sure about that? I can't reproduce that behavior with ansible 2.8: if I run a play against real, remote hosts and set connection: local, ansible happily gathers facts from my laptop.

The only way I can get ansible to behave the way you describe is to override the connection at the task level, not play or host, which is expected: the play is still executing in the context of a host using the ssh connection plugin, and just one particular task in the play happens to be using a different plugin. There's a section in the documentation that obliquely describes this behavior under "delegated facts".

The last time I was configuring network devices, the modules all supported a "provider" task parameter, which gave you an additional place to set connection arguments. We'd still run "local playbooks" that gather facts from our laptops (usually this is turned off because it is useless), but we'd inject group and host variables into the provider's connection block to basically "run locally, but ssh with these options". I see that this is now deprecated, though, and they recommend setting ansible_connection to network_cli, which does sound much better.
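For anyone following along, the new style boils down to a couple of group vars, roughly like this (group name and values are invented, so treat it as a sketch rather than a drop-in):
code:
# group_vars/vyos.yaml -- hypothetical group
ansible_connection: network_cli
ansible_network_os: vyos
ansible_user: automation
# no more provider block; tasks now run against the device over ssh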

In my current role the only network devices I've configured have been cumulus linux devices, which you just SSH to like normal, and vyos devices where I am still using the deprecated "local playbook, remote provider" block.

I don't think it's very accurate to say that there is a "best" way to set a behavioral inventory parameter like ansible_connection. There are many ways to set these values, which result in different behavior, and they are all equally valid and useful unless you go out of your way to misuse them, for example, repeatedly setting an identical host variable instead of just using a group.


Pile Of Garbage
May 28, 2007



12 rats tied together posted:

Are you sure about that? I can't reproduce that behavior with ansible 2.8: if I run a play against real, remote hosts and set connection: local, ansible happily gathers facts from my laptop.

The only way I can get ansible to behave the way you describe is to override the connection at the task level, not play or host, which is expected: the play is still executing in the context of a host using the ssh connection plugin, and just one particular task in the play happens to be using a different plugin. There's a section in the documentation that obliquely describes this behavior under "delegated facts".

Whoops yeah my bad, forgot you can set the connection on plays. Also I've been doing too much network automation stuff where you always set gather_facts: false.

12 rats tied together posted:

I don't think it's very accurate to say that there is a "best" way to set a behavioral inventory parameter like ansible_connection. There are many ways to set these values, which result in different behavior, and they are all equally valid and useful unless you go out of your way to misuse them, for example, repeatedly setting an identical host variable instead of just using a group.

The advantage I found with setting ansible_connection on hosts in the inventory is that it's easier to override elsewhere thanks to variable precedence.

The last Ansible thing I was working on was playbooks to facilitate management configuration cut-over of IOS devices, with the kicker being that it needed to support SSH and Telnet (legacy garbage). I created a custom inventory script that would read a CSV file from S3 and spit out an inventory of devices. The script would also do a TCP socket test against each device to determine whether it was reachable via SSH or Telnet. If a device was reachable via SSH then ansible_connection would be set to network_cli on the inventory host; if it was only reachable via Telnet it would be set to local. Finally, in the playbook roles I used conditionals that would inspect the value of ansible_connection and then apply configuration either using the IOS modules or using the telnet module (gross, but it worked).
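The guts of the inventory script were roughly along these lines (paraphrased from memory; the bucket, CSV layout, and group names are all made up, so it's a sketch rather than the real thing):
code:
#!/usr/bin/env python3
# Sketch of a dynamic inventory: read devices from a CSV in S3, probe SSH/Telnet
# reachability, and print an Ansible JSON inventory on stdout.
import csv
import io
import json
import socket

import boto3  # assumes AWS credentials are already available in the environment

BUCKET, KEY = "example-bucket", "devices.csv"  # hypothetical names


def port_open(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def main():
    body = boto3.client("s3").get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    inventory = {"ios_devices": {"hosts": []}, "_meta": {"hostvars": {}}}

    for row in csv.DictReader(io.StringIO(body.decode("utf-8"))):
        host = row["hostname"]
        if port_open(host, 22):
            conn = "network_cli"   # manageable over SSH with the IOS modules
        elif port_open(host, 23):
            conn = "local"         # fall back to the telnet module
        else:
            continue               # unreachable, leave it out of the inventory
        inventory["ios_devices"]["hosts"].append(host)
        inventory["_meta"]["hostvars"][host] = {"ansible_connection": conn}

    print(json.dumps(inventory))


if __name__ == "__main__":
    main()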

In addition the playbook was also responsible for updating our in-house CMDB via a REST API. I did that using the uri module and thanks to precedence I could override connection at the task level.

Edit: wow, that's a lot of words about stuff that is barely tangentially related to CI/CD. I'll shut up for a bit.

Pile Of Garbage fucked around with this message at 08:47 on Sep 22, 2019

Necronomicon
Jan 18, 2004

Can anybody provide some conventional wisdom re: Terraform backends in AWS? Specifically regarding things like multiple managed environments. Should each deployment have its own specific S3 bucket and DynamoDB table? For instance, I currently have four deployments - Company A Staging, Company A Production, Company B Staging, and Company B Production. Is there a clever way of keeping all of those state and lock files in the same location to keep things nice and clean, or is it better for them each to have their own isolated environment?

I started from scratch about a month ago, mostly working off the Gruntwork Terraform guides, so I might have missed some important bits along the way.

Docjowles
Apr 9, 2009

Sounds like you're in a consulting role? It seems extremely prudent to store the state files for different companies in distinct S3 buckets (or whatever your cloud provider calls this) at minimum, if not entirely separate accounts. That way some sort of fuckup or credential breach for one client at least doesn't affect the others. If a consultant came to us with "hey uh I was doing some work for another client and accidentally deleted your state files lmao, whoops" they wouldn't be employed any longer.

I don't know that you need to go to that level of paranoia for staging vs production within one company, though they should definitely each have their own state files so stg can't gently caress up prod.

Bhodi
Dec 9, 2007

Oh, it's just a cat.
Pillbug

Necronomicon posted:

Can anybody provide some conventional wisdom re: Terraform backends in AWS? Specifically regarding things like multiple managed environments. Should each deployment have its own specific S3 bucket and DynamoDB table? For instance, I currently have four deployments - Company A Staging, Company A Production, Company B Staging, and Company B Production. Is there a clever way of keeping all of those state and lock files in the same location to keep things nice and clean, or is it better for them each to have their own isolated environment?

I started from scratch about a month ago, mostly working off the Gruntwork Terraform guides, so I might have missed some important bits along the way.
Absolutely use separate state files or you WILL be sorry. In AWS, use dynamo and S3. You can share dynamo tables (the key is per root module name) and share s3 buckets (the key is the directory name within the bucket) for cleanliness. All you need is

backend.config posted:

bucket = "my-s3-bucket-name"
dynamodb_table = "terraform-state-lock-dynamo"
key = "whatever-name-you-want-maybe-module-name-but-this-becomes-a-dirname-in-bucket"

backend.tf posted:

terraform {
  backend "s3" {
    encrypt = true
    region  = "your-favorite-aws-region"
  }
}
Then you just run "terraform init -backend-config backend.config -upgrade" as normal


In fact, I advocate for multiple state files within the env. Like, if you are using terraform to deploy EVERYTHING, I highly HIGHLY suggest you isolate your vpc, security group, and IAM stuff from ec2 instance deployment and management. Ignore this at your own peril.
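Concretely that just means separate root modules, each with its own backend key, something like (names invented):
code:
infra/
  network/   # vpc, subnets, security groups -> key = "network"
  iam/       # roles, policies               -> key = "iam"
  compute/   # ec2, asgs, launch templates   -> key = "compute"
Each directory gets its own backend.config key and its own plan/apply, so blowing up an instance deploy can't touch your VPC or IAM.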

Necronomicon
Jan 18, 2004

I suppose that makes sense, yeah - I was going to try to use a single (previously defined) bucket with multiple subfolders like so:
code:
terraform {
  backend "s3" {
    bucket = "${var.tf_s3_bucket}"
    region = "${var.region}"
    key    = "${var.customer_name}/${var.product_name}/${var.environment_name}"
  }
}
...but Terraform yelled at me, since apparently you can't use variables or expressions within a backend config. So I'm stuck hard-coding (at the very least) the key for every single deployed environment, which annoys the hell out of me.

whats for dinner
Sep 25, 2006

IT TURN OUT METAL FOR DINNER!

Can only really speak for how we do it, but each environment gets its own S3 bucket for state; then we define databases and persistent storage in their own workspace, with the rest of the application going in a different workspace. Each environment gets its own DynamoDB table for state locking, but that's just a side effect of each environment having its own account.

At the very least I definitely agree that each company should have its own account and maybe even separate production and non-prod environments into their own accounts too

Edit: pass in backend variables on the command line when you init: https://www.terraform.io/docs/backends/config.html under "partial config"
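i.e. something like this per environment (bucket and key names made up):
code:
terraform init \
  -backend-config="bucket=companya-staging-tfstate" \
  -backend-config="key=companya/product/staging.tfstate" \
  -backend-config="dynamodb_table=terraform-lock" \
  -backend-config="region=us-east-1"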

whats for dinner fucked around with this message at 20:23 on Sep 23, 2019

Bhodi
Dec 9, 2007

Oh, it's just a cat.
Pillbug

Necronomicon posted:

...but Terraform yelled at me, since apparently you can't use variables or expressions within a backend config. So I'm stuck hard-coding (at the very least) the key for every single deployed environment, which annoys the hell out of me.
Hopefully your deployments are in different directories that only contain your environment variables, and then you import submodules that actually do the work... If that's the case, all you need is one separate backend file per directory. Which is annoying, yeah, but doesn't even rank in the top 10 terraform annoyances.
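i.e. each deployment directory ends up being little more than this, plus its backend.config (names invented):
code:
# companya-staging/main.tf -- hypothetical deployment directory
terraform {
  backend "s3" {
    encrypt = true
    region  = "us-east-1"
  }
}

module "product" {
  source      = "../modules/product"
  environment = "staging"
  customer    = "companya"
}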

Necronomicon
Jan 18, 2004

Bhodi posted:

Hopefully your deployments are in different directories that only contain your environment variables, and then you import submodules that actually do the work... If that's the case, all you need is one separate backend file per directory. Which is annoying, yeah, but doesn't even rank in the top 10 terraform annoyances.

That's definitely the case. I have a modules directory that contains the setup for each of our two flagship products (we're a SaaS company, more or less), and then another directory per deployment of said product that calls those modules and just contains environment variables and the backend definition.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
You can always try Terraform Cloud if you’re not regulated and such to avoid state file and provider mambo.

Matt Zerella
Oct 7, 2002

Norris'es are back baby. It's good again. Awoouu (fox Howl)
How do you go about breaking up a monolithic terraform project that you deployed already with an S3 backend?

Just create the subfolders, copy/paste, and terraform plan with different keys defined in the bucket that mimic the path?

Asking for a dumb friend who did a monolithic deploy. No, not me. Why do you think that? (I'm learning and this isn't super critical, but I'd like to follow best practices here to avoid problems down the road.)

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
So you'll want to break up your Terraform project into different state boundaries that make sense for your situation, decomposing the parts as appropriate for your deployment/update needs, and then try https://medium.com/@lynnlin827/moving-terraform-resources-states-from-one-remote-state-to-another-c76f8b76a996

12 rats tied together
Sep 7, 2006

The short version of the medium article is that the commands you are looking for are all under "terraform state". Hopefully you can get by with "mv"; if you're unlucky like me, you might need to do a bunch of "rm" followed by "import". It's a really tedious process, but it is at least possible, which is one of the big reasons to use terraform IMO.
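For reference, the dance looks roughly like this, run against pulled local copies of both states (addresses and IDs are made up, so treat it as a sketch):
code:
# in each root module, pull the remote state down to a local file
terraform state pull > old.tfstate     # in the old monolith project
terraform state pull > new.tfstate     # in the new, smaller project

# happy path: move the resource between the local copies, then push them back
terraform state mv -state=old.tfstate -state-out=new.tfstate \
    aws_security_group.app aws_security_group.app
terraform state push new.tfstate       # and push the modified old.tfstate too

# unlucky path: drop it from the old state and re-import it into the new one
terraform state rm aws_security_group.app                      # old project
terraform import aws_security_group.app sg-0123456789abcdef0   # new project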

Keep in mind that you can go too far when breaking up a monolith. Terraform constantly asks you to make a tradeoff between usability and safety: every new state you create is a new entry in a dependency graph that Terraform is not designed to interact with at all. The biggest landmine here is that you should never have a module that invokes another module itself; instead, invoke all of your modules in your parent state and pass data between them inside that parent state.
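In code terms, instead of module A calling module B internally, the parent state wires the two together and passes outputs across (names made up):
code:
module "network" {
  source = "./modules/network"
}

module "app" {
  source     = "./modules/app"
  vpc_id     = "${module.network.vpc_id}"
  subnet_ids = "${module.network.private_subnet_ids}"
}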

This wasn't in the documentation for a very long time and is one of the reasons why I've spent an unreasonable amount of time running "terraform state" subcommands.

Pile Of Garbage posted:

Also I've been doing too much network automation stuff where you always set gather_facts: false.

The advantage I found with setting ansible_connection on hosts in the inventory is that it's easier to override elsewhere thanks to variable precedence.

Edit: wow, that's a lot of words about stuff that is barely tangentially related to CI/CD. I'll shut up for a bit.

Network automation stuff in ansible usually does require a mindset shift, IMO. I think one of the weaknesses of ansible as a tool is that it lends itself really well to a monorepo approach, and there are a ton of things you set up/configure and stuff into .gitignore, ansible.cfg, etc, inside your monorepo and then totally forget about until you change jobs. The first like 2-3 days of creating the ansible repository at my current employer was just me going "wait, what the gently caress?" and reading documentation and github issues for 45 minutes, fixing one thing, and then repeating until it behaved like I remembered.

Re: host/group vars, totally agreed. That is definitely using them as intended and as I mentioned in a previous post, the existing mechanisms for variable precedence are extremely robust and powerful. If you use them properly you do not need any other configuration store for your IaC, period. The biggest feature gap is there's no easy way for an application your org manages to "pull" from ansible (there are a couple of not-easy ways though) -- in this case though you can trivially have ansible write application configuration to redis, consul, or whatever. Ansible is a great tool for this because it natively stores secrets, everything runs from git branches so it is natively versioned, and in general it just fits neatly into the "github workflow but for infrastructure" thing that most people seem to be doing these days.
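e.g. a task along these lines is all it takes to publish something from group_vars into consul where the app can read it (key path is made up):
code:
- name: publish app settings to consul
  consul_kv:
    key: "config/myapp/database_url"      # hypothetical key
    value: "{{ myapp_database_url }}"     # straight out of group_vars / vault
  delegate_to: localhost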

As you know, there are other ways to set variables than checking them into /group_vars/mygroup/somefile.yaml. I think I misspoke here though because regardless of where or how you set these variables, they still exist inside the overall "group_vars" variable plugin. Setting them in other places than a group file is still just using them normally, so it's not really worth mentioning specifically. :shobon:

I also think managing network device config is one of the more extreme examples of CI/CD and is totally appropriate for the thread, which is pretty clearly evidenced by the amount of imperative logic you needed to support in order to sanely manage your configs across various devices and then update a CMDB at the end. Running "rubocop whatever" and then "cap deploy production" inside a jenkins agent is frankly boring by comparison.

LochNessMonster
Feb 3, 2005

I need about three fitty


Apparently our SonarQube license is based on a server ID, which changes for each machine. I'd like to move this away from a 24/7 machine, as we only use SQ during business hours, which means terminating the machine saves me 100ish hours of compute time each week. How do you typically handle these kinds of licenses with regard to immutable infra?

Gyshall
Feb 24, 2009

Had a couple of drinks.
Saw a couple of things.
We run SQ in a container and don't have that issue. What's your setup look like?

Docjowles
Apr 9, 2009

Is anyone here running Prometheus for metric gathering and querying? We use it on a small scale for monitoring our Kubernetes pods. We have some developers who want to start expanding the usage, exposing and scraping far more detailed metrics from their apps and keeping them around for an extended time (1 year+ vs the 4 weeks we currently have configured). The guy on my team who has spent the most time on Prometheus stuff is pushing back VERY HARD on this and I'm looking for other experiences. His main concern is that (according to him) it's not a tool designed for this purpose; it's meant to store and query only a small set of data that quickly ages out. So the whole app will quickly blow up and die if we, say, multiply the amount of stuff we're storing by 100x.

I also see that there are a variety of external backends and adapters you can use that are explicitly for long term storage, but he again feels they are bandaids and the core of Prometheus will never handle this use case well.

Is this true? Was it ever true? This has the smell of outdated "conventional wisdom" to me that was the case in Prometheus 0.1 beta or something but no longer holds. Like, there have to be companies out there with very large scale Prometheus deployments at this point. And running both Prometheus plus a whole secondary metrics pipeline (statsd or what have you) for anything you want to keep for months rather than weeks also seems insane. But I am a novice in this area so maybe I am entirely off base and he's right.

Very interested in anyone else's experiences here.

12 rats tied together
Sep 7, 2006

My employer uses it for a fairly large deployment, I guess, somewhere over 5000 nodes and around another thousand-ish containers. We contracted a 3rd party to host our long term storage because we didn't want to care about it. It's hard for me to enthusiastically recommend the product, but there's nothing wrong with it. The query language is pretty awful (but works) and the alarming system is not great (but works).

At its core prometheus is just a pull-based time series database and query language; there's nothing fundamental about it that makes storing the time series data for longer than a couple of hours any more difficult than any other time series database.
He could be saying that the labeling system (every unique combination of labels results in a unique time series) makes it particularly ill-suited for wider use at your organization? It seems like either he might be mistaken or there might be a disconnect here: you can linearly scale the number of metrics you're storing by 100 without issue.

What you need to be careful of is introducing a cardinality bomb, for example by adding a source IP address label to a web traffic metric along with something like HTTP path. Since every unique combination of labels is a unique time series, you'll end up multiplying the number of series you're storing, and then yeah, prometheus will explode and it will be bad. Prometheus exposes a "scrape_samples_scraped" metric itself and highly encourages you to watch and alarm on it, but just by the nature of pull-based alarming, pushing a bad config that cubes your stored metrics will be immediately harmful. "Rolling back" a label change is also often annoying, because adding a new label will immediately split your data into a new time series for every value of that label.
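If you want to check whether someone has already snuck a cardinality bomb past you, a query along these lines in the prometheus console does the trick:
code:
# top 10 metric names by number of time series; huge counts are your bombs
topk(10, count by (__name__)({__name__=~".+"}))

# per-target sample counts, the thing worth watching and alarming on
scrape_samples_scraped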

For this reason I really think you need context for prometheus, it's not something you can have 10 people spend 10% of their time on. Reviewing changes to labeling, scrape configs, routing rules, relabeling if applicable, deploying those changes, and then watching and responding to them and all of the other prometheus stats can be a full time job at even small scale.

If you use AWS I would recommend putting everything into cloudwatch instead.

LochNessMonster
Feb 3, 2005

I need about three fitty


Gyshall posted:

We run SQ in a container and don't have that issue. What's your setup look like?

Currently it’s at an MSP on a virtual server with a dedicated DB server. Looking to deploy it to an EKS cluster.

Last time the MSP migrated it to a new server we had to request a new license because the new install used a new Server ID which caused the old license to stop working.

Cerberus911
Dec 26, 2005
Guarding the damned since '05
Besides what has already been said about keeping an eye on metric cardinality, I would also be careful about how much data you have and how many prometheus servers you run.

We have a 3 month retention period and I've found queries that span more than 8 weeks can take quite a while. In total I think our prometheus server is using around 500GB for storing those time series.
Another problem is that the same server is responsible for both scraping the data and responding to user queries. As the number of users scales up you'll find the queries take up the most processing power.

If going to a one year retention period I would not keep the current setup. I would switch to a long-term query storage system like thanos and point all the queries to that. The prometheus servers would then be mainly responsible for scraping metrics, and would have a short retention period.
Not sure how well such a setup would work since I haven't tried it yet.

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison
IMO, depending on the size of your team/org, running your own metrics infra is going to be pain and sadness

freeasinbeer
Mar 26, 2015

by Fluffdaddy
Just don’t get datadog. They’ll nickel and dime you.

Cerberus911
Dec 26, 2005
Guarding the damned since '05

freeasinbeer posted:

Just don’t get datadog. They’ll nickel and dime you.

We are switching to datadog for logs and some of their tracing features. How bad are they?

freeasinbeer
Mar 26, 2015

by Fluffdaddy

Cerberus911 posted:

We are switching to datadog for logs and some of their tracing features. How bad are they?

The agent software is decent, but so are signalfx's and sysdig's. Datadog's UI is probably the best, with signalfx close behind. Sysdig has the most advanced stuff because of the direction they went, but their UI doesn't reflect that, and since the UI is the end user's primary impression, it carries the most weight.

Where Datadog really sticks it to you is overage costs, and their host-based pricing model is really terrible. Not only does it provide a very limited quota for whatever you buy out of the box, it will do stuff like charge you the per-host price for anything it can see in AWS, even if that just means it's scraping cloudwatch. $15 to scrape cloudwatch per ELB or RDS instance is really terrible no matter how you slice it.

As for the logging, we looked at that, but it was gonna cost us $112,000 a month to store logs for 7 days at list price, versus what you can get for $10k a month using AWS-hosted ES.

It’s decent but it’s insanely expensive, and it’s not that good.

Multiple folks have pointed out that for the truckloads of money you drop on datadog you could use cloudwatch plus grafana for much cheaper.

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison
i have a lot of opinions about monitoring nee observability but since i work for a vendor they're obviously biased. here's my 0.02 in general though -

datadog is extremely expensive for what you actually get out of it.
signalfx is probably going to get bad post-splunk acquisition
most of the big monitoring/apm companies are more interested in selling skus so they'll come in cheap then get you as you scale, but a lot of newer stuff scales better.

the best thing you can do is build around oss instrumentation (opentelemetry if you can wait a bit, opencensus/opentracing today if you can't) because it'll let you transition from self-run stuff to saas solutions. if you don't have anything, start instrumenting your crap and pipe traces to jaeger/metrics to prometheus with like a week's retention and see how much better it makes daily ops life. ultimately the real power of the saas tools is that they can do a lot of stuff that you can't or won't build yourself - figuring out what's important automatically, helping you define SLO/SLI's, etc.
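and the bar for "start instrumenting your crap" is genuinely low - e.g. the metrics half with the python prometheus client looks something like this (metric names are made up; swap in whatever client your stack uses):
code:
from prometheus_client import Counter, Histogram, start_http_server

# the two metrics most services want: request count and latency, by route
REQUESTS = Counter("myapp_requests_total", "requests handled", ["route", "status"])
LATENCY = Histogram("myapp_request_seconds", "request latency", ["route"])

start_http_server(8000)  # exposes /metrics for prometheus to scrape

@LATENCY.labels(route="/checkout").time()
def handle_checkout():
    ...  # do the actual work
    REQUESTS.labels(route="/checkout", status="200").inc()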

12 rats tied together
Sep 7, 2006

One thing I want to do that I'm having trouble finding mention of in marketing materials is something akin to complex event processing. Basically I don't really care about average CPU utilization across a cluster of compute nodes, but while operating this cluster of compute nodes, and the application running on them, we've noticed a number of events that occur throughout the application lifecycle.

It would be really sick if I could ring my phone when we see one type of event happen and then we don't see any followup events of a different type inside the next 24 hours, for example.

Cancelbot
Nov 22, 2006

Canceling spam since 1928

On Datadog/APM in general we're currently trialling AppDynamics and Dynatrace as a thing to replace NewRelic; does anyone have opinions on these?

So far AppDynamics hooks you in with its "AI" baselines and fancy service map, but a lot of the tech seems old and creaky, like having to run the JVM to monitor Windows/.NET hosts, and weird phantom alerts where we got woken up at 2am. Dynatrace seems like a significantly more complete product and its frontend integration is stupidly good.

What also seems to work in Dynatrace's favour is its per-hour model vs AppDynamics' per-host, minimum-one-year model. I can roll the licensing in with our AWS bills for pay-per-use, but I need to see about getting the incentives applied to our account. The only thing I can't seem to find is an Insights-equivalent product.

Qtotonibudinibudet
Nov 7, 2011



Omsk dumbass, tell me, are you a junkie? I just live somewhere around there too; we could go do drugs together

Docjowles posted:

We have some developers who want to start expanding the usage, exposing and scraping far more detailed metrics from their apps and keeping them around for an extended time (1 year+ vs the 4 weeks we currently have configured).

The gently caress strange world do you exist in where developers are going to look at year-old operational metrics and glean useful information from them?

Most of the devs I work with can't remember the architecture of code they wrote a year ago, much less how it would impact metrics.

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

12 rats tied together posted:

One thing I want to do that I'm having trouble finding mention of in marketing materials is something akin to complex event processing. Basically I don't really care about average CPU utilization across a cluster of compute nodes, but while operating this cluster of compute nodes, and the application running on them, we've noticed a number of events that occur throughout the application lifecycle.

It would be really sick if I could ring my phone when we see one type of event happen and then we don't see any followup events of a different type inside the next 24 hours, for example.

I can’t think of anyone doing this (hell, I’d be vaguely surprised if google et al. were doing this)

Cancelbot posted:

On Datadog/APM in general we're currently trialling AppDynamics and Dynatrace as a thing to replace NewRelic; does anyone have opinions on these?

So far AppDynamics hooks you in with its "AI" baselines and fancy service map, but a lot of the tech seems old and creaky, like having to run the JVM to monitor Windows/.NET hosts, and weird phantom alerts where we got woken up at 2am. Dynatrace seems like a significantly more complete product and its frontend integration is stupidly good.

What also seems to work in Dynatrace's favour is its per-hour model vs AppDynamics' per-host, minimum-one-year model. I can roll the licensing in with our AWS bills for pay-per-use, but I need to see about getting the incentives applied to our account. The only thing I can't seem to find is an Insights-equivalent product.

I’m a bit more up on dynatrace than appd but a lot of it really depends on what you’re trying to do.

I’m curious if anyone itt has looked at Honeycomb/LightStep/Omnition.

freeasinbeer
Mar 26, 2015

by Fluffdaddy

CMYK BLYAT! posted:

The gently caress strange world do you exist in where developers are going to look at year-old operational metrics and glean useful information from them?

Most of the devs I work with can't remember the architecture of code they wrote a year ago, much less how it would impact metrics.

But at the same time they all want a year’s retention because it sounds cool.

The theory being that you could compare last year to this year, and that’s mostly about whether we’re scaled enough.

tortilla_chip
Jun 13, 2007

k-partite
Is there not a generic data retention policy you can hide behind since discovery is a thing?

Zorak of Michigan
Jun 10, 2006


freeasinbeer posted:

But at the same time they all want a year’s retention because it sounds cool.

The theory being that you could compare last year to this year, and that’s mostly about whether we’re scaled enough.

You don't need original live data for that though. You can summarize it in ever-larger batches and use the summary data for capacity planning. You want to know how Aug 2019 compared to Aug 2018, not how some random hour compared to the previous year.
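In prometheus terms that's recording rules: roll the interesting stuff up into a handful of cheap, low-cardinality series on a schedule and only keep those around long term. A sketch (rule names invented):
code:
groups:
  - name: capacity_rollups
    interval: 5m
    rules:
      # per-job request rate, cheap enough to keep for a year
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # one cluster-wide cpu utilization series instead of one per core
      - record: cluster:cpu_utilization:ratio
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))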

LochNessMonster
Feb 3, 2005

I need about three fitty


Dynatrace is a lot more mature than AppDynamics and NewRelic. I’d pick it any day over the others.

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

uncurable mlady posted:

I’m curious if anyone itt has looked at Honeycomb/LightStep/Omnition.

my team is currently having an absolutely awful time with our logging solution and are considering moving to honeycomb. i wasn't with the team when they did the demo but from what i heard it was great tech that was just a little too pricey to consider switching to

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

Blinkz0rz posted:

my team is currently having an absolutely awful time with our logging solution and are considering moving to honeycomb. i wasn't with the team when they did the demo but from what i heard it was great tech that was just a little too pricey to consider switching to

My understanding is that honeycomb is pretty inexpensive, but I don’t know their pricing model.

Hadlock
Nov 9, 2004

CMYK BLYAT! posted:

The gently caress strange world do you exist in where developers are going to look at year-old operational metrics and glean useful information from them?

Most of the devs I work with can't remember the architecture of code they wrote a year ago, much less how it would impact metrics.

If your product is highly db-driven and the business highly seasonal, the DBA is going to want annual metrics data to do capacity planning etc. At my last company the product was mostly idle except at the end of each quarter, plus a hot spot about 10 weeks before the end of the year.

Having that high-usage-period data is really helpful for capacity planning etc.

Also gently caress data dog. Prometheus 4 lyfe yo.

Methanar
Sep 26, 2013

by the sex ghost

Docjowles posted:

Is anyone here running Prometheus for metric gathering and querying? We use it on a small scale for monitoring our Kubernetes pods. We have some developers who want to start expanding the usage, exposing and scraping far more detailed metrics from their apps and keeping them around for an extended time (1 year+ vs the 4 weeks we currently have configured). The guy on my team who has spent the most time on Prometheus stuff is pushing back VERY HARD on this and I'm looking for other experiences. His main concern is that (according to him) it's not a tool designed for this purpose; it's meant to store and query only a small set of data that quickly ages out. So the whole app will quickly blow up and die if we, say, multiply the amount of stuff we're storing by 100x.

I also see that there are a variety of external backends and adapters you can use that are explicitly for long term storage, but he again feels they are bandaids and the core of Prometheus will never handle this use case well.

Is this true? Was it ever true? This has the smell of outdated "conventional wisdom" to me that was the case in Prometheus 0.1 beta or something but no longer holds. Like, there have to be companies out there with very large scale Prometheus deployments at this point. And running both Prometheus plus a whole secondary metrics pipeline (statsd or what have you) for anything you want to keep for months rather than weeks also seems insane. But I am a novice in this area so maybe I am entirely off base and he's right.

Very interested in anyone else's experiences here.

My new org operates what is almost certainly the largest prometheus/thanos deployment in the world.

Prometheus doesn't scale worth a poo poo, so we do some seriously aggressive sharding, with basically one prometheus server per node that we use as a buffer before offloading to s3; thanos is what we actually query through grafana.
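Mechanically that's a thanos sidecar next to every prometheus shipping TSDB blocks to the bucket, with queriers fanning out across them. Heavily abbreviated, it's roughly:
code:
# next to each prometheus instance
thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/var/lib/prometheus \
  --objstore.config-file=/etc/thanos/s3.yaml   # bucket, region, creds

# one or more of these answer the queries grafana sends
thanos query --store=<each sidecar / store gateway>:10901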

NihilCredo
Jun 6, 2011

suppress anger in every possible way: it alone will defame you more than many virtues will commend you

I'm looking for a lightweight CI/CD system for personal use.

I use Gitlab at work, it's great, but it would be a bit of a hog on the raspberry pi I just ordered, and 99% of its features are superfluous for a single user.

From some Googling, Drone seems to be the most popular platform for this kind of hobbyist project. Has anybody used it?

freeasinbeer
Mar 26, 2015

by Fluffdaddy

NihilCredo posted:

I'm looking for a lightweight CI/CD system for personal use.

I use Gitlab at work, it's great, but it would be a bit of a hog on the raspberry pi I just ordered, and 99% of its features are superfluous for a single user.

From some Googling, Drone seems to be the most popular platform for this kind of hobbyist project. Has anybody used it?

The dirt cheapest CI I know of is Google Cloud Build. Should be fast and about as much effort to set up as Drone.

Edit: Drone's not bad, but they are basically feature-equivalent. Both wrap simple yaml build pipelines that run every step in Docker. This model breaks a lot of folks' minds, I find, particularly when they find out they can build the binary without using a build container, with a dockerfile that is no more than a copy command of the final output.

freeasinbeer fucked around with this message at 13:20 on Sep 30, 2019

Cancelbot
Nov 22, 2006

Canceling spam since 1928

APM update: Loving Dynatrace and all its tagging stuff; we're going into production tomorrow and using it in anger.

Also on Honeycomb: our developers love the logging part and really want to buy it. They went to NDC and now want to explore structured logging in a big way. Shame the APM isn't great for .NET.

Cancelbot fucked around with this message at 14:25 on Sep 30, 2019


Rocko Bonaparte
Mar 12, 2002

Every day is Friday!
I'm trying to understand the GitLab Runner without formal training, particularly the configuration YAML file. It's been doing stuff in a way I didn't really expect. Most recently, I attempted to launch my tests in a two-stage process: setup and test. I defined these in a stages section like I saw in the documentation. The test section depends on the setup phase and runs much longer. It has commands in its after_script block that rely on collateral acquired in the setup phase. Just about everything ran as expected, but when it came time to run the after_script stuff, it couldn't find an application it had downloaded in setup. Actually, I should just try to paraphrase with an example:

code:
stages:
  - setup
  - test

setup_regressions:
  stage: setup
  script:
    - curl -fL https://getcli.jfrog.io | sh
    - ./jfrog rt config --url=$ARTIFACTORY_URL --user=$ARTIFACTORY_USER --password=$ARTIFACTORY_PASS derp-pary
    - ./jfrog rt c show
  
vm_regressions:
  stage: test

  script:
    - python3 cicd/vm_regression.py --flags_out_the_whazoo

  after_script:
    - find /tmp/$CI_PIPELINE_ID -name '*.box' -exec rm {} \;
    - ./jfrog rt u /tmp/$CI_PIPELINE_ID/ fun_repo/gitlab/$CI_PIPELINE_ID/
    - rm -Rf /tmp/$CI_PIPELINE_ID

  only:
    - master
    - vm_hardening

This is edited from the original, sanitized and simplified.

I can see setup_regressions runs and jfrog outputs some stuff. It's definitely there at that point. However, the command is not found when it's run in the after_script section of vm_regressions. Why is this?
