|
poemdexter posted:A good example of this is a feature we might have wrapped in a feature toggle so we'd sorta wanna test it feature-on and feature-off in parallel manually. Is the feature on/off state compiled into your code or something? Another approach would be to set a default value, but allow the feature to be toggled on/off via a config file, DB entry, API endpoint, or whatever you prefer. Then deploy the same artifact to all QA servers, and use config management / etcd / whatever you normally use to configure the app to toggle the flag on some boxes and not others.
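A minimal sketch of that pattern, with a made-up flag name: the app reads the toggle from its environment with a safe default, and whatever sets that variable (Chef, an etcd-backed config, a plain env file) differs per box while the artifact stays identical.

```shell
# Hypothetical flag; config management exports FEATURE_NEW_CHECKOUT=on
# on the "feature on" QA boxes and leaves it unset everywhere else.
FEATURE_NEW_CHECKOUT="${FEATURE_NEW_CHECKOUT:-off}"   # default: off

if [ "$FEATURE_NEW_CHECKOUT" = "on" ]; then
    echo "new checkout flow enabled"
else
    echo "new checkout flow disabled"
fi
```

Same binary, two behaviors, and flipping the flag back is a config push instead of a redeploy.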
|
# ¿ Nov 10, 2016 18:45 |
|
Jesus. I'll be curious how that goes for you. What kind of EC2 instance, and what Java heap/GC/etc. settings, are you using on the master? We have a much smaller though still busy Jenkins instance, and it falls over every week or two under normal usage. I haven't taken the time to dig into it because it's not a huge deal if Jenkins is down a few minutes a month, but I would like to stabilize it without just wallpapering over the problem with way too much hardware. edit: this reminds me I need to go make sure no one is running builds directly on the master again. For a while someone was running a horrible PHP-based job that ate all the RAM. I stamped that out but now I'm wondering if someone else has done it again, intentionally or not. Docjowles fucked around with this message at 02:09 on Nov 12, 2016 |
# ¿ Nov 12, 2016 02:06 |
|
Virigoth posted:We use a c4.2xlarge for the master. I should say it's 250 slave executors on other standalone Jenkins slave boxes. I'll look up data tomorrow and post it. I can also toss up a script that checks for jobs on master if you need an easy one to run and get a report back on a timer. Sure, that'd be super helpful. As well as any non-default JVM tuning you may have done. Although we actually are on Jenkins 2.x now so it may not be comparable. We upgraded in part in the hopes of better stability, but nothing's that easy. Our master's VM is comparable to a c4.xlarge (so half the horsepower) but coordinating WAY fewer jobs. Which is making me very suspicious when it crashes, since it really shouldn't be doing much work. All of our jobs are supposed to take place on slaves, too. But there's a few devs from Ye Olden Days of the company who just do whatever the hell they want, because that's how it was done in 2006 (with 1/20th the number of engineers competing for resources), and it was good enough then
|
# ¿ Nov 12, 2016 02:39 |
|
Oh hey that owns. Thanks for sharing.
|
# ¿ Nov 23, 2016 16:20 |
|
fletcher posted:We use the ELK stack for logging and I ran out of disk space for elasticsearch way sooner than I thought I would. How can I tell which log entries (i.e. by hostname or something) are consuming the most disk space? Probably dumb question, but are you accounting for replicas? However many replica shards you use will multiply your index size, and could cause disk usage to appear crazy if you aren't expecting it. If it's not that, it might be easiest to examine the raw logs, if they're accessible. Which ones are the biggest per day? ES does have a _size field you can query on but it's not enabled by default. You could update your mappings with that and wait for the index to roll over to the next day, and see what shakes out. Failing all that, you can just make a quick visualization of log lines per day by hostname, which should be a decent proxy for size. Obviously that doesn't work if one log is spamming billions of massive Java stack traces and one is just a normal HTTP access log, but it's a start. I'm pretty aggressive about dropping any message fields I don't care about (like _all) to maximize my storage space. You can also tweak the index.codec setting, either globally or per index, to use slightly better compression. And finally, set up a cron with something like curator to drop indices older than N days.
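If the raw logs aren't handy, the per-day index sizes from _cat/indices get you most of the way. A sketch with invented sample data; against a real cluster you'd replace the here-doc with something like `curl 'localhost:9200/_cat/indices?h=index,store.size&bytes=gb'`:

```shell
# Sample _cat/indices-style output (index name, total size in GB).
# Sort numerically by the size column and keep the winner.
biggest=$(sort -rn -k2 <<'EOF' | head -1 | awk '{print $1}'
logstash-2017.06.01 12
logstash-2017.06.02 48
logstash-2017.06.03 9
EOF
)
echo "largest index: $biggest"
```

Since Logstash names indices by day, the biggest daily index points you at which day (and, combined with a per-hostname visualization, which source) is eating the disk.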
|
# ¿ Jun 2, 2017 02:06 |
|
^ That guide "Elasticsearch: The Definitive Guide" is actually a very good read. Even just the intro sections that describe fundamental ES concepts like node, index, cluster, document, shard etc. Regarding replicas, you might still have replica shards configured even if you only have one physical host involved. By default, ES gives you 1 replica of each primary shard, which doubles the on-disk size of every index once it can be allocated. You can check by doing a "curl localhost:9200/_cat/shards". A p in the third column indicates a primary shard, an r indicates a replica.
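A quick way to tally primaries vs replicas from that output. The sample data here is made up (note the replicas sitting UNASSIGNED, which is what you'd see on a single-node cluster); on a real box you'd pipe from `curl localhost:9200/_cat/shards` instead:

```shell
# Sample _cat/shards-style output: index, shard number, p/r, state.
shards='logstash-2017.06.07 0 p STARTED
logstash-2017.06.07 0 r UNASSIGNED
logstash-2017.06.07 1 p STARTED
logstash-2017.06.07 1 r UNASSIGNED'

# Count lines whose third column is p (primary) or r (replica)
primaries=$(printf '%s\n' "$shards" | grep -c ' p ')
replicas=$(printf '%s\n' "$shards" | grep -c ' r ')
echo "primaries=$primaries replicas=$replicas"
```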
|
# ¿ Jun 7, 2017 14:42 |
|
fletcher posted:Sweet, thanks! Disclaimer: I am stuck on v2.4 of the ELK stack because it runs fine and I haven't had time to upgrade the world, so maybe this isn't true anymore. Logstash should be creating a new index every day, named with the date. Something like logstash-2017.06.09. You can view your indices with a command like "curl localhost:9200/_cat/indices". So when I need to make a mapping change, I normally just suck it up and accept that things are going to be different prior to that change. I change the template, and when the index rolls over at midnight, all new data is indexed with the updated rules. I only keep a week's worth of indices since that's what I can fit in the resources available to me, so any old crap rolls off pretty quickly. If you are keeping months (or forever) worth of logs and want them to all be consistently queryable, then yeah, maybe you will need to get into the reindexing API to process everything again according to the new mappings. Basically instead of posting one-off custom mapping rules, you want to post a new template that includes ALL the mappings you want. Logstash has a default template, but it kind of sucks (again, as of 2.4). There's a setting on the elasticsearch output in Logstash to point it at a new template you create, and another to force Logstash to overwrite the existing template. Set those, define the new template, and then the next time the index rolls over it will use all the updated mappings. You can see the default templates here as a starting point to modify. Docjowles fucked around with this message at 04:11 on Jun 10, 2017 |
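Those two settings are `template` and `template_overwrite` on the elasticsearch output. Roughly (the file path is hypothetical, option names are from the 2.x-era plugin):

```text
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # point Logstash at your own template file...
    template => "/etc/logstash/templates/logstash-custom.json"
    # ...and tell it to replace the template it already installed in ES
    template_overwrite => true
  }
}
```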
# ¿ Jun 10, 2017 04:06 |
|
Related question... if I am doing pretty standard ELK stuff, how much am I missing out on by remaining on 2.4? Any killer features or huge performance wins I'm a moron to pass up? Indexing a bit over 1TB/day between http access logs, exim mail logs, and random multiline Java barf. The whole thing Just Runs after a lot of initial tuning so I am reluctant to mess with it. But I don't want to get so out of date that updating is impossible and all documentation and guidance on the web no longer applies, either. My main gripes right now are 1) the dogshit error logging from Logstash itself (you either get nothing at all, or like 5 megabytes of stack traces with no newlines anywhere, GLHF) 2) Logstash will pass a configtest just fine, and then still crash on startup due to invalid config for many classes of errors 3) Missing out on Kibana innovation. The visualizations as of 4.6 are very limited as soon as you want to show more than one thing at a time. Plus random bugs that will never get fixed as the branch is EOL Which are annoying but not so bad that I want to spend like a week upgrading everything if I don't have to Docjowles fucked around with this message at 04:30 on Jun 10, 2017 |
# ¿ Jun 10, 2017 04:23 |
|
I have to assume that most people who manage non trivial Jenkins deployments hate it. It's a pain in the rear end for all the reasons cited already. Plus performance can get ungodly slow, though someone linked a blog post on Jenkins GC tuning a while ago and you are my hero for that because it was amazingly in-depth. The problem is that the list of strictly better open source projects out there with comparable feature sets is as follows: Uhhhh... *Beavis and Butthead laugh* Yeah.
|
# ¿ Jul 28, 2017 02:45 |
|
Blinkz0rz posted:I'm on my way there right now if only the T would run a little faster. as a Boston resident, I have bad news for you~ I went last year but had a conflict this week unfortunately and can't make it. Hope there's some good sessions! I always enjoy DevOps Days events, been to Denver and Boston so far.
|
# ¿ Sep 18, 2017 13:53 |
|
I think maybe you want to create an MST "transform" file to go along with your MSI.
|
# ¿ Jan 8, 2018 14:50 |
|
Yeah there is just way too much industry momentum behind k8s for any one company (besides Google) to dominate the conversation, IMO. Whatever Red Hat's social media manager might have to say about things. They'll steer some things but they don't "own" the project any more than they own the kernel or OpenStack. edit: I agree with you on the random bad taste in my mouth from past Red Hat sales pitches. At my last job the guy that came in to sell us on RHEV was a complete douchenozzle and he scuttled his own deal almost immediately. I don't hate the company top-to-bottom though based on one awful experience. They've done so much for Linux and open source over the years, even if it's lined their pockets at the same time. Never used Openshift, no opinion there. Docjowles fucked around with this message at 04:25 on Feb 1, 2018 |
# ¿ Feb 1, 2018 04:19 |
|
We ended up deploying openstack instead so that wasn't the only questionable decision involved! And idk, if the person whose job it is to describe the merits of their product can't actually do that, and instead spends a couple hours talking about how everyone else's product sucks, it's not exactly getting the vendor relationship off on great footing.
|
# ¿ Feb 1, 2018 23:10 |
|
freeasinbeer posted:Also Redhat clarified their plans for CoreOS and said that people were misreading the announcement. It looks like CoreOS lives!(for now at least) A couple threads here got confusing for a minute til I realized you changed your forum handle and avatar
|
# ¿ Feb 3, 2018 02:54 |
|
freeasinbeer posted:One was foisted on me, other I figured I’d dump a name I made when I was 12. Fair. The name I used when I first got on the internet (in the 90's, as I am super old) was extremely embarrassing and I'm glad I changed over to just pretending to be a fat old guy instead.
|
# ¿ Feb 3, 2018 03:32 |
|
You can read the Phoenix Project in like two days so it’s not a huge investment in any case. It’s definitely a book for managers trying to understand why traditional IT service delivery sucks rear end, or people who want to formulate that same argument for their own managers. It won’t teach you a drat thing about containers or kubernetes or Jenkins or infrastructure as code. But it might help you understand or explain why they are cool and good. I think it’s short, entertaining, and insightful enough to be worth a read. It’s literally The Goal retold for IT, if that helps place it.
|
# ¿ Feb 17, 2018 01:22 |
|
That thread owned and I’d love to see it revived. VC was totally carrying it, though, and had a kid which tends to mean the death of things like time to “read” and “think”. I’ve been reading the Google SRE book basically since it came out between caring for two kiddos and somehow still haven’t finished it. I did take a detour to read The Manager’s Path which is very very good, and great fodder for that thread if it does rise from the grave.
|
# ¿ Feb 17, 2018 01:52 |
|
Vulture Culture posted:I changed my job title on LinkedIn from Lead Site Reliability Engineer to Engineering Manager last month because of how untrue this is. Yeah I was gonna say, what I am seeing in job postings doesn't really bear this out. Companies are more than happy to spam out listings for ~* SRE *~ that are exactly the same thing traditional Ops people have been doing since time immemorial. Just like they would for DevOps Engineers before that. The SRE model is cool and good. If I don't have to pass a pretty legit software dev interview to earn that title, maybe that's not actually the model your company is using. StabbinHobo posted:being interrupt-bombarded with broken poo poo that somehow the onus is on you now for This is too real. I am ready to come hang out with you at your local Docjowles fucked around with this message at 03:57 on Apr 16, 2018 |
# ¿ Apr 16, 2018 03:52 |
|
freeasinbeer posted:I’m spoiled because with GKE it’s a checkbox that grabs stout and stderror, behind the scenes it’s running fluentd and scraping the pod logs. Yeah this is what we do (self-managed cluster on AWS built with kops). Containers write to stdout/stderr, which kubernetes redirects to /var/log/containers/ on the node. There's a daemonset running fluentd on every node. It tails all the logs and sends them to elasticsearch. Not much to it.
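The fluentd side of that boils down to one tail source over the directory kubelet symlinks the container logs into. A rough sketch along the lines of the stock kubernetes fluentd configs (details vary by image):

```text
<source>
  @type tail
  # kubelet symlinks every container's log file here on the node;
  # the daemonset mounts /var/log from the host to see them
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  format json
  read_from_head true
</source>
```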
|
# ¿ Apr 17, 2018 15:07 |
|
Methanar posted:How does everyone do their source control for Kubernetes and interaction with Kubernetes. Mao Zedong Thot posted:makefiles and yaml Hey I'm here to post the same question as Methanar and see if anyone has a different answer We've been doing a POC with kubernetes and have determined that it owns. But going from "a few engineers dicking around with no revenue on the line" to "production environment shared by a bunch of devs across a bunch of disparate teams, some of which are subject to government regulations" is quite the leap. Even in our simple test environment we've had people accidentally apply changes to the "production which is thankfully not really production" cluster that were meant for staging. Or do a "kubectl apply -f" without having pulled the latest version of the repo, blowing away changes someone else made. This is completely untenable. We easily could have a Jenkins job (or hell even a commit hook) that does the apply command and that would cover most cases. There are certain changes that require extra actions. But we could special case those. But it seems like there has to be a tool for this already because doing it manually is so janky and horrible. And I know companies way bigger than mine are running Kubernetes in production. Is that tool Helm? Something else? I agree Helm doesn't sound ideal.
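One low-tech version of the "Jenkins job does the apply" idea, everything here hypothetical: a pipeline that always checks out the latest manifests before applying, so nobody can ever apply a stale working copy or point the wrong context at production by accident.

```groovy
// Hypothetical declarative Jenkinsfile: the cluster only ever gets what's
// on master, because the job is the only path to kubectl apply
pipeline {
    agent any
    stages {
        stage('Deploy manifests') {
            steps {
                git url: 'https://example.com/k8s-manifests.git', branch: 'master'
                sh 'kubectl apply -f manifests/ --context production'
            }
        }
    }
}
```

The repo/context names are placeholders; the point is the fresh clone plus a pinned `--context`, which kills both failure modes described above.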
|
# ¿ May 2, 2018 04:19 |
|
jaegerx posted:Sadly I think you probably want openshift if you want developers going right into your cluster Seeing as I know everyone hates Openshift... what would you recommend as an alternative approach? We have a bunch of dev teams who all want to be able to deploy apps whenever. These are not microservices, but are mostly at least web apps written in modern-ish Java and actively maintained. If app behavior needs to be changed to work in a containerized world, it can be. We aren't trying to forklift some awful 1980's ERP thing. We have a pretty reasonable setup today where devs can deploy their applications whenever onto traditional VM's without involving operations. And can make relevant configuration changes via pull requests against our Chef cookbooks, with permission to deploy them once merged to master. But it's still slow and clunky and a waste of resources and ops ends up as a bottleneck more than we'd like. So now we've set up kubernetes which has a lot of advantages. But somehow ops applying changes is still a loving bottleneck and we need to fix that. We've gotten to the point where devs can update their containers with new code whenever and deploy that no problem. But any configuration change (deployments, services, configmaps, etc) is still manual and error prone. We are still early enough on that we could change almost anything if it turns out what we are doing is terrible. Docjowles fucked around with this message at 06:19 on May 2, 2018 |
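For the config-change bottleneck specifically, one sketch is namespace-per-team RBAC, so devs can change their own deployments/services/configmaps without ops in the loop while staying fenced out of everyone else's namespace. All names here are hypothetical:

```yaml
# Give one dev team full control of its own namespace only
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: team-payments
  name: team-payments-deployer
rules:
- apiGroups: ["", "apps", "extensions"]
  resources: ["deployments", "services", "configmaps", "pods"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: team-payments
  name: team-payments-deployer-binding
subjects:
- kind: Group
  name: team-payments   # e.g. the O= field of the team's client certs
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-payments-deployer
  apiGroup: rbac.authorization.k8s.io
```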
# ¿ May 2, 2018 06:07 |
|
Methanar posted:I can't stop laughing at your avatar. Haha. This is more or less what it used to be before some dipshit bought me that awful anime avatar last month. Finally decided to drop the :5bux: to fix it. Thanks for the input, everyone. It's both comforting and disturbing that everyone else has the same problems and no real solutions.
|
# ¿ May 2, 2018 19:16 |
|
Hey Kubernetes friends, I'm here again to ask how you are solving a few more unsolved problems 1) There is an issue causing DNS lookups from within pods to frequently timeout, which is bad. This is tracked in a whole bunch of places as they are still trying to figure out who is responsible for fixing it. The root cause appears to be a very low level issue in libc or even the kernel so for most of us it's going to be "who has the best workaround" unless you want to run bleeding edge packages on all your nodes. Most of our apps aren't too bothered, but we have one lovely PHP app that is getting absolutely destroyed by this. Which is unfortunately responsible for a lot of our revenue, because of course it is. Some light reading: https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02 https://github.com/kubernetes/kubernetes/issues/62628 https://github.com/kubernetes/kubernetes/issues/56903 https://github.com/weaveworks/weave/issues/3287 (problem is not specific to Weave, but we happen to use Weave) Have you seen this too? And if so, how are you mitigating it? 2) Authentication. We're currently doing simple certificate based auth. You authenticate with a cert, and then get authorized to do or not do stuff based on comparing your cert's CN to the defined roles and rolebindings. Which works ok. But curious if anyone is doing anything fancier or that you just like better. We're at the point where we want to start writing automation for access requests and want to make sure we're automating a good process instead of just speeding up trash. My main issue with the cert approach is that kubernetes doesn't appear to support revoking certs/checking a CRL. You can effectively block a cert by removing its CN from any rolebindings, but that still doesn't give me the same warm fuzzies as totally blocking the cert in the first place. Docjowles fucked around with this message at 03:47 on May 8, 2018 |
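For (1), the mitigation most of those linked writeups converge on is the glibc `single-request-reopen` resolver option, which sidesteps the parallel A/AAAA race by retrying over a fresh socket. If your cluster is new enough to have pod `dnsConfig`, it's a per-pod setting; a sketch (pod/image names hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: app
    image: example/app:latest
  dnsConfig:
    options:
    # workaround for the conntrack race described in the linked issues
    - name: single-request-reopen
```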
# ¿ May 8, 2018 03:39 |
|
It can also depend how many teams you're talking about, how big they are, and how busy Jenkins is. Whoever initially set up our Jenkins master did one for the whole company. Which was probably fine at the time. Then we grew a bunch and there's like 15 teams all sharing it and the master is almost always executing jobs. This makes it a loving pain in the rear end to do upgrades/maintenance without making someone mad that we interrupted their jobs. Or someone decides to install a plugin and it breaks some other random team's poo poo. We're looking at splitting that up now into per-team masters, and having the tech lead for each one own its care and feeding.
|
# ¿ May 9, 2018 21:52 |
|
Well the idea is that it goes from one Jenkins I am responsible for to a bunch of Jenkinses individual teams are responsible for. We provide a platform and then the teams are delegated access to do what they need on it. But I take it I'm doing something very wrong here so am open to suggestions. I'm trying to do the neighborly DevOps thing here. We get a disproportionate number of tickets requesting changes to Jenkins, upgrades, new plugins, new nodes. Everyone wants their change now. Yet if it's down for 10 seconds HipChat starts blowing up with "hey is Jenkins down for anyone else?!? Are Jerbs aren't running" comments. I want to get out of the business of managing Jenkins. Unfortunately it's also critical to the business and a shitton of jobs have built up in there over the years, so just switching to something better isn't possible overnight. How do you all deal with this? Features of the paid Cloudbees version? Schedule a weekly maintenance window and tell people "tough poo poo, wait til Wednesday nights, and at that time the thing will be restarted so don't schedule or run stuff then"? Some other incredibly obvious thing I am missing?
|
# ¿ May 10, 2018 02:04 |
|
Embrace the giant red box of security shame whenever you click Manage Jenkins Warnings have been published for the following currently installed components (list of all currently installed components follows)
|
# ¿ Jun 24, 2018 04:10 |
|
It's a tough thing to balance. We had a dev team successfully politic their way to their own AWS account with full control over it, minimal oversight from the operations team. At some point they requested an increase in the EC2 instance limit to 1,000 for like 10 different types in multiple regions. Then someone with admin access committed his access key to GitHub, and we spent $17,000 on buttcoin miners in about 15 minutes (they hadn't enabled 2FA, naturally). We caught it immediately and shut it all down, and AWS thankfully refunded most of that. It was great. They do not have YOLO-tier access anymore. I'm very receptive to the argument that teams need access to spin up and manage stuff. If you aren't going to let them move fast, why are you paying a huge premium for the cloud? But definitely put some guardrails on. There's a medium somewhere between old-school Ops as gatekeepers who say no to everything, and giving the root login to people with zero security or operational background. Interested to hear how others have threaded the needle, too.
|
# ¿ Jun 28, 2018 15:03 |
|
edit: double post
|
# ¿ Jun 28, 2018 15:03 |
|
You certainly can request limit decreases. I did it after Team Chucklefuck raised all their limits to 1000 for no reason and got hacked. Just open a support case and ask them to do it. It’s not one of the choices in the drop down but I just filed it as “general account question” or something and they took care of it. I had support set the limits to 0 in regions we don’t and probably will never use, and something closer to the number of instances we actually run in the regions we do use. You can always raise them again if needed.
|
# ¿ Jun 29, 2018 11:58 |
|
We have also been using Artifactory for years and it is fine. Other than “it costs money” I don’t really have a single complaint offhand. It’s shared by like a dozen teams pushing and pulling every possible type of artifact and just works. Compared to our self hosted Atlassian poo poo it is a model of stability. Which I guess isn’t saying much. Docjowles fucked around with this message at 05:56 on Sep 13, 2018 |
# ¿ Sep 13, 2018 05:49 |
|
Mostly the JVM running out of memory lol. We tune the heap and GC settings and eventually give in and give it more RAM, but it always seems to expand to fill it and the OutOfMemory exceptions return. It’s not like every day but enough to be annoying and disruptive.
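For reference, the knobs in question look something like this. The values are examples only, not recommendations, and the dump path is hypothetical:

```shell
# Example JVM tuning for a Jenkins-style service: pin the heap, pick a GC
# with a pause target, and capture a heap dump when it OOMs anyway so the
# next postmortem has actual data.
JAVA_OPTS="-Xms4g -Xmx4g \
  -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/jenkins/"
echo "$JAVA_OPTS"
```

Setting -Xms equal to -Xmx avoids resize churn, and the heap dump flags cost nothing until the crash you were going to have anyway.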
|
# ¿ Sep 14, 2018 12:49 |
|
Helianthus Annuus posted:Good answer. OP, if you can't readily switch jobs, I advise you to read up on these devops concepts and figure out ways to apply them to problems at your current job. If you can come up with a plan to improve something and execute it, that's a great story to tell at your future interviews. This is basically how I learned puppet/config management years ago. I had started to hear about this stuff from various sources and it sounded very interesting, but had no hands on experience whatsoever. A project came up at work where I needed to configure a whole bunch of nearly identical Linux boxes and I said gently caress it, I can totally justify learning Puppet on company time to knock this out. The project was successful, people loved seeing the power of automation vs hand configuring a bunch of crap over and over. I got to learn a hot new thing (and then parlay that into a new job at a 50% raise shortly after lmao, though I had no such Machiavellian scheme when I started out). I’m not outright advocating Resume Driven Development here. Don’t just set up kubernetes to run your company’s old rear end Java monolith for shits and giggles. But if you can find some way to get what you want to learn into a project where it actually makes sense, by all means push for that.
|
# ¿ Sep 29, 2018 01:45 |
|
Bhodi posted:what the hell is 12 factor, do i have yet another bullshit invented term to learn? jesus, i hate computers Are you vaguely aware of how to write and operate modern applications, where modern is like TYOOL 2012? It is that. https://12factor.net/. Plus the usual "make your app stateless!!!" "ok but the state has to live someplace, what do we do with that?" "???????????" Which is not at all to say that you should stop hating computers. gently caress computers. Methanar posted:I know of a company that doubled down on Chef so hard they built their own monstrous artisinal distributed chef cluster to handle tens of thousands of nodes being provisioned at once, multiple times a day, rather than use pre-built AMIs, or something else immutable. I see you've met my coworker. Guy has a simple decision making process. Is $thing written in Ruby and Chef (ideally by him)? It is good. No? It is utter trash and needs to be reimplemented in Ruby and Chef, by him. Corollary: if it was written in Ruby/Chef but not by him, it sucks, and needs to be rewritten by him. I swear 95% of this dude's tickets are "rewrite cookbook for no goddam reason", or "build tool nobody asked for". Opened by himself. edit: Bhodi posted:don't doxx me I mean this is basically how we got k8s into our org at my current job, except it was from the Director level instead of a rando ops engineer so we actually had power to tell people to rewrite their poo poo to not monopolize an entire worker node with one 64GB Java heap container that cannot tolerate downtime ever Docjowles fucked around with this message at 04:18 on Sep 29, 2018 |
# ¿ Sep 29, 2018 04:15 |
|
Give me the billion dollars. I promise to at least set up PXE boot.
|
# ¿ Oct 9, 2018 02:45 |
|
FWIW Terraform is, loving finally, starting to introduce some more legitimate looping constructs instead of that "count" garbage. https://www.hashicorp.com/blog/hashicorp-terraform-0-12-preview-for-and-for-each
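Per that preview post, the announced syntax looks roughly like this (resource names and AMI are placeholders, and details may shift before release): `for_each` over a map instead of count-with-index arithmetic, plus real `for` expressions.

```hcl
variable "instances" {
  type = map(string)
  default = {
    web = "t3.small"
    db  = "t3.large"
  }
}

resource "aws_instance" "this" {
  # one instance per map entry, addressable by key instead of a fragile index
  for_each      = var.instances
  instance_type = each.value
  ami           = "ami-12345678"   # placeholder
  tags = {
    Name = each.key
  }
}

# list comprehension, also new in 0.12
output "upper_names" {
  value = [for name in keys(var.instances) : upper(name)]
}
```

The practical win over `count` is that removing one map entry destroys only that instance, instead of reshuffling every resource after it in the list.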
|
# ¿ Oct 24, 2018 20:19 |
|
Gyshall posted:I look forward to this being fully documented in Terraform 0.76. I'm enjoying that it was announced in July and still has not been released Not to mention all the caveats in the post about how the initial feature will be very limited and will grow over time.
|
# ¿ Oct 26, 2018 02:53 |
|
I wouldn't hold my breath for Terraform to hit 1.0. It's been developing at an absolutely glacial pace lately.
|
# ¿ Oct 26, 2018 18:48 |
|
I bought it last night, so maybe I can post some initial thoughts before the deal expires The only one I already owned was the followup to the Google SRE book, which is good. I highly respect the authors of several others (Kelsey Hightower on the Kubernetes book, and Charity Majors on the database one, for example) so I expect they will be good, too. Seems like a great value, especially if you can get work to expense it!
|
# ¿ Nov 6, 2018 15:41 |
|
Yeah I am going to re:Invent. Went last year as well. Looking forward to a day or two of ambitiously waiting in huge rear end lines before giving up and switching to day drinking with a pinky swear to watch the sessions later online.
|
# ¿ Nov 25, 2018 01:53 |
|
Current re:Invent status: Waiting in an hour long line to even register for the conference. Going to miss my first session. No food or coffee because everyplace that serves those also has an hour long line. My coworker went to a different venue, was registered and had coffee in like 10 minutes. Currently researching the legality of murder in Nevada.
|
# ¿ Nov 26, 2018 18:32 |