necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost

Mr. Crow posted:

The DevOps 2.0 Toolkit: Automating the Continuous Deployment Pipeline with Containerized Microservices https://www.amazon.com/dp/B01BJ4V66M/ref=cm_sw_r_cp_apa_sQjfzbWBC57FV

The space is moving very fast but it's pretty up-to-date.
It's depressing to see so many places that will have basically none of the above because they're culturally so far gone that even the idea of deploying faster scares leadership: "why would we want to deploy faster? Going slower means it's more reliable!" I see so much backlash against anything that fast-moving companies do, while people in really low-performing enterprise shops act like their jobs deploying J2EE apps almost exactly like it's 1997 are so hard compared to what modern software companies do.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
I have to disagree; mean time to recovery is very well understood even by old-school curmudgeons, because the big bureaucratic places have been getting familiar with disaster recovery scenarios where RPO and RTO figures start to matter - they help determine how much money is at risk and what they're willing to spend on what is basically an insurance policy. Most of the old-school operations management folks in the US come from military or manufacturing backgrounds in terms of culture and primitives, because those are among the most well-studied fields academically. This isn't to say those backgrounds aren't applicable to tech; Tim Cook's background is manufacturing, supply chain operations, and logistics, after all.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
It's fine to Dockerize a database as long as you either don't expect it to keep its data for a long, long time or have a solid grasp of the data volume's lifecycle (see: the Postgres K8s operator), but for most people running big ol' production clusters that doesn't apply. I use Docker containers for launching temporary databases in CI builds and for comparing / contrasting different configuration settings for different use cases.
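Roughly what that looks like with the Docker SDK for Python - a throwaway Postgres that lives only for the CI run; the image tag, password, and host port are placeholders:
code:
import docker

client = docker.from_env()

# Start a throwaway Postgres; auto_remove tells the daemon to delete it on exit.
pg = client.containers.run(
    "postgres:13-alpine",
    detach=True,
    auto_remove=True,
    environment={"POSTGRES_PASSWORD": "ci-only"},
    ports={"5432/tcp": 55432},   # arbitrary host port for the test run
)

try:
    # ... run migrations and integration tests against localhost:55432 ...
    pass
finally:
    pg.stop()   # nothing outlives the CI build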

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Given the work of managing data volumes / persistent volumes to work around the ephemeral-by-default nature of overlay filesystems, it's just less risky to deploy a mature RDBMS like MySQL or Postgres without the additional overhead of Docker containers involved. There really doesn't seem to be much advantage to containerizing Postgres or MySQL besides having a single consistent deployment tool.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Containers do not replace configuration management. However, they let you get to a place where you can at least separate application configuration and artifacts from operations concerns such as logging, monitoring, kernel tuning, etc. The model also makes it easier to enforce 12-F applications instead of allowing lazy developers to dump files all over a filesystem.
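The 12-F bit in practice mostly means "config comes in through the environment, not through files scattered across the filesystem" - a tiny sketch, variable names made up:
code:
import os

# Environment-specific settings arrive as env vars at runtime; the image
# itself stays generic across environments.
DATABASE_URL = os.environ["DATABASE_URL"]                        # required: fail fast if missing
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")                  # optional with a default
FEATURE_X = os.environ.get("FEATURE_X_ENABLED", "false").lower() == "true"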

One new problem that arises is that containers can quickly become outdated if they're created in the typical lazy manner that pulls in way more dependencies than necessary (all those lazy people using Ubuntu base images are scary, man). However, you can very quickly update your container hosts (may God have mercy on your soul if you're not using container orchestration), and many patching tasks become specific to application containers. This takes much of the change management burden off operations teams and helps achieve higher density and better separation of containers. For example, you can reschedule containers to a set of nodes that are isolated from known updated containers.

I still use Puppet and Chef to provision and maintain my worker nodes that host my containers.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Private Docker registries can be fairly easy to set up provided you have certificates that don't suck (read: self-signed certs that you have to add to every client are a pain in the rear end). I'm hosting one at work as an nginx container terminating the SSL connection and proxying to the Docker registry container, on a single instance in an ASG sized to 1 and backed by an S3 bucket. Pulling images out of AWS can suck a bit cost-wise if you have a lot of developers pulling locally, but if you're primarily pulling from within AWS it's pretty nice.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
I didn't want to have even a chance of the container being accessible on the public Internet, and since ECR doesn't support provisioning inside a VPC last I saw, that was a no-go for me. The EC2 instance + EBS probably costs more than what ECR would cost us, but with a $280k+ / mo AWS bill from gross mismanagement (110 RDS instances idling 95% of the time, rawr) I'm not being paid to care about cost efficiency anymore.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
You do not want this management. At all. Our infrastructure is horrific, and even as a consultant who's worked across the public sector and about half the Fortune 100, I think this is in the bottom 10% of organizational capability around infrastructure management. While I don't face organizational barriers to doing a lot of things other places would care about, there are so many other problems stemming from the lack of organization that it's a different form of paralysis.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Most places I've seen that bitch about cost treat AWS like a datacenter without even looking at options like reserved instances, S3 durability and redundancy tiers, and bandwidth contracts (yes, they do have them to help lower egress costs substantially, mostly of use when you get to petabytes / mo in transfers).

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost

Blinkz0rz posted:

How does this work?
I swear I saw this on an AWS page several months ago but have tried repeatedly and in vain to find it again, so it's possible it was offered at one point and pulled - be warned. The gist I saw was this: you pay for an annualized contract on reserved bandwidth for your account, where the first discount tier starts at about 10 TB / mo at maybe 5%. For us, recorded outbound bandwidth was measured at 5+ PB annually. The discount was somewhere around 15%+ at that point with the fields I saw, which may have been enough for us to reconsider the plan. AWS regularly offers its largest customers really, really substantial discounts on its rates. I've now worked at what were supposedly the #3 and #5 biggest customers in terms of AWS spend, and when you're talking $30MM+ / mo you can ask for a lot of discounts on instances and services, but network seemed to be off the table for us.

I really do swear I saw such a document, because I was about to march into the engineering director's office with it to counter the egress-cost argument (the prime mover away from AWS) and avoid having to deploy our junky flagship product into an expensive OpenStack cluster that ties us to a datacenter.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Changing the AMI in the launch configuration of the ASG won't rotate the instances, if that's what you're trying to do automatically. You can use CloudFormation with an update policy that will roll the change through. In my experience it's usually better to treat ASGs as immutable and to launch new ASGs with new launch configurations, so it's easier to attach and detach groups of instances atomically to an ELB.
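A sketch of the update-policy approach using troposphere (which comes up again later in the thread); names, sizes, and the AMI are placeholders, and bumping the ImageId is what triggers the rolling replacement:
code:
from troposphere import Ref, Template
from troposphere.autoscaling import AutoScalingGroup, LaunchConfiguration
from troposphere.policies import AutoScalingRollingUpdate, UpdatePolicy

t = Template()

lc = t.add_resource(LaunchConfiguration(
    "AppLaunchConfig",
    ImageId="ami-00000000",          # bump this to trigger a rolling replace
    InstanceType="m4.large",
))

t.add_resource(AutoScalingGroup(
    "AppAsg",
    MinSize="2",
    MaxSize="4",
    LaunchConfigurationName=Ref(lc),
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    UpdatePolicy=UpdatePolicy(
        AutoScalingRollingUpdate=AutoScalingRollingUpdate(
            MinInstancesInService="1",   # keep capacity while instances rotate
            MaxBatchSize="1",
            PauseTime="PT5M",
        )
    ),
))

print(t.to_json())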

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
As a user of Jenkins for nearly... 6 years now (since back when it was known as Hudson), I can't really say it's great for a lot of software projects besides really bloated enterprise ones where you're totally cool writing more code for the sake of automating more weird things in your build process. A lot of the plugins really piss me off (the SSH agent plugin still doesn't support the "new" ssh private key format, and the errors are completely bizarre).

Jenkins' scripted and declarative pipelines are where most competitors went philosophically (think Travis CI), but Jenkins is still the same under the hood, with that horrific XML-based configuration defining what is ultimately a freestyle job. That technical baggage is of little value to you as a user and ultimately undermines the experience. Wrappers never quite get away from this harsh reality, but after writing 50+ Jenkins pipeline scripts of both types I'd still rather just use Jenkins Job Builder https://docs.openstack.org/infra/system-config/jjb.html https://docs.openstack.org/infra/jenkins-job-builder/

Jenkins requires a lot of up-front investment - more than most other CI options nowadays - and 90%+ of the time I'd rather just build artifacts with either a bunch of shell scripts (run the same way developers have run them locally since the dawn of time) or with a tool that has some time-saving opinions. I really don't like picking and choosing my buggy plugin-of-the-week for test result parsing or for how to install tools.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
One of the recommendations I read from the CloudBees folks is that Jenkins pipeline jobs should primarily orchestrate across nodes, while your Jenkins freestyle jobs and your build systems like Rake, make, Gradle, Maven, etc. should be the fundamental build unit. Given how easy it is to show several jobs in parallel in a pipeline job (I still dislike the trigger and blocking conventions), that makes some sense, and it also shows where the developers' priorities have been in the design of pipelines.

Vulture Culture posted:

Is this still your take with the Blue Ocean stuff, or is this take solely restricted to old-style pipeline management? I'm looking for a decent CI setup for Chef and other infrastructure code.
As said above, Blue Ocean is a separate view layer for Jenkins requiring per-branch Jenkinsfiles, and it takes the place of the various terrible-looking build pipeline dashboards out there. In vanilla Jenkins I spent a few days trying to get some dashboards installed to show several build pipelines at once on our office TVs, ordered by different criteria, and I just hated the experience. Unfortunately, Blue Ocean is also super limited because it only worked for Maven builds and JUnit, so it appears highly coupled to that tooling - a problem if you're looking for more flexibility, which should be 99%+ of Jenkins users out there now, I hope. It is total demo-ware and pie-in-the-sky for the realities of the horrific war-crime tribunal of software at my employer.


Also, I think Jenkins will be less and less relevant outside all but the largest companies as people start using containers as build environments, so you can use CI like Drone.io (or something similar but hosted). I freakin' hate trying to set up auto-install tooling in Jenkins and am tired of spending weeks trying to get rvm and rbenv and pip w/ virtualenv and whatever other build system to work in Jenkins like it does on a developer laptop, while people are confused about why it takes so much effort to make jobs in Jenkins.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Code re-use for devops config type stuff is so much worse than in regular software it's mind-boggling.

Has anyone found a decent Terraform module or plugin for generating CloudFormation blueprints? How about a Terraform plugin that can do multi-stage deployment variants like swapping ASGs, blue-green deploys, slow roll-outs, etc.? I don't think even enterprise Terraform does this stuff yet and that bothers me more than it should. We have a tool that's basically a ton of horrific ERBs cobbled together loosely with our Puppet modules and it generates CF blueprints while half-assing some deployment methods (but better than what Terraform has, sadly), and I'd like to have a migration path off of it to Terraform while preserving our existing templates to some extent.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
I used Troposphere for some personal demos in the past but never thought it'd work particularly well for a larger codebase going forward, and never tried to adopt it professionally. Most of the other opsy people here aren't fundamentally coders, and it's usually easier to push something that has more documentation (also, I don't want to answer Python questions all day). I totally forgot that Troposphere works for OpenStack though, so it may work well enough for me to use it for service stacks instead of Terraform. So that would be Terraform for the infrastructure stack, and each service's stack maintained as testable Python. This makes too much sense, so it won't fly here I think.
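What I mean by "testable Python": a troposphere template can be asserted on in a plain pytest test before anything touches AWS. The stack below is a stand-in, not a real service stack; all names are made up:
code:
import json

from troposphere import Template
from troposphere.autoscaling import AutoScalingGroup
from troposphere.policies import AutoScalingRollingUpdate, UpdatePolicy

def build_service_stack() -> Template:
    # Stand-in for a real per-service stack module; just enough to test against.
    t = Template()
    t.add_resource(AutoScalingGroup(
        "AppAsg",
        MinSize="2",
        MaxSize="4",
        LaunchConfigurationName="app-launch-config",   # placeholder
        AvailabilityZones=["us-east-1a"],
        UpdatePolicy=UpdatePolicy(
            AutoScalingRollingUpdate=AutoScalingRollingUpdate(MinInstancesInService="1")
        ),
    ))
    return t

def test_service_stack_rolls_updates():
    doc = json.loads(build_service_stack().to_json())
    asg = doc["Resources"]["AppAsg"]
    assert "UpdatePolicy" in asg                 # rolling update policy is present
    assert asg["Properties"]["MinSize"] == "2"   # capacity assumptions hold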

But really, I can't do 90%+ of what most shops on AWS do, for various reasons ranging from "we're moving off of AWS eventually, don't give them more money" to "we seem to make the literal worst possible technical decisions but somehow keep the lights on." For example, for each of my deployments I have to file a DNS change ticket and wait for someone to change the A records and CNAMEs. All the problems of classic IT, few of the advantages of cloud.

Blinkz0rz posted:

Also, don't do scaling groups with Terraform. Consider it this way, Terraform is how you set up your immutable-ish infrastructure; it's where your IAM roles, security groups, load balancers, S3 buckets, and the like go.
We're familiar with some of the scalability limits of CloudFormation because of how our stack generators work. We ran out of stack outputs, for example, because we were giving every service in a 15+ service system its own subnet. But at least CFN supports rollbacks... usually.

I've got Spinnaker and Urbancode Deploy POCs on my roadmap out into next year but there's no point if the software is super stateful like the mess here. Postgres as service discovery, wtf. I'm about to use this https://github.com/adrianlzt/hiera-postgres-backend :smithicide:

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
I know about them and know how to use them (I've designed single-button deploys of 20+ ZK node clusters with 500+ clients before for Hadoop). The developers.... do not and have no interest in doing so. Management has nothing on the roadmap with the words "service discovery" in it at all.

I'm trying to cloud-wash / port a legacy system that keeps track of system configuration in the same database where business transactions happen (it's a Grails app that's grown to monstrous size as the frontend to a batch and event processing system, but nobody learned anything other than Grails and other web transaction stacks for 8 years). There's a unique table for each service type; when you boot a new node, it calls the uber-database and adds itself to its respective tables if it doesn't find its self-assigned ID in the rows, and ops customizes its always-unique, customer-specific configuration in the UI. The way to automate replacement of an existing node is, on boot, to run a DB transaction that finds the primary key of a node with the roles you're replacing, deletes the old node's row, and changes the primary key of the node that just registered to the deleted one (the UUID is separate from the row ID).

These are all solved problems with ZK and friends but I think it'll be a cold day in hell before we get around to service discovery of any sort, so this will be a long-term approach that works at the "scale" here.
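That boot-time replacement dance, sketched very loosely with psycopg2 and a completely made-up table / column layout (the real schema is per-service and nothing like this):
code:
import psycopg2

def adopt_identity_of_replaced_node(conn, my_uuid, role):
    with conn:                       # one transaction: commits on success, rolls back on error
        with conn.cursor() as cur:
            # Find the row of the node we're replacing for this role.
            cur.execute(
                "SELECT id FROM service_nodes WHERE role = %s AND uuid <> %s "
                "ORDER BY registered_at LIMIT 1 FOR UPDATE",
                (role, my_uuid),
            )
            row = cur.fetchone()
            if row is None:
                return               # nothing to replace; keep our fresh row
            old_id = row[0]

            # Delete the old node's row...
            cur.execute("DELETE FROM service_nodes WHERE id = %s", (old_id,))
            # ...and take over its primary key so the ops-customized config
            # keyed on that row ID keeps pointing at a live node.
            cur.execute(
                "UPDATE service_nodes SET id = %s WHERE uuid = %s",
                (old_id, my_uuid),
            )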

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
So back to the earlier question of CI / CD options besides Jenkins: there's one I forgot about that I think is worth looking into seriously, partly because of its CloudFoundry heritage - Concourse CI. It even has a nice comparison of what's wrong with Jenkins, GoCD, and TravisCI. It's "resource-centric" and builds run natively in containers, which side-steps some of the problems with Jenkins but can be a downer if you require a Lovecraftian enterprise horror dependency like a mainframe or dongles (if someone can figure out how to virtualize hardware USB dongles in AWS or GCP, let me know). The best part, though, is that you can run your builds locally before you make an idiot of yourself constantly hitting build and watching for syntax errors. When builds are a set of resource declarations with their dependencies linked, it's a lot like Puppet or Chef, and this is probably a more natural way of defining a build than a linear sequence of steps plus some hodge-podge of parallel steps. There are a lot of custom resources, including one for Terraform. Feature comparison page

Vulture Culture posted:

I expect exactly this, with a few Mad Libs substitutions, in basically any Puppet environment nowadays.
Well yeah, Puppet and Chef are mostly useful for very stateful systems that shouldn't have nodes going up and down frequently, and they're real awkward for elastic systems. I had enough problems with Chef node registration and de-registration alone.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Chef-solo and masterless Puppet are just superpowered shell scripts if you're going with immutable infrastructure / golden image semantics instead of continuous configuration management, and they still have their place (we have a two-pass image build with Puppet and have launch configurations set Hiera variables, which gets you a whole new codepath that's not tested, of course). When you're trying to avoid re-deploying a bunch of containers just to patch them or to propagate a tweaked value to certain systems, you could do it with applications written to watch configuration in something like Zookeeper or Etcd, but most developers are bad at dynamic configuration. You could use stuff like Netflix's Eureka, which seems to have some capabilities for re-configuring applications on the fly, but deploying a new foundational stack seems drastic (I'm using Dynomite at work for a greenfield project and that's the impression I got from how Dynomite Manager manages node configuration).
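If you do go the "apps watch their own config" route, this is roughly what it looks like against ZooKeeper with the kazoo client - the ensemble, znode path, and apply_log_level() helper are all made up:
code:
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")   # placeholder ensemble
zk.start()

@zk.DataWatch("/config/myapp/log_level")                # made-up znode path
def on_config_change(data, stat):
    # Fires once at registration and again whenever the znode changes.
    if data is not None:
        apply_log_level(data.decode("utf-8"))           # hypothetical helper in your app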

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
"DevOps" in the purely engineering skillset sense is the work most ops people don't want to or can't do and what most software engineers can't or don't want to do. Even among Googlers and Facebookers it's not like people want to be on-call ever for even a pay bump less than < $20k / yr (this is mostly for an SRE type role rather than a backoffice engineering efficiency type org where conditions tend to be better). Most places' builds and deployments are utterly crap and being asked to fix it is probably a fool's errand without a lot of managerial support and incentives to do it (unless the reason for releases sucking is entirely "we never had anyone that knew how to do it" rather than the typical "we have bad development practices and sling code until the last minute and throw it over to someone else" situation that exists in most companies larger than 50 people).

I picked this route expecting that most organizations would wise up and rally around releasing faster and testing better, because the economics would work out such that companies with bad practices would be wiped out. I chose wrong.

necrobobsledder fucked around with this message at 23:04 on Aug 4, 2017

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
There's a rainbow of different titles that seem to be oriented primarily around the axis of "how much of the toil of classic sysadmin-style operations work do you have to deal with day-to-day?", but all of them should treat sshing into machines to fix random problems like it's cancer.

Nobody ever has a title like "Agile Engineer" or "Program Engineer" or an "Agile Group," but boy are there tons of jobs with "devops" in the name. They do seem to come universally from big, bureaucratic companies that are not respected by software professionals. Most other organizations use terms like "Infrastructure Engineer," SRE, and "Cloud Engineer" that are actually more generic-sounding and reflect that the role is a type of software engineer, rather than a computer janitor you throw whatever Jenkins farted out at before going back to a world where all you need to care about is whether your code passes tests and builds in Jenkins - and maybe you get a ticket later that your code is segfaulting in prod.

Pollyanna posted:

And drat near everything wants AWS experience now.
Mark my words, AWS is the Java of infrastructure.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
I see it as far more than just infrastructure, but 99%+ of the huge companies I've seen that use it as a big part of their strategy will never be able to think of it as more than IaaS. Most of these CTOs have never worked for a tech company in their entire career, from what I've seen. That's a big deal.

I've worked with a handful of the top 10 largest AWS customers so far ($10MM+ / mo is the smallest) and they're treating it like a datacenter, with deny-by-default service offerings as if from a separate software vendor like MS, Oracle, etc. that must go through purchasing and so forth. It goes through review, rightfully so, but taking a year to approve Lambda and denying EFS is just sillypants. The deployment patterns I'm seeing are still lift-and-shifts taking 3+ years and rewrites that will take another 5+, so in 8 years they'll be where companies that use cloud somewhat right were 5 years ago, while the technology gap between laggards and fast movers has grown even larger.

The crazy thing is that datacenter management is so bad at the massive companies that $120MM/year for a couple hundred applications is a massive bargain. When you burn $2Bn / year on datacenters for worse uptime and worse performance than AWS - even if you let instances sit for 2+ years of uptime - you're just bad at tech, full stop.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost

StabbinHobo posted:

you can consult on these projects drat near in perpetuity, but all you've gone and done is spent 5 years making decent money off of dying/failing orgs.
Trying to save those that are drowning can drown you too, even if you're experienced and trained. While I know plenty of people make great money doing it (better than a staff engineer at a big tech company is my barometer), I'm not one of them, nor will I ever be, for various reasons.

Thermopyle posted:

I mostly use lambda nowadays, but I'd be interested in hearing what else you're talking about here as I don't really follow this space much.
There's a ton of service offerings on both AWS and GCP that fit "80%+ of users' needs" featuresets beyond "make me servers and networks." There are entire IoT device management / identity platforms, service desks, machine-learning-based analysis, etc. With such offerings all based around pay-as-you-go models, they can potentially create Salesforce-like businesses with every other thing they put out (not that they are, but iteration / MVP blah blah). The organization I'm at now is a victim in part of AWS commoditizing what used to be our "secret sauce" (admittedly, it really wasn't ever that hard IMO). And with so many companies being built on top of AWS now, it's super easy for Amazon to acquire companies and integrate them down the road.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
S3 and EFS are your primary options for large file storage. I wouldn't use EFS unless you're only storing files temporarily, due to its steep cost compared to S3, but EFS has substantially higher throughput (S3 has maxed out at 75 MBps / file for us with 16 concurrent multi-part download transfers running; EFS can go above that easily). It sounds like you'll need some form of content management system if you're trying to do more than give a UI to technical people. Heck, the product I work on is specifically built for people having trouble managing large multimedia files and pushing them to distributors. There's expensive stuff like T3Media / Wazee built for large content providers like sports networks, but that sounds like overkill. Really, anything more than Cyberduck or whatnot and you're starting to get into CMS territory anyway.

I'd set up an AWS Lambda function to handle triggering on uploads and checksums, use EFS for temporary file storage to make checksumming faster, run instances, use SQS for the task queue, set up a basic page hosted out of S3 for the UI, and store file checksums in S3. If your user volume isn't that high, your primary costs would be the instances mounting the EFS CIFS / NFS endpoints and performing checksums. The only reason I'd bring instances into this is that long-running Lambda functions can get pricey and transferring files out of S3 can be really slow in a single-threaded Lambda function (I don't think you're allowed to spawn, say, 16 threads to speed up S3 downloads as multi-part downloads).
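The glue for that, sketched as a Python Lambda handler fired by S3 upload events that drops a checksum task onto SQS for the worker instances - the queue URL is a placeholder:
code:
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/checksum-tasks"  # placeholder

def handler(event, context):
    # S3 put events arrive as a list of records; forward each object as a task.
    for record in event.get("Records", []):
        task = {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
            "size": record["s3"]["object"]["size"],
        }
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(task))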

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Using volume mountpoints in a container (read: persistent state) is accepted as long as you know how to manage them effectively. Most of the critics are talking about production situations when it comes to containerized applications and services.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Wait, Maven plugin AND Jenkins? Depending upon the Maven plugin used for Docker (some are substituting variables into Dockerfiles, others flat out generate them last I saw) you’re opening up Pandora’s box or signing up for cancer.

You may want to try to stick with a Jenkinsfile that will launch a Maven or Gradle target / goal and nothing more to contain the configuration drift problems that tend to happen with Jenkins installs.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
There’s a doc by Wilson Mar on the DSL that I found useful in the past. https://wilsonmar.github.io/jenkins-plugins/

I know people have had the most success with Jenkins by simply using shell scripts in the repo. It makes it tougher to override some tool paths, but I wrote up a list of shell variables that will be available from Jenkins and life was good. Only thing left is to figure out a way to let developers control the container used to build their project for matrix builds.
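Rough idea of what those repo-side build scripts look like when they lean on the env vars Jenkins injects (BUILD_NUMBER, WORKSPACE, GIT_COMMIT), with fallbacks so the same script runs on a laptop - the artifact name and make target are arbitrary:
code:
import os
import subprocess

# Jenkins injects these; the fallbacks let developers run the script locally.
build_number = os.environ.get("BUILD_NUMBER", "dev")
workspace = os.environ.get("WORKSPACE", os.getcwd())
git_commit = os.environ.get("GIT_COMMIT", "local")

# The actual build step is whatever the repo already does locally.
subprocess.run(["make", "dist", f"VERSION={build_number}"], cwd=workspace, check=True)
print(f"built myapp-{build_number}-{git_commit[:8]} from {workspace}")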

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Write a systemd timer or cron job to run Puppet; if the run fails, fall back to a script that reverts the change without Puppet involved, and also fire off an event noting that your automation has failed and that you should be ashamed.
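A minimal sketch of that wrapper - the revert script path and SNS topic are placeholders for whatever revert mechanism and alerting you actually have:
code:
import subprocess
import boto3

def run_puppet_with_fallback():
    result = subprocess.run(["puppet", "agent", "-t"])
    # With --test, exit code 0 = no changes and 2 = changes applied; anything else failed.
    if result.returncode in (0, 2):
        return

    subprocess.run(["/usr/local/bin/revert-without-puppet.sh"], check=False)  # placeholder script
    boto3.client("sns").publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:automation-failures",    # placeholder topic
        Subject="puppet agent run failed",
        Message=f"puppet agent -t exited {result.returncode}; fallback script executed",
    )

if __name__ == "__main__":
    run_puppet_with_fallback()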

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Anyone have suggestions on determining alert threshold metrics on a classic ELB for a normally lightly loaded system? We have few users, but given that each pays a lot, their experience absolutely matters. We currently alert on average ELB latency being over 1 second, and it's become alerting noise at this point, so I want it stopped. When there are maybe 2 concurrent requests and a user uploading a 200 MB file sets off alarms, that's just stupid. I was hoping to derive URL-based latency patterns from our Splunk history and only alert when latencies exceed a calculated set of historical percentiles.

I might be overthinking this, but I'd like our alerts to not suck given how relatively simple our system is. This is a common pattern I've seen with enterprise software, so I figure it's worth solving properly given I hate reinventing wheels. A single ELB in front of some middleware routing, with everything else a bunch of workers pushing and pulling on SQS queues and talking to a couple of plain RDS instances, shouldn't be tough to maintain in 2017.
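One cheaper stopgap that skips Splunk entirely: alarm on a high latency percentile sustained over several periods instead of the average. A sketch with boto3; the ELB name, thresholds, and alarm name are placeholders:
code:
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="elb-latency-p99-sustained",
    Namespace="AWS/ELB",
    MetricName="Latency",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-classic-elb"}],  # placeholder
    ExtendedStatistic="p99",          # percentile instead of Average
    Period=300,
    EvaluationPeriods=3,              # has to stay bad for 15 minutes straight
    Threshold=1.0,                    # seconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # a quiet system is not a broken system
)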

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
If we wanted the large POST request to return faster, we'd have to scan the payload while it's streamed (my suggestion) or just respond without scanning at all. The ELB latency metric doesn't care what the request type is; it just measures the time between the last byte leaving the ELB and the first byte coming back from the target instance. Because of this approach it's not really a strain metric for us at this point so much as a "someone uploaded a file" metric.

Nobody's about to change the code over one lousy metric that only affects a single alert and two whole ops people. Heck, the data should probably live off-instance anyway to make the service actually stateless (almost everything wound up being written a year ago to become... more stateful, ugh).

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Speaking of Splunk, we just got a Splunk Enterprise Cloud install (over 50GB of logs / day, around 9 million events daily) because it wound up being cheaper than Sumologic or any other SaaS log aggregation and analysis tool. Are they hurting for cash, or is $70k / year such a tiny account for them that they won't really respond to very simple support requests? It's funny how Sumologic was supposed to be the cheaper alternative to Splunk and now it's actually more expensive.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Look for companies that are deploying on Nomad, Rancher, ECS, or preferably Kubernetes. Otherwise, I hope you've got a strong specialty in something besides ops, like machine learning or Big Data BS. Even if the place is running on bandwagon technologies and it flops, you'll have a skill set that good companies want, too. Companies that haven't been able to get their services deployed into containers by now are likely to face serious friction trying to iterate quickly and safely, and are at risk of dying as they lag their market.

If a place needs to do stateful deployments, I hope you're working on stuff like Consul, Kafka, Etcd, Dynomite, or larger databases that you can't just A/B deploy in 2 minutes on a whim. Heck, even Elasticsearch is becoming old school at this point.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Half tongue-in-cheek, half serious. Legacy deployment gigs are pretty lovely in my experience, partly because most deployments and architectures with stateful systems are so half-assed in planning and roll-out that they're awful to take down for maintenance, not because there's something wrong with a stateful system operationally (stateful systems without decent / tested / proven HA are something else). Nearly 9 years ago I was rolling out non-prod Hadoop clusters configured with early Puppet versions - ugly, but it sure beat hand-configuring everything - so where's the progress since then as a community, besides approaches that require stateless systems / architecture in half your systems? I'm a nobody with a mediocre employer history, and it's disappointing that Silicon Valley companies are somehow still using the same ops tooling as these awful laggards despite how much better they're supposed to be at software and systems. This isn't the same as everyone using Git or Linux either.

Also, I have literally talked to engineers at Coca-Cola and many other non-tech companies locally, and only the old, outdated stacks and applications that aren't strategic are still doing classic Puppet and Chef and such. A surprising number of companies have already deployed Kubernetes into prod, and these are places I thought were far, far behind the curve. Even the middling-compensation companies have managed to move to stateless service components and everything. Puppet and Chef based gigs are few and far between now and pay middling rates, frequently below $90k (not very competitive, that is).

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
I’m a strong NixOps and NixOS proponent and would like to use it in prod professionally. :swoon: Similarly, the guys at Fugue AKA Luminal are insane and have written a declarative cloud management system that actually keeps enforcing changes... as a Haskell DSL.

Saltstack is used in prod at a few places I've interviewed with and nobody seems to have regrets - that alone gets me interested. It seems like it may have too small a community and not ambitious enough a backing company to compete with Chef and PuppetLabs, though. Ansible is a pain, with its many ways to fail to access a variable (there's a cheat sheet a guy wrote showing dozens of ways to access specific variable kinds), while minor version upgrades completely break or silently change behavior, making integration tests absolutely essential if you want it to adjust a cron job schedule instead of shooting your family after you go to sleep (seriously, all the cron job CM tooling I've ever seen has failed to work after more than 2 non-trivial changes - wtf is so hard about crontab management?).

I'm kind of curious whether anyone's tried to seriously deploy CloudFoundry and actually understand BOSH and stemcells without becoming a High Priest at Pivotal. From what I've heard, one company in China used it to go from nothing to a fully fleshed-out infrastructure of 10k+ nodes serving millions of users in a couple of months, and that already beats anything near that scale I've done with any CM tool I've used.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
You’ll probably want to use an ASG Lifecycle hook and write some scripts to do the validation like checking for last modification times.
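Something like this with boto3 - register a launch hook so new instances sit in Pending:Wait, then have the validation script report back once its checks (e.g. last-modification-time checks) pass. Group, hook, and instance names are placeholders:
code:
import boto3

autoscaling = boto3.client("autoscaling")

# One-time setup: pause new instances until validation says otherwise.
autoscaling.put_lifecycle_hook(
    LifecycleHookName="validate-before-inservice",
    AutoScalingGroupName="my-asg",                 # placeholder
    LifecycleTransition="autoscaling:EC2_INSTANCE_LAUNCHING",
    HeartbeatTimeout=900,
    DefaultResult="ABANDON",                        # fail closed if we never report back
)

def report_validation(instance_id: str, passed: bool) -> None:
    # Called by the validation script once its checks finish.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName="validate-before-inservice",
        AutoScalingGroupName="my-asg",
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE" if passed else "ABANDON",
    )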

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
There's always the thorny issue of needing an SCM plugin like the Github or Bitbucket one if you want to use Jenkins multi-branch pipelines. But after seeing how the Bitbucket plugin gradually causes socket leaks in the Jenkins master somehow (varying with branch scanning frequency and the average number of branches per repo, it appears), I'm inclined to believe that even those plugins are too much for a sane, scalable Jenkins server to handle without me rebooting or restarting it weekly and HA-ing it around with some horrific NFS-backed Jenkins home directory. It's kinda scary how my current place, with 1/4 the number of developers of my last place, easily quadruples the commit / build frequency and shows the warts of Jenkins master nodes far faster as a result.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Moving a monolithic Java application is typically a problem not because of Java, but because the typical company writing in Java is simply not going to be able to deploy anything remotely modern that fits the patterns Kubernetes really needs for you to be successful. I have tried the migration steps just to get applications somewhat stateless and monitored at about 15 different companies / customers now, and basically all of them were failures for cultural reasons rather than some technical reason keeping them stuck on 90s-style J2EE app servers. I've seen monolithic Django apps fail to move to Docker containers, similarly.

If it requires more than about 30% of the code to be changed, you are empirically better off completely rewriting the application. I really want to find that paper but I’ve seen it before and it was eye-opening just how hard it is to maintain and migrate software systems.

Go greenfield with K8S in such companies or don’t even try. I really mean it.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
I went from a place that had about 12 m4.2xlarge nodes in production running kops-based k8s to a shop that doesn't even have one application capable of running in Docker - but that's code inertia and letting devs do whatever they want for years (they'll write super stateful stuff that can't tolerate even a SIGTERM and fails to shut down cleanly by default). That cluster ran about 3 services - that's right, 3 - because we had 2 regions, 3 AZs, and each service took about 8 GB of RAM on average (welcome to 8 years of Groovy-powered enterprise shitware).

The hardest part of k8s in my experience isn't even deploying it or managing it; it's trying to get clunky old legacy software to take advantage of it via stateless-ish designs. Also, tactfully explaining why you shouldn't put MySQL or Postgres on K8S (sans CitusDB or similar) just to save money. Stateful software isn't impossible, but the risks are too high by default. I'm considering using StatefulSets for our current set of services that often take 15+ minutes to shut down, but finding the time is the issue when you're busy rolling clusters one node at a time by hand like it's 1999.
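If we do go the StatefulSet route for those slow-shutdown services, the important knob is the termination grace period - a rough sketch with the official Kubernetes Python client, names and numbers made up:
code:
from kubernetes import client, config

config.load_kube_config()   # or load_incluster_config() inside a cluster

sts = client.V1StatefulSet(
    metadata=client.V1ObjectMeta(name="slow-svc"),
    spec=client.V1StatefulSetSpec(
        service_name="slow-svc",
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "slow-svc"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "slow-svc"}),
            spec=client.V1PodSpec(
                # Give the app the 15+ minutes it actually needs to drain.
                termination_grace_period_seconds=1200,
                containers=[client.V1Container(
                    name="slow-svc",
                    image="registry.example.com/slow-svc:latest",  # placeholder image
                )],
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_stateful_set(namespace="default", body=sts)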

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
If they manage to keep K8S like they do CentOS v. RHEL it may work out ok. If CoreOS was going to go under anyway for one reason or another then RedHat probably would have taken over it anyway. This way, we get to watch the mass exodus typical of acquisitions and see the resulting distribution (although Illumos has been a bit frustrating to watch given the good stuff in there but so little market traction).

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
My issue with the SRE book is that it reads more like a long-form collection of stories and blog posts than a book with a clear narrative and guidelines. A lot of the material overlaps thematically in a way that feels redundant. Programming Pearls is similar in that way, but at least I didn't feel like I was getting a paragraph on fixing binary search integer overflow for the fifth time, the way the SRE book keeps circling monitoring subtopics. I still think Time Management for System Administrators is the more important read for today's engineers, honestly. We learn as engineers how to optimize programs for time and space efficiency, yet it amazes me to see people fail to spend even a thousandth of that effort on managing their own time, when that's probably the greatest limiting factor on your programming output in the end. I say this as someone so terrible at time management it's part of why I decided to never have children. Great goal setting and time management are among the traits of people with long-term career and life success, more so than knowing more ways to sort data structures than others.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
There is a great need for people who know what they're doing in most organizations, because most places just plain rot on the operations side over time - it's typically a cost center and leads to horrible attrition rates. One place I'm familiar with wound up trying to aggressively recruit kids just out of school to replace senior ops engineers who were leaving in droves, and even the contractors weren't biting.
