Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
one of my previous jobs was basically working on the monitoring project for a large non-tech company and it gave me like PTSD from the horrendous politics of it

i'm strangely getting involved in it again at my current workplace but this is a much more well-scoped case

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
i could probably do some kind of experience based long-post on it because it sucked rear end

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
so basically this was an early job for me, so it's a good example of being young and naive and letting a company screw you over pretty hard, and of how bad projects can be in these big organizations

this is like mid-late 2000s. when I came on they had a couple applications they used for monitoring -- one was a licensed solution used to keep an eye on web applications by hitting pages, which was becoming increasingly irrelevant because it was an old product and we were entering the age of javascript hell it had no idea how to deal with (much less some of the Apache Wicket hell that floated around for a while)
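
to give an idea of what that licensed tool was doing under the hood, here's a rough sketch of a synthetic page check in java -- the url and timeout are stand-ins i made up, and the real product was config-driven rather than hand-rolled code like this:

code:
import java.net.HttpURLConnection;
import java.net.URL;

public class PageCheck {
    // hypothetical target and threshold, not the real product's config
    static final String TARGET = "https://example.com/login";
    static final int TIMEOUT_MS = 10_000;

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        HttpURLConnection conn = (HttpURLConnection) new URL(TARGET).openConnection();
        conn.setConnectTimeout(TIMEOUT_MS);
        conn.setReadTimeout(TIMEOUT_MS);
        int status = conn.getResponseCode();   // actually fetches the page
        long elapsed = System.currentTimeMillis() - start;
        // it only ever saw the raw html, so a js-heavy page returning a 200 shell still looked "up"
        boolean ok = status == 200 && elapsed < TIMEOUT_MS;
        System.out.printf("%s status=%d time=%dms ok=%b%n", TARGET, status, elapsed, ok);
    }
}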

the other was a home-grown java based application that had been written some years earlier by members of an "architecture group" that was later broken apart. It was a somewhat more "flexible" solution, with a lot of different pieces of functionality to poll things -- server metrics, linux daemon statuses, various types of message queues, some custom log scraping, our in-house service API, etc etc -- and shove all this time-series oriented information into a standard database. it wasn't organized in such a horrible fashion that it was too terrible to comprehend and maintain, though it did pull in some weird libraries for unknown reasons, and it had some design decisions that were obviously intended to get ahead of future scalability issues but were probably flawed and wouldn't really have worked.

it also had a cold-fusion based front end for some reason, likely because there were a number of cold-fusion apps in the company and the person that developed that portion was just going with what they knew. we actually rarely used that front end because it didn't work very well as the number of things being monitored grew -- instead we used some other swing-based application that could better deal with things
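
for flavor, here's a stripped-down sketch of the polling pattern that java app was built around -- the table layout, jdbc url, host name, and the load-average metric are all stand-ins i'm making up, the real thing had a pile of different collectors feeding the same kind of table:

code:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MetricPoller {
    // hypothetical table: metrics(host, metric_name, metric_value, collected_at)
    static final String INSERT_SQL =
        "INSERT INTO metrics (host, metric_name, metric_value, collected_at) VALUES (?, ?, ?, ?)";

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // poll once a minute, same cadence whether the source is a daemon status,
        // a queue depth, or a scraped log counter
        scheduler.scheduleAtFixedRate(MetricPoller::pollOnce, 0, 60, TimeUnit.SECONDS);
    }

    static void pollOnce() {
        double load = osLoadAverage();   // stand-in for whatever is being polled
        // "jdbc:..." is a placeholder, not the actual connection string
        try (Connection conn = DriverManager.getConnection("jdbc:...", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, "app-server-01");
            ps.setString(2, "load_average_1m");
            ps.setDouble(3, load);
            ps.setTimestamp(4, Timestamp.from(Instant.now()));
            ps.executeUpdate();           // one row per sample, time-series by convention
        } catch (Exception e) {
            e.printStackTrace();          // collector failures were themselves something to alert on
        }
    }

    static double osLoadAverage() {
        return java.lang.management.ManagementFactory
                .getOperatingSystemMXBean().getSystemLoadAverage();
    }
}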

i should also point out that operations folks and developers didn't really have much access to either one of these systems; in fact, trying to remember it now, i'm not sure if they could view the websites that sat in front of them at all -- the operational people might have just been able to see information from alerts that appeared in a legacy console they used, plus some of those lovely operations status displays sprinkled around the building that CIOs love, and some daily reporting summary pages. the reasoning for any of this was that management didn't "trust" any of them to handle these things

as you might have picked up, this place was kind of a technological mess -- there was a legacy mainframe system that had served as the core platform for a lot of the company's core operations, as well as some of the other business operations, for many decades; there were also a large number of linux servers running various distros (which would eventually become a large number of virtualized linux servers), windows servers, hell there were even some tandems sitting around iirc, though I don't think I ever had to deal with those, and along with all these different systems came a multitude of application stacks. Some applications would be developed in-house, others would be contracted out, and in many cases a lot of work was being done by different organizations within the company outside of what would be considered the main IT department -- like marketing or accounting or something might create their own IT group or contract something out thanks to the terrible power dynamics of corporate america

batch database jobs, mainframe terminals, web applications, native applications, etc

the scope of these applications might just be some boring back office accounting and reporting stuff, or they might be something more critical or even safety-impacting to the field operations of the company -- so also all over the place

either way, the (small) group i was in had somehow become responsible for covering it all -- whatever had been covered up to that point, plus trying to expand that coverage; we served as both the developers and the support people, which was a huge PITA because it meant dealing with an external-facing ticketing system where we had to perform tons of work for other teams (since they weren't allowed to do any of it themselves) on top of our own development work

so that kind of lays out the situation i was in

Arcteryx Anarchist fucked around with this message at 21:25 on Feb 26, 2019

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
so year one for me was basically spent learning how to maintain the in-house monitoring system. i was the only real employee on this team outside of the manager; there were 4 other people, 3 of whom were contractors and 1 an intern. i've kind of mentioned it before, but this company's use of on- and offshore contractors was pretty widespread, and looking back i feel the way they were engaging in it was likely against labor regs, but what do I know, im not a lawyer

2 of the contractors had been on the project for a little while so they knew a lot more about these systems than I did (none of the people on the project were around when the systems were implemented) -- later both of them would not have their contracts renewed as the company went to "exclusive" contracting with a big firm, and they wouldn't exactly be replaced either; they were kind of backfilled by 2 offshore contractors that were more or less worthless: they worked completely different hours so they couldn't really handle any of the support ticket work without terrible turnaround times, and they weren't that great at the development work either. im pretty sure the entire reason they were brought on was for the manager's career development, allowing them to claim to have managed offshore resources on a project.

there was also a newer version of the in-house monitoring application that was being developed by a person that had left the company, whose position I basically back-filled (though probably at a lower title and pay rate). in true corporate fashion it was basically the same application but completely re-written, with a few changes to libraries and other bits mostly to address things that developer didn't like about the existing application, but also a few new features, including a more modern UI that could actually be used to manage the application. i also spent a good portion of my time trying to complete this new version, though in the end it would take a few years to finally release, after a change in management, a complete re-location within the org chart, and pushing back on a fair amount of feature creep.

there would of course be a third re-write of the application because all of the scaling flaws were still in it, but i didn't stick around long enough to see that one completed

thanks to this also being around the time of the financial collapse, my compensation for handling this bumpy first year, keeping my head above water, and keeping everything from falling apart was a 1% or 2% raise and I think a $1000 pre-tax bonus, and this was basically the story for a couple years until my management changed and I got a few title bumps and better comp bumps and bonuses

i think i had more or less been the victim of the career building of the previous management in these early days, as they had gone on to get an MBA and moved from being the head of my small group to maybe a director (basically one step below AVP) by the time I had left; they were on a surface level a very nice person which just kind of underlines how systemically evil this stuff can be

Arcteryx Anarchist fucked around with this message at 21:46 on Feb 26, 2019

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
now for some of the horrendous politics of this stuff

since my group had to do most of the work of managing all this monitoring in addition to the development of the systems to perform it, i got tangled up in the office politics of it all, and monitoring is easily one of those things that goes from being the most important thing in the world all the way up to the CIO level one day, to being derided as a meaningless cost center the next

problem 1: SLAs

so the organization had picked up on using the concept of SLAs for various key applications, to help keep the people running those applications accountable and to communicate how important those applications were; the operations folks were in charge of keeping track of whether those SLAs were being met, and my team basically supplied the data to provide insight into that

of course, this also meant that various people likely had meeting those SLAs as part of their performance evaluation, which really just creates an incentive to weasel your way out of anything that makes it look like you might not be meeting your objectives, mostly through blame shifting -- and if another application or some part of the application stack couldn't be blamed, then the monitoring itself would become the target, because in their opinion we had obviously assigned blame incorrectly or reported something inaccurately. now, im not going to say they were entirely wrong in some cases with this blame shifting, their terrible motivations aside. like many corporate application developers, all they really cared about was hitting shiny feature requirements while the underlying application might be a tech-debted, rotting, unstable mess -- but a mess they hoped they could shift responsibility for onto others
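
to make the SLA arithmetic concrete, the reports basically boiled down to something like this (all the numbers and the 99.9% target below are made up for illustration, not what this company actually used):

code:
import java.time.Duration;
import java.util.List;

public class SlaReport {
    public static void main(String[] args) {
        // hypothetical month of data for one application
        Duration period = Duration.ofDays(30);
        List<Duration> outages = List.of(
            Duration.ofMinutes(42),    // unplanned outage
            Duration.ofMinutes(15));   // another incident
        Duration plannedBlackouts = Duration.ofHours(2);  // deployment windows excluded by agreement

        long totalMin = period.toMinutes() - plannedBlackouts.toMinutes();
        long downMin = outages.stream().mapToLong(Duration::toMinutes).sum();
        double availability = 100.0 * (totalMin - downMin) / totalMin;

        // 99.9% over 30 days allows roughly 43 minutes of unplanned downtime,
        // which is why a single bad night turned into a blame-shifting meeting
        System.out.printf("availability: %.3f%% (target 99.9%%) met=%b%n",
                availability, availability >= 99.9);
    }
}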

sometimes that responsibility shifting might hit the level where they almost wanted our monitoring application to more or less become a reliability wizard for their own rotten application, somehow keeping it propped up through all failure modes, but thankfully this is one bit of scope creep we were able to repeatedly and successfully push back on

but there were plenty of meetings that were basically some manager or other application person trying to bully operations people or us into "correcting" their reported SLA metrics (which they sadly often more or less won)

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
problem 2: dumb-ops

it's not that the people in the operations center were actually dumb, it's more that the company was kind of cheap and pretty much tied their hands; since the IT operations center is a 24-hour, 3-shift outfit, you can probably reduce some of your costs by reducing their responsibilities so you can argue for compensating them less

they pretty much had all their visibility into systems in the form of a legacy console that had likely started being used in the late mainframe days, plus some of those reporting dashboards developed by our group, and they were basically instructed to look into things when something looked "bad" according to those

they had to troubleshoot things based on documentation (in some old lotus document database no less) that was of varying quality, and if they couldn't seem to resolve the issue, they would move on to calling developers; that's right, as a developer in this organization you had the perk of basically being expected to do 24/7/365 on-call support. you might be able to push off getting called until business hours if you could effectively argue that your app wasn't that important, but generally this didn't seem to happen because most managers can easily reason themselves into thinking any application is totally business critical. most teams were at least pretty good at rotating this responsibility among themselves, though -- unless you were on my tiny team, where at one point it was basically rotated between me and the only other developer actually located in-office on a bi-weekly or maybe at one point monthly basis

of course, as the monitoring person, i would get called a lot

i had to be on the deployment calls, because of course application teams don't want that downtime to count towards their SLAs and operations doesn't want to see alerts for applications that are "obviously" down, and since management didn't trust this stuff to be managed by anyone but us, we had to set up the blackouts for those windows ourselves
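
conceptually the blackout handling was dead simple, something like this sketch (the app name and the manual scheduling calls are my own stand-ins, the real thing hung off the monitoring app's config and the deployment schedule):

code:
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class BlackoutWindows {
    // one blackout entry: suppress alerts for one app between start and end
    record Blackout(String app, Instant start, Instant end) {
        boolean covers(String candidateApp, Instant when) {
            return app.equals(candidateApp) && !when.isBefore(start) && !when.isAfter(end);
        }
    }

    private final List<Blackout> active = new ArrayList<>();

    // entered by hand off the deployment call, since nobody else was trusted to do it
    void schedule(String app, Instant start, Instant end) {
        active.add(new Blackout(app, start, end));
    }

    // checked right before an alert would be raised or downtime counted toward an SLA
    boolean suppressed(String app, Instant when) {
        return active.stream().anyMatch(b -> b.covers(app, when));
    }

    public static void main(String[] args) {
        BlackoutWindows windows = new BlackoutWindows();
        Instant now = Instant.now();
        windows.schedule("claims-portal", now, now.plusSeconds(3600));  // hypothetical app name
        System.out.println(windows.suppressed("claims-portal", now.plusSeconds(60)));  // true
        System.out.println(windows.suppressed("billing-batch", now.plusSeconds(60)));  // false
    }
}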

and of course when a developer gets called at 2AM for a supposed problem, with no compensation for it, and after looking into things feels it's not a "real" issue, they want to blame-shift the issue to us -- so then I get called too, to either turn something off or tweak something else

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
problem 3: resources

of course, for an application that would bounce between being the most important thing and being meaningless, the resources we were given were often third-rate

i was some green developer that ended up being in charge of an application with performance and scalability needs that were quickly growing, and I had little to no idea what I was doing there; the other people that might work on it were basically offshore or on-shore contractors with backgrounds purely in the exact kind of unstable, rotten business applications we were trying to monitor, and the team was always pretty small, especially considering the size of the rest of the organization

we were eventually shoved onto the same virtualized machines they were trying to shove most of the other applications onto, and we had to share a database instance -- basically the monitoring was living in the same infrastructure space as the applications and infrastructure we were expected to monitor

one amusing anecdote from this was when there was a mysterious database throughput issue in the organization that impacted even us, and we were brought in to discuss it because we were expected to have some metrics on what might be happening despite being impacted ourselves (i think it ended up being some massive network and/or other backplane congestion from backups)

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
another fun anecdote: a kind of career recognition theft

so one particularly high-value, operations-related, large organizational effort decided to develop their own kind of monitoring front end: a super fancy set of dashboards and navigation tools to help provide visibility into all kinds of things about their sprawling mess of applications; the operations center was going to be expected to use this as their source of record for application status in that particular area

they developed this whole thing with a team of their own that was easily 10 times larger than mine, probably more

and in the end it had no system to provide the data to back it -- it was basically just a fancy UI and I think a graph database to help organize the displayed information

they then of course expected us to magically integrate with it and provide the exact kind of data they wanted to back it, and used their much heavier organizational hitting power to bully us into it; it was a nightmare, I don't think it ever ended up getting finished, and it was probably a total waste of money

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
this was all in a place where the CIO regularly told the entire IT organization it was a Cost Center to curtail any demands that might come from it

i swear, if i could go back today i would love to tell him to his face: if it's such a cost center, I have a great idea to save the company tons of money -- I'll walk down to the datacenter right now and pull the switch, then we can work on making back some money surplussing all that equipment. hell, we might even make a little money on the mainframe parts, plenty of other places are still using that stuff and spares are probably hard to come by

this man made millions of dollars

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
anyways I think that's it for now

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe

uncurable mlady posted:

lol this owns in a really lovely way

what's also fun about this is they planned to sell/license this as part of the larger system they were building, in a plan to kind of be the SAP of that particular space

i kind of wonder if they still think that's happening, or if it's more of what I always suspected -- a "good idea" that has toxic effects on how the applications are written, and in the end it's not even possible to sell and there's likely no market for it anyway

it was the kind of place that liked to promote from within, which might be seen as nice because it makes you feel like you can have a career there if you can play the politics and everything, but at the same time i feel like they had a serious blind spot about how competent they really were as an organization, since they rarely had much true outside perspective

like some of the "principal engineers/architects/whatever" might have been completely incompetent but given a huge amount of power, and they had no real idea of their incompetence because their entire career had been inside an organization without any real external check of competency, and now they had a big title and power and probably a fat salary so they felt on top of things

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
i didn't know adam posted in yospos

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
tbh a lot of what software people consider magical scaling wizardry is a nightmare, and I'm convinced the people bringing it in either flee before the consequences hit or have never used it beyond toy projects

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
like spark had a super broken memory model for quite a while, and lots of the Hadoop stack is brittle and needs a lot of babysitting

like the noteworthy part of this stuff is that it makes some things feasible, but it isn't “good”
