|
I ran Ceph a few shops ago, and can concur, it requires a ton of operational overhead to manage.
|
# ? Jul 1, 2019 23:25 |
|
What kind of problems did you guys encounter running Ceph?
|
# ? Jul 1, 2019 23:30 |
|
chutwig posted:Unless you have a team ready and waiting to support Ceph, I would recommend contacting your local NetApp VAR. trem_two posted:Yeah. Don't run ceph yourselves. Our ops team was given a mandate to run ceph. It was a bad time. Very very bad time. Gyshall posted:I ran Ceph a few shops ago, and can concur, it requires a ton of operational overhead to manage. Thanks guys, that's a pretty clear consensus and about as bad as I expected. Hadn't heard of NetApp before. What's their elevator pitch minus the bullshit? Their "hybrid cloud" product page sounds like you get a middleman's vendor lock-in in exchange for avoiding vendor lock-ins higher up the food chain.
|
# ? Jul 1, 2019 23:54 |
|
What's wrong with Ceph? I ran a very small cluster for kubernetes and had to screw around with it like once, and it was my own fault.
|
# ? Jul 2, 2019 00:08 |
|
Jesse Iceberg posted:What kind of problems did you guys encounter running Ceph? Methanar posted:What's wrong with Ceph? I ran a very small cluster for kubernetes and had to screw around with it like once, and it was my own fault. I think Ceph is a very interesting piece of software. I ran it at petabyte-scale for OpenStack and as object storage (though I left the team responsible before the object storage went full production). At large scales, you have to think about stuff that really doesn't matter in toy clusters, especially if the storage cluster is not completely homogeneous. Even small changes in the CRUSH map, like adding a few more OSDs, can trigger massive data migrations. You will also find out real fast who the small percentage of your user base is who completely kick the poo poo out of the system. It looks like there is experimental QoS support in Mimic now, which did not exist in the versions I used, but it's extremely susceptible to noisy neighbor issues still, and it's never going to be low-latency enough for a lot of applications. etcd on one of the Ceph clusters I used to admin takes about 3 orders of magnitude longer to fsync than on bare metal servers, and you definitely notice when your fsyncs take 250ms.
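To put a rough number on the "adding a few more OSDs" point above, here is a minimal sketch. It does not use CRUSH; it swaps in plain rendezvous hashing as a stand-in, because that is also a pseudo-random placement scheme and is easy to write down. The object names, the OSD names, and the 1 PB figure are made up for illustration, and real CRUSH adds weights, failure domains, and replicas on top of this.

```python
import hashlib
from typing import List, Sequence


def place(obj: str, osds: Sequence[str]) -> str:
    """Pick the OSD with the highest hash score for this object (rendezvous hashing)."""
    return max(osds, key=lambda osd: hashlib.sha256(f"{osd}:{obj}".encode()).digest())


def fraction_moved(objects: Sequence[str], before: Sequence[str], after: Sequence[str]) -> float:
    """Fraction of objects whose chosen OSD differs between the two OSD sets."""
    moved = sum(1 for o in objects if place(o, before) != place(o, after))
    return moved / len(objects)


# Hypothetical cluster: 10 OSDs, then two more are added.
objects: List[str] = [f"obj-{i}" for i in range(20000)]
before: List[str] = [f"osd.{i}" for i in range(10)]
after: List[str] = before + ["osd.10", "osd.11"]

frac = fraction_moved(objects, before, after)
print(f"{frac:.1%} of objects moved")
print(f"roughly {frac * 1000:.0f} TB rebalanced if the cluster holds 1 PB")
```

Under this scheme an object moves only when one of the new OSDs wins the score, so the expected fraction is about 2/12. That is close to the minimum any balanced placement can achieve, and at petabyte scale it is still a very large amount of data crossing the network.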
|
# ? Jul 2, 2019 01:17 |
|
chutwig posted:I think Ceph is a very interesting piece of software. I ran it at petabyte-scale for OpenStack and as object storage (though I left the team responsible before the object storage went full production). At large scales, you have to think about stuff that really doesn't matter in toy clusters, especially if the storage cluster is not completely homogeneous. Even small changes in the CRUSH map, like adding a few more OSDs, can trigger massive data migrations. You will also find out real fast who the small percentage of your user base is who completely kick the poo poo out of the system. It looks like there is experimental QoS support in Mimic now, which did not exist in the versions I used, but it's extremely susceptible to noisy neighbor issues still, and it's never going to be low-latency enough for a lot of applications. etcd on one of the Ceph clusters I used to admin takes about 3 orders of magnitude longer to fsync than on bare metal servers, and you definitely notice when your fsyncs take 250ms. Great post, and echoes my experience. My team was a team of 6 SREs plus myself, and we had problems with knowledge of distributed storage systems, as well as performance tuning. We inherited the implementation, which was built on Juju and some other crap, but the learning curve was still a lot for the team, and our efforts probably would have been better spent elsewhere. I love the idea of storage clustering in user land though. Ceph is a really impressive piece of tech.
|
# ? Jul 2, 2019 01:29 |
|
NihilCredo posted:Thanks guys, that's a pretty clear consensus and about as bad as I expected. NetApp's a storage vendor who's been around since about the time dinosaurs roamed the earth. Give them a call, or EMC, or Pure, or Nimble, or any one of the other storage vendors out there who will take no small amount of your dollars but will hopefully ease your support burden by giving you a platform that will probably do what you want it to do until you start pushing it hard, and then you'll find out what the phrase "WAFL inode exhaustion" means.
|
# ? Jul 2, 2019 01:53 |
|
NihilCredo posted:I'm looking into a storage abstraction for a product so it can run with no code changes in anything from a piddly under-the-table VM with a plain HDD to a MS/AWS/GC environment with that vendor's blob storage. Ideally, it would also handle a "I have a poor man's datacenter, N physical machines running orchestrated containers and some network storage (either a NAS or even each one with their HDD), please replicate my data as much as you are able without giving it to those icky cloud vendors" scenario, which I'm really hoping to avoid but cannot dismiss out-of-hand . I’m trying really hard to read your post in a way that doesn’t make my answer “the kernel’s file system layer” but I’m failing. Why would your app code need to know how the storage is provisioned?
|
# ? Jul 2, 2019 05:30 |
|
Funny Ceph story: My local vmware administrator didn’t want to pay for the license cost of VSAN and only had a boatload of localdisks (144 3TB drives across 4 esxi nodes). He recently came to me asking if running ceph instead was a good idea. It was hard to stop myself from asking if he was loving with me or not.
|
# ? Jul 2, 2019 08:37 |
|
chutwig posted:NetApp's a storage vendor who's been around since about the time dinosaurs roamed the earth. Give them a call, or EMC, or Pure, or Nimble, or any one of the other storage vendors out there who will take no small amount of your dollars but will hopefully ease your support burden by giving you a platform that will probably do what you want it to do until you start pushing it hard, and then you'll find out what the phrase "WAFL inode exhaustion" means. I suppose it has nothing to do with delicious waffles? Kevin Mitnick P.E. posted:I’m trying really hard to read your post in a way that doesn’t make my answer “the kernel’s file system layer” but I’m failing. Why would your app code need to know how the storage is provisioned? It is my understanding that mounting S3 or Azure Blob Storage as a filesystem is possible but neither simple nor recommended, compared to using their proprietary REST APIs.
|
# ? Jul 2, 2019 09:31 |
|
On the other hand, for use cases where your needs are pretty fixed, you aren't making changes to the storage environment often, and you don't have any particularly stringent performance requirements for your workloads, Ceph is pretty low-touch. Use cloud as a cornerstone of your scaling strategy, but there's way worse ways to configure nearline disks you already own.
Vulture Culture fucked around with this message at 19:53 on Jul 2, 2019 |
# ? Jul 2, 2019 15:19 |
|
NihilCredo posted:I suppose it has nothing to do with delicious waffles? It’s not really a worse idea than trying to pretend a single HDD is S3
|
# ? Jul 2, 2019 15:36 |
|
Vulture Culture posted:On the other hand, for use cases where your needs are pretty fixed, you aren't making changes to the storage environment often, and you don't have any particularly stringent performance requirements for your workloads, Ceph is pretty low-touch. Use cloud as a cornerstone of your scaling strategy, but there's way worse ways to configure nearline disks you already own. I always caution people about Ceph not because I don't trust it, but because I've seen boredom-driven development result in too many instances of people thinking the toy Ceph or Kubernetes cluster they have on their MBP is going to scale up to the demands of production workloads and that it's always going to work fine forever. I think it's a wonderful piece of technology, but nobody wants to be caught in a situation where their Ceph cluster has gone down at 2 AM and you have no idea what to do to fix a corrupt monmap and the CTO is mumbling through an Ambien-induced stupor on the conference bridge. Distributed systems are hard to admin and debug, storage systems are very difficult to dislodge once in place, and combining the two can result in the perfect storm of Ultimate Fuckery. I think most people in here know this, and it's what I routinely try to impress upon my bright-eyed juniors fresh out of college.
|
# ? Jul 2, 2019 23:42 |
|
So I recently got hired into a devops position at a big corporation which is cool but it was a bit of a surprise because I interviewed for a basic entry level software developer position where they got all excited that I knew what "agile development" and "sprints" were and gave no indication that I was going into devops until right before I started and I basically knew nothing other than the general concept going into it. So I'm looking for any good websites/books that are good for learning core devops stuff, ranging from basic overviews to very technical and detailed. Probably stuff that isn't specific to a particular tool since I can just look up documentation for those easily enough. Also if there're any good devops related tech blog sort of things to follow. I just want a better understanding of the whole picture, the tool specific stuff I feel like I can just learn on the job. I saw this Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation book mentioned very early on in the thread but that was from some years ago so I don't know if there's something more recent. Please give me any recommendations or advice.
|
# ? Jul 4, 2019 00:06 |
|
Can you talk about your actual job duties and what technologies you’re expected to use day to day? “DevOps” has come to be such a broad term that it’s meaningless out of context.
|
# ? Jul 4, 2019 00:13 |
|
Turbl posted:So I recently got hired into a devops position at a big corporation which is cool but it was a bit of a surprise because I interviewed for a basic entry level software developer position where they got all excited that I knew what "agile development" and "sprints" were and gave no indication that I was going into devops until right before I started and I basically knew nothing other than the general concept going into it. So I'm looking for any good websites/books that are good for learning core devops stuff, ranging from basic overviews to very technical and detailed. Probably stuff that isn't specific to a particular tool since I can just look up documentation for those easily enough. Also if there're any good devops related tech blog sort of things to follow. I just want a better understanding of the whole picture, the tool specific stuff I feel like I can just learn on the job. Both you and your new employer are looking at devops wrong. It's not a job role, a particular set of tools, or a project management methodology. Devops is a cultural shift within an organization. Having a "devops position" almost always means that the company doesn't actually understand devops and is going to just have an additional devops silo beyond the usual development, operations, and QA silos. If you want to really understand devops, go read about the devops culture shift and practices that make devops organizations successful. If the response to "a release is failing" is "that's Turbl's problem, they're the devops person", then your company is not doing devops.
|
# ? Jul 4, 2019 00:21 |
|
Docjowles posted:Can you talk about your actual job duties and what technologies you’re expected to use day to day? “DevOps” has come to be such a broad term that it’s meaningless out of context. There's a team of us and (some) people within the team "own" a particular tool and everyone on the team learns and manages the tools and works with the app teams to help teach them the CI/CD tools and support them when needed. Tools we own include version control tools (currently moving all app teams off subversion to bitbucket or TFS git), jenkins, ca ra, artifactory, TFS, upsource, and a few code scanning tools. There may be some other smaller tools I don't even know about yet. Also using docker and starting to use kubernetes. We're basically trying to get all applications working with automated deployment and improving the process. There are over 500 applications on ca ra right now and thousands of active jobs in jenkins so it's a lot. I know they're trying to get applications in line with the 12 factor methodology. If it sounds like I don't know what I'm talking about it's because I don't really yet. Day-to-day basically seems like working on various projects (the subversion migration, kubernetes stuff that I don't know a lot about yet, streamlining processes), meeting with app teams, giving talks about newly implemented things, troubleshooting problems with any of our tools, and performing tool updates. Turbl fucked around with this message at 01:00 on Jul 4, 2019 |
# ? Jul 4, 2019 00:53 |
|
New Yorp New Yorp posted:Both you and your new employer are looking at devops wrong. It's not a job role, a particular set of tools, or a project management methodology. Devops is a cultural shift within an organization. Having a "devops position" almost always means that the company doesn't actually understand devops and is going to just have an additional devops silo beyond the usual development, operations, and QA silos. If you want to really understand devops, go read about the devops culture shift and practices that make devops organizations successful. Or using the term devops wrong. Company I'm at now actually does devops, teams deploy and operate stuff they own, devs are in the oncall rotation, etc. But there's all the stuff used by everyone and owned by no one--CI server, deploy tooling, the basic EC2/k8s infra, glue stuff to make our SSO vendor integrate with things that don't support SAML, etc. Devops team owns all of it, though we could be called Infra or Commons. As you might imagine, it's sometimes unclear whether something falls into our jurisdiction or IT's.
|
# ? Jul 4, 2019 03:35 |
|
I’m the overlord of our cloud. The gatekeeper. The automation man. I am the devop.
|
# ? Jul 4, 2019 03:57 |
|
Turbl posted:the subversion migration Yikes. Good luck sir, I just got done doing a similar migration, you will learn a lot about many things
|
# ? Jul 4, 2019 05:09 |
|
JHVH-1 posted:I’m the overlord of our cloud. The gatekeeper. The automation man. I am the devop. The job title cloud wrangler confuses vendors enough to make them leave you alone
|
# ? Jul 4, 2019 14:32 |
|
JHVH-1 posted:I’m the overlord of our cloud. The gatekeeper. The automation man. I am the devop.
|
# ? Jul 4, 2019 15:22 |
|
I told the interns this year that my job is "computer janitor". I helpfully explained that I "Janitor the computers, you know, tidy up the cloud"
|
# ? Jul 4, 2019 16:00 |
|
New Yorp New Yorp posted:If the response to "a release is failing" is "that's Turbl's problem, they're the devops person", then your company is not doing devops. Very much this. I'm our "DevOps Lead" with a few engineers, but our role is to get everyone to adopt the mindset through tooling, process, and culture changes. The aim is to shrink & eliminate the team as the capability & autonomy of the wider IT department grows. Cancelbot fucked around with this message at 20:21 on Jul 4, 2019 |
# ? Jul 4, 2019 20:16 |
|
Cancelbot posted:Very much this. I'm our "DevOps Lead" with a few engineers, but our role is to get everyone to adopt the mindset through tooling, process, and culture changes. The aim is to shrink & eliminate the team as the capability & autonomy of the wider IT department grows. Same. Our front end devs acted really surprised when I started to ask them what they had tried to do to solve their failing builds. I’m currently building a team that will improve the release automation as well as educate and help all developers to adopt the mindset of ‘you ship it, you run it’. We’ve got the shipping part down, the running it mindset is slowly getting there. Too bad C level fired the 3 project teams that were actually doing this and helping all new teams to do the same. Reason: they have some major bonuses coming up after finalizing a major acquisition and need to keep the costs in check. This basically means the rest of the year they’ll Thanos snap half of all projects and no budgets can be changed anymore.
|
# ? Jul 4, 2019 21:19 |
|
I'm running a Node App on CENTOS through cpanel with A2 Hosting using the Node App Setup thingy and I've gotten it all installed but when I actually go to run the Node start script I'm getting a weird memory error:
Run JS script
returncode: 244
stdout:
stderr:
npm ERR! path /home/coreaho2/app
npm ERR! code ENOMEM
npm ERR! errno -12
npm ERR! syscall scandir
npm ERR! ENOMEM: not enough memory, scandir '/home/coreaho2/app'
glob error { [Error: ENOMEM: not enough memory, scandir '/home/coreaho2/app'] errno: -12, code: 'ENOMEM', syscall: 'scandir', path: '/home/coreaho2/app' }
npm ERR! A complete log of this run can be found in:
npm ERR! /home/coreaho2/.npm/_logs/2019-07-04T21_36_59_706Z-debug.log
The thing is if I actually telnet into it and run the command manually it works fine? Sorry for lack of additional info, phone posting.
|
# ? Jul 4, 2019 22:43 |
|
i tell people i "do all the cloud poo poo"
|
# ? Jul 5, 2019 21:07 |
|
Ape Fist posted:I'm running a Node App on CENTOS through cpanel with A2 Hosting using the Node App Setup thingy and I've gotten it all installed but when I actually go to run the Node start script I'm getting a weird memory error: i won’t ask why you are using cpanel, but i suspect your problem is specific to that. it’s real bad, and it’s about to get a lot more expensive
|
# ? Jul 7, 2019 06:00 |
|
Ape Fist posted:The thing is if I actually telnet into it and run the command manually it works fine? Please tell me you don’t actually have TELNET enabled on a server in tyool 2019....
|
# ? Jul 7, 2019 07:08 |
|
LochNessMonster posted:Please tell me you don’t actually have TELNET enabled on a server in tyool 2019.... It's a secure SSH connection via PuTTY. Any time I remote into something via CLI it's Telnet to me, sorry. I'm just a developer, I'm too stupid to know all the other words.
|
# ? Jul 7, 2019 08:02 |
|
i would search for the error on the cpanel forums, or go source diving to see how cpanel is starting your node app. there’s gotta be some cgroup, quota, limit or something. or maybe it’s using a different nodejs install from when you get on the CLI without cpanel. btw i’m curious now, and i wanna know why you have to use cpanel
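One way to check the "some limit or something" theory is to dump the per-process resource limits from both environments and compare them. The sketch below is only that, a sketch: it uses Python's standard `resource` module, and the assumption is that you can get it to run both from the SSH shell and from whatever environment the panel launches processes in. It only covers rlimits; cgroup or hosting-panel memory caps would not show up here.

```python
import resource

# Limits that commonly differ between an interactive shell and a managed launcher.
LIMIT_NAMES = ["RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_NPROC", "RLIMIT_NOFILE"]


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for each limit this platform exposes."""
    limits = {}
    for name in LIMIT_NAMES:
        const = getattr(resource, name, None)
        if const is None:
            continue  # not available on this platform
        soft, hard = resource.getrlimit(const)
        limits[name] = (fmt(soft), fmt(hard))
    return limits


for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={soft} hard={hard}")
```

If the two runs print different values, especially a small address-space limit on the launcher side, that would at least be consistent with ENOMEM appearing only there. On Linux the same numbers are also visible in /proc/self/limits.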
|
# ? Jul 7, 2019 14:38 |
|
Helianthus Annuus posted:i would search for the error on the cpanel forums, or go source diving to see how cpanel is starting your node app I wasn't using it at all until I realised I can't just start a node process and detach it or run nohup or forever or something without the provider eventually just terminating the process after a few hours of inactivity. It's a part of their (lovely) policy for some reason so they just strong arm you into using the really bad cpanel node app setup which is unintuitive and clearly doesn't work very well. I'm probably not going to stick around with these guys (A2 Hosting) but they were cheap and had a flat price cap.
|
# ? Jul 7, 2019 14:51 |
|
yikes you’re in a bad way. i wanna try to help you out. pm sent
|
# ? Jul 7, 2019 21:03 |
|
NihilCredo posted:I'm looking into a storage abstraction for a product so it can run with no code changes in anything from a piddly under-the-table VM with a plain HDD to a MS/AWS/GC environment with that vendor's blob storage. Ideally, it would also handle a "I have a poor man's datacenter, N physical machines running orchestrated containers and some network storage (either a NAS or even each one with their HDD), please replicate my data as much as you are able without giving it to those icky cloud vendors" scenario, which I'm really hoping to avoid but cannot dismiss out-of-hand . The S3 API is pretty much supported everywhere it seems. You could take a look at Gluster.
|
# ? Jul 8, 2019 04:55 |
|
Mr Shiny Pants posted:The S3 API is pretty much supported everywhere it seems. You could take a look at Gluster. Try that and your system will end up not working with real s3 unless the emulator vigorously enforces metadata inconsistency. S3 provides fewer guarantees than almost any storage system and stale reads are common
|
# ? Jul 8, 2019 06:13 |
|
Nomnom Cookie posted:Try that and your system will end up not working with real s3 unless the emulator vigorously enforces metadata inconsistency. S3 provides fewer guarantees than almost any storage system and stale reads are common So if you write it to the S3 spec it can only run better anywhere else?
|
# ? Jul 8, 2019 08:28 |
|
Helianthus Annuus posted:yikes you’re in a bad way. i wanna try to help you out when she say she gonna dm u but she don't
|
# ? Jul 8, 2019 08:36 |
|
Mr Shiny Pants posted:So if you write it to the S3 spec it can only run better anywhere else? That's the problem, yeah. If your design is supporting a bunch of different backends all accessed through an S3 API, what are dev and test going to use? Most likely an S3 emulator written in nodejs that stores everything in /tmp. Customers hook the product up to S3 and then come to you with weird-looking failures, and dev has a real good time trying to diagnose. Once someone finally digs through the docs enough to find the page on all the ways S3 will screw with you for funsies, it's a major ops effort to build out an integration environment that's even capable of reproducing. We already have something that abstracts over local disk, network disk, and cloud disk--it's called a filesystem. The differences between those backends are already great enough that the abstraction has worrying leaks. Don't make it worse by trying to add a blob store option.
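The dev/test gap described above can be made concrete with a toy emulator that misbehaves on purpose. This is a minimal sketch, not any real S3 emulator or SDK: the class name `StaleReadStore`, its `stale_reads` parameter, and the key name are all made up, and the only behaviour it models is "reads right after an overwrite may return the old value".

```python
class StaleReadStore:
    """In-memory blob store that serves the previous value for a few reads after an overwrite."""

    def __init__(self, stale_reads=2):
        self.stale_reads = stale_reads  # how many reads stay stale after an overwrite
        self._current = {}
        self._previous = {}
        self._stale_left = {}

    def put(self, key, data):
        if key in self._current:
            # Overwrite: remember the old value and start serving it for a while.
            self._previous[key] = self._current[key]
            self._stale_left[key] = self.stale_reads
        self._current[key] = data

    def get(self, key):
        if self._stale_left.get(key, 0) > 0:
            self._stale_left[key] -= 1
            return self._previous[key]
        return self._current[key]


store = StaleReadStore(stale_reads=2)
store.put("config.json", b"v1")
store.put("config.json", b"v2")  # overwrite
reads = [store.get("config.json") for _ in range(3)]
print(reads)  # [b'v1', b'v1', b'v2']
```

Code that assumes read-your-writes passes against a well-behaved local emulator and fails against a store like this one, which is the kind of failure you would rather hit in a test environment than at a customer.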
|
# ? Jul 8, 2019 16:34 |
|
If it's AWS + Linux, then surely the answer is EFS? It literally is a filesystem, i'm not great at linux but you just mount EFS on all the machines you want. https://aws.amazon.com/efs/ https://docs.aws.amazon.com/efs/latest/ug/efs-onpremises.html <-- If your poo poo is at your office. Even has nice things like lifecycle management, backups (with DataSync), and encryption at rest. You can use DataSync to huck it straight to S3. Windows can use either Storage Gateway or FSx. If you really want this to span multi-cloud then it's going to be very difficult no matter which way you slice it. Cancelbot fucked around with this message at 12:53 on Jul 9, 2019 |
# ? Jul 9, 2019 12:49 |
|
Turbl posted:I saw this Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation book mentioned very early on in the thread but that was from some years ago so I don't know if there's something more recent. Please give me any recommendations or advice. I feel like that's probably still a solid recommendation. I'd also be interested if anyone else had other thoughts. I read it in bursts over a long stretch of time, so I don't remember all the details super well. I want to say it's a bit more general in that it deals with practices/process than specific tools. That's an advantage in that the advice probably stays relevant longer and it should be broadly applicable. On the other hand, it may not help much if you were starting from scratch trying to figure out which tools you should use.
|
# ? Jul 9, 2019 16:27 |