bull3964
Nov 18, 2000

DO YOU HEAR THAT? THAT'S THE SOUND OF ME PATTING MYSELF ON THE BACK.


So, I have a head scratcher.

I have two 2012 R2 DFS servers. The DFS volume is about 1.5 TB and they are doing replication between them.

We're retiring storage, so I was doing a storage vMotion from our old EqualLogic storage to our Pure Storage.

Both servers managed to corrupt the MFT and NTFS indexing upon completion of the vMotion, requiring the server to be taken offline for a chkdsk (which completed successfully in both cases).

No errors were reported during the storage vMotion and we've never seen anything like this before on any other machine. It also only corrupted the DFSR volume and not the OS volume. The host that was doing the storage vMotion was on 6.0.

The very first indication of anything wrong on either server was an ESENT error on the DFSR database: "failed verification due to a lost flush detection timestamp mismatch".

That seems to indicate that some writes the OS may have cached never got committed. The storage subsystem has been rock solid, though. There was nothing in any logs around that time, and these two separate incidents happened 2 weeks apart.

I have a feeling if I turn off OS write caching on the DFSR volume, it will prevent something similar from happening in the future, but I would like to understand how this could have happened in the first place.
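
For anyone who wants to chase a similar timeline, here's a rough sketch of how those events could be pulled programmatically instead of clicking through Event Viewer. It's Python shelling out to wevtutil; the channels, provider names, and the 20-event window are assumptions to adjust, not a tested recipe.

code:

# Rough sketch: pull recent ESENT and Ntfs events so their timestamps can be
# lined up against the svMotion completion time. Channels, provider names,
# and the event count are illustrative assumptions.
import subprocess

QUERIES = {
    # ESENT (the DFSR database engine) logs to the Application channel
    "Application": "*[System[Provider[@Name='ESENT']]]",
    # NTFS errors/warnings land in the System channel
    "System": "*[System[Provider[@Name='Ntfs']]]",
}

for channel, xpath in QUERIES.items():
    print(f"=== {channel} ===")
    result = subprocess.run(
        ["wevtutil", "qe", channel,
         "/q:" + xpath,   # XPath filter on the event provider
         "/c:20",         # last 20 matching events
         "/rd:true",      # newest first
         "/f:text"],      # human-readable output
        capture_output=True, text=True, check=False,
    )
    print(result.stdout or result.stderr)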


Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Have you verified that this is actually related to Storage vMotion, and not a recent Windows update or something? Since you're at 100% already, it should be easy enough to reproduce by svMotioning a clean VM along the same path.

bull3964
Nov 18, 2000

DO YOU HEAR THAT? THAT'S THE SOUND OF ME PATTING MYSELF ON THE BACK.


Vulture Culture posted:

Have you verified that this is actually related to Storage vMotion, and not a recent Windows update or something? Since you're at 100% already, it should be easy enough to reproduce by svMotioning a clean VM along the same path.

Well, at that point in time we hadn't updated those servers recently. The first ESENT error occurred within 60 seconds of the svMotion completion in both cases. So, the time correlation is pretty exact at that point. It seems like SOMETHING goofed up in the final quiescing, but I can't find any other logs that indicate what it could be. There were no events relating to IO hangs or anything else prior to the first issue.

In both cases, NTFS errors started being logged within the hour after the first ESENT error. I know DFSR is pretty low level with NTFS journaling, so I'm not sure which is the chicken and which is the egg in this case. But when the DFSR DB blew up with that first error, replication was halted, so I have a feeling the NTFS issues came first since no further writes would have been happening to the drive.

We have svMotioned other servers along that path before without issue, but they weren't doing DFSR. I'm just wondering if it's some quirk in the DFSR stack that could cause NTFS corruption if IO lags at some point (which I could see happening when quiescing after a 6-hour svMotion). We have svMotioned domain controllers before without issue, but I know those automatically disable write caching in the OS when you do a dcpromo.

Gyshall
Feb 24, 2009

Had a couple of drinks.
Saw a couple of things.

Vulture Culture posted:

Have you verified that this is actually related to Storage vMotion, and not a recent Windows update or something? Since you're at 100% already, it should be easy enough to reproduce by svMotioning a clean VM along the same path.

yeah just from the sound of it I don't think this is a storage issue, unless you have some other underlying issues

adorai
Nov 2, 2002

10/27/04 Never forget
Grimey Drawer
Anyone have any opinions of VMware NSX? We are going from a low trust network to a zero trust network with very small subnets, and it looks like it will make it a lot easier for the system administrators to participate in network changes without needing in depth knowledge of routing protocols or firewalls.

some kinda jackal
Feb 25, 2003

 
 
On the fringe of virtualization:

Do you guys have a recommended devops solution for deploying and configuring VMware templates? Ideally I'd like an AIO solution where I have a tool I can use to fill out a form, push a button, and it'll provision the VMware template, then use puppet or chef or ansible to configure the OS layer. I can figure out the latter part of this myself, but I'm looking for something that will also use vSphere to provision the VM itself.

I'm basically in analysis paralysis. There are so many devops tools that every time I try to figure this out I end up with 1000 Chrome tabs open and more questions than I started with.

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams
PowerCLI is pretty powerful and can do basically anything you want to a machine.
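
If you'd rather drive the provisioning step from a script than from the PowerCLI prompt, the same thing can be done against the vSphere API directly. Here's a rough, hedged sketch using pyVmomi; the vCenter address, credentials, and the template/cluster/datastore/VM names are all placeholders, and the Puppet/Chef/Ansible handoff for the OS layer would happen after the clone finishes.

code:

# Rough sketch (not PowerCLI): clone a new VM from a template via the vSphere
# API using pyVmomi. All names and credentials below are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim


def find_by_name(content, vimtype, name):
    """Linear search of the vCenter inventory for an object of the given type and name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(obj for obj in view.view if obj.name == name)


ctx = ssl._create_unverified_context()  # lab shortcut; verify certs for real use
si = SmartConnect(host="vcenter.example.com", user="svc_deploy", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    template = find_by_name(content, vim.VirtualMachine, "win2012r2-template")
    cluster = find_by_name(content, vim.ClusterComputeResource, "Prod-Cluster")
    datastore = find_by_name(content, vim.Datastore, "pure-ds01")

    # Where the clone lands: the cluster's root resource pool and a datastore.
    relospec = vim.vm.RelocateSpec(pool=cluster.resourcePool, datastore=datastore)
    clonespec = vim.vm.CloneSpec(location=relospec, powerOn=True)

    # Clone() returns a Task; WaitForTask blocks until it completes or raises.
    task = template.Clone(folder=template.parent, name="app-web-01", spec=clonespec)
    WaitForTask(task)
finally:
    Disconnect(si)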

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Martytoof posted:

On the fringe of virtualization:

Do you guys have a recommended devops solution for deploying and configuring VMware templates? Ideally I'd like an AIO solution where I have a tool I can use to fill out a form, push a button, and it'll provision the VMware template, then use puppet or chef or ansible to configure the OS layer. I can figure out the latter part of this myself, but I'm looking for something that will also use vSphere to provision the VM itself.

I'm basically in analysis paralysis. There are so many devops tools that every time I try to figure this out I end up with 1000 Chrome tabs open and more questions than I started with.
Standard "that's not DevOps" rant aside, you basically have three kinds of generic packaged solutions going after the existing vSphere market:

  • VMware's value-added offerings in this space, like vCloud Automation Center
  • Service catalog portals on top of OpenStack/OpenNebula, like vOneCloud
  • Third-party orchestration systems, like RightScale

These all target different needs and different budgets, and honestly, none of them are all that good, and they'll all require more customization to be useful than you probably want. If we had some idea of what kind of services you're deploying and what the requirements are behind those services, we could assemble a better mental model of what it is that you actually need. Maybe Cloud Foundry would be a better fit for many of your workloads.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

Martytoof posted:

I'm basically in analysis paralysis. There are so many devops tools that every time I try to figure this out I end up with 1000 Chrome tabs open and more questions than I started with.
Maybe this will help: http://blog.circleci.com/its-the-future/

some kinda jackal
Feb 25, 2003

 
 
Thanks, yeah I'm not terribly well versed in this area and devops is just a buzzword that seems to pop up so apologies for misusing it. Thanks for the info erryone.


That's basically me right now, yeah :(

Docjowles
Apr 9, 2009


:smithicide:

This part owns in particular:

quote:

-It means they’re poo poo. Like Mongo.

I thought Mongo was web scale?

-No one else did.

mAlfunkti0n
May 19, 2004
Fallen Rib
Stuff like this makes me awestruck that we haven't been run over by the tank that is "IT". So much crap going on that people want to push as standards, to the point where it's almost impossible to learn.

bull3964
Nov 18, 2000

DO YOU HEAR THAT? THAT'S THE SOUND OF ME PATTING MYSELF ON THE BACK.



Too close to home man, too close.

jre
Sep 2, 2011

To the cloud ?



quote:

What about something on OpenStack?

-Ew.

Ew?

-Ew.

Evol is going to have a stroke when he reads that

Docjowles
Apr 9, 2009

jre posted:

Evol is going to have a stroke when he reads that

No I'm pretty sure that's the reaction anyone who has had to actually work with OpenStack in real life has. It keeps getting better but it is still kind of a shitshow.

Not many other games in town if you want to run a private cloud, though!

evol262
Nov 30, 2010
#!/usr/bin/perl

jre posted:

Evol is going to have a stroke when he reads that

The whole thing was Silicon Valley-ish. So true it starts to become depressing instead of funny. Especially as I'm actively working on containerizing a bunch of stuff that doesn't actually belong in containers because there are business drivers towards the buzz...

I also work on RHEV/oVirt, which helps.

Openstack is fun and all, but a lot of the developers have obviously never worked in a production environment, and the vast majority of people have never been admins who had to deal with real world poo poo. If you already have a GCE/AWS/Azure-centric workflow (meaning all the anonymous machines, devops-y, CI, etc bits), it can be fun. But it's not at all a replacement for traditional virt. We tell customers this, but they insist they want openstack. Then they get it and don't know what to do with it or how to make it work with their workflow.

Part of me thinks that the person saying "ew" to openstack probably does a bunch of janky non-cloud poo poo with AWS or has convinced themselves that vSphere/XenServer/Hyper-V+puppet is somehow just as good as "cloud" bits, but that's veering towards no true scotsman.

If you need openstack, you already know it. If you don't think you need openstack, you don't. I'm not offended by this at all. I do think it's funny how polarizing it is and how much everyone wants to follow the hype even if it's not remotely suitable for their business needs.

DevNull
Apr 4, 2007

And sometimes is seen a strange spot in the sky
A human being that was given to fly

evol262 posted:

containers

Trigger warnings please.

There are people at VMware that think we should stop doing anything with VMs, and just be a container company instead. Containerware.

madsushi
Apr 19, 2009

Baller.
#essereFerrari

DevNull posted:

Trigger warnings please.

There are people at VMware that think we should stop doing anything with VMs, and just be a container company instead. Containerware.

So the web client guys have moved on to containers, got it.

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

adorai posted:

Anyone have any opinions of VMware NSX? We are going from a low trust network to a zero trust network with very small subnets, and it looks like it will make it a lot easier for the system administrators to participate in network changes without needing in depth knowledge of routing protocols or firewalls.

My experiences with NSX are mostly positive. Its biggest negatives are the price and the fact that you'll have to use the vSphere web client to manage it/get it up and running. Also, VMware product management seems pretty keen to listen to feedback, so the next version is going to fix a good chunk of things that annoy me about it.

With the way NSX does its firewalling you may not even need a lot of small subnets, and if you want to stick with that design you can at the very least avoid VLAN sprawl. Essentially, NSX can let you enforce policy right at the vNIC. It can enforce policy by port-group name, resource pool, identity, source and destination L2/L3 addresses/ports, naming conventions, whatever. You could create rules like "anything sourced from this port group with a destination to the same port group, drop it; only allow traffic to this app." Pretty helpful stuff when you want to just define firewalls once and not have to keep updating them later just because you had to grow the application.
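
To make that concrete, here's a toy sketch of what object-based policy looks like when expressed as data. To be clear, this is not the NSX API or its rule syntax, just an illustration (with made-up field names) of rules keying off port groups and app tags instead of subnets.

code:

# Toy illustration of object-based (rather than subnet-based) firewall policy.
# These field names are hypothetical; the point is that rules reference port
# groups and app tags, so they don't need rewriting as the application grows.
RULES = [
    # same port group talking to itself: drop
    {"src": {"port_group": "pg-web"}, "dst": {"port_group": "pg-web"}, "action": "drop"},
    # web tier may reach the app it fronts
    {"src": {"port_group": "pg-web"}, "dst": {"app": "billing-api"}, "action": "allow"},
    # default deny
    {"src": "any", "dst": "any", "action": "drop"},
]


def evaluate(flow):
    """Return the action of the first rule whose src/dst selectors match the flow."""
    def matches(selector, endpoint):
        if selector == "any":
            return True
        return all(endpoint.get(key) == value for key, value in selector.items())

    for rule in RULES:
        if matches(rule["src"], flow["src"]) and matches(rule["dst"], flow["dst"]):
            return rule["action"]
    return "drop"


# A web VM talking to another web VM in the same port group gets dropped,
# no matter which IPs or VLANs the two VMs happen to have.
print(evaluate({"src": {"port_group": "pg-web"}, "dst": {"port_group": "pg-web"}}))  # -> drop
print(evaluate({"src": {"port_group": "pg-web"}, "dst": {"app": "billing-api"}}))    # -> allow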

I'll be happy to answer any direct questions you may have. I'm long overdue for an effort post on the subject.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
Evol, what did you think of this talk?

https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/openstack-is-doomed-and-it-is-your-fault

Gucci Loafers
May 20, 2006

Ask yourself, do you really want to talk to pair of really nice gaudy shoes?


After reading over all the OpenStack and DevOps conversations, videos and bits I have come to the realization I have zero idea about any of these things.

Gyshall
Feb 24, 2009

Had a couple of drinks.
Saw a couple of things.

madsushi posted:

So the web client guys have moved on to containers, got it.

Whoever thought the web client was a positive move away from the classic vSphere/ESX/i client interface should have their hands chopped off.

Moey
Oct 22, 2010

I LIKE TO MOVE IT

Gyshall posted:

Whoever thought the web client was a positive move away from the classic vSphere/ESX/i client interface should have their hands chopped off.

This. I only go into the web client when needed.

some kinda jackal
Feb 25, 2003

 
 

Moey posted:

This. I only go into the web client when needed.

As a Mac user, primarily, my life is a living hell every time I need to do any VM things, which is why I'm looking for a cool tool to do it for me :q:

Actually it turns out that Puppet + Foreman will do just about everything I need, including provisioning VMs so I'm going to have to look at the web UI less and less now :dance:

adorai
Nov 2, 2002

10/27/04 Never forget
Grimey Drawer

Moey posted:

This. I only go into the web client when needed.

You can do everything in PowerCLI, and it's less painful to learn the commands on the fly than it is to use the web UI.

Inspector_666
Oct 7, 2003

benny with the good hair

Tab8715 posted:

After reading over all the OpenStack and DevOps conversations, videos and bits I have come to the realization I have zero idea about any of these things.

This is exactly how I feel. It all sounds cool, but holy poo poo what the gently caress does any of it actually mean?

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Inspector_666 posted:

This is exactly how I feel. It all sounds cool, but holy poo poo what the gently caress does any of it actually mean?
Here's everything you need to know in a few paragraphs.

DevOps is a method of delivering and running software that focuses on delivering value fast. Think of it this way: no matter what great new feature you've written for your software product (or, if you deal with implementing off-the-shelf software, the great feature you just paid some vendor a lot of money for), it is literally almost worthless to your company until it is in the hands of users. DevOps is a methodology and set of practices focused on shortening that gap.

Software used to be delivered in a way that meshed well with the waterfall method of software development. People planned big, monolithic releases of these giant software programs, wrote down thousands upon thousands of line items in an MS Project file, then forgot that it takes work to actually maintain and operate the software in production. So, they would fail to include operations in any of the requirements, toss this live grenade of a software package over the wall two or four times a year, and say "here, make this work." Ops would have two weeks, if they were lucky, to absorb months of changes, get them working in test environments, then move everything over to production and hope that things would work. If the development side of the house had their things in order, they would get identical or nearly-identical build artifacts to deploy. If not, you might get completely different packages in testing and production, built by completely different people, on completely different systems, using completely different processes. Because of the huge number of changes, there would be hours or days of downtime, and rollbacks would be impossible. By the time everything was done, the software that got delivered probably wasn't what the client wanted anyway. It didn't matter -- the project schedule was set in stone, and their changes will probably make it into the next release six months from now, codenamed Molotov Cocktail.

Enter the Agile software development trend in the early 2000s. Monolithic releases started to be largely replaced by software that adapted quickly to the changing needs of users, because those needs in turn reflected the changing needs of the business, which has to either respond to competitive threats or lose to another product. Instead of releasing every six months, they might release every month, or every two weeks. Things got better, but because people still had to put these things into production, there were lots of delays. Maybe the environment used to test wasn't quite the same as production, and things still blew up, because someone wanted a new version of some library on their dev boxes and it never made it into production. So you're still spending a lot of time working on back-and-forth exchanges with lots of finger pointing between developers and operations people.

DevOps is systems administration (a discipline with a rich legacy of PMI-style waterfall project management) catching up to the Agile way of doing things. Operations works with developers from the beginning stages of a product to ensure that development environments look just like production. The same methods are used to produce dev and test VMs that are used to produce production environments -- often some kind of code-driven configuration management system like Puppet or Chef or PowerShell Desired State Configuration. Software builds and deploys are automated from start to finish to ensure that, environment aside, the thing actually being tested in dev is the same thing that's being run in production. Continuous integration always runs automated tests so operations isn't inadvertently given broken software. Because test systems are always updated with real (sanitized) data, there's a big incentive to make sure that migrations are non-disruptive so they don't kill the dev team's work. Configuration options and feature flags are used to control things that might not be ready for prime time.
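
A minimal sketch of the feature-flag piece, since it's the easiest bit to show in a few lines. The flag store and the percentage rollout here are purely illustrative; real setups usually keep this in a config service rather than a hardcoded dict.

code:

# Minimal feature-flag sketch: the new code ships to production but stays dark
# until the flag is enabled, optionally for only a percentage of users.
# Flag names and the bucketing scheme are illustrative assumptions.
import hashlib

FLAGS = {
    "new_checkout_flow": {"enabled": True, "rollout_percent": 10},
}


def flag_enabled(name, user_id):
    flag = FLAGS.get(name, {"enabled": False})
    if not flag.get("enabled"):
        return False
    # Deterministic bucketing: the same user always lands in the same bucket,
    # so they don't flip between old and new behaviour on every request.
    bucket = int(hashlib.sha1(f"{name}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < flag.get("rollout_percent", 100)


if flag_enabled("new_checkout_flow", user_id="42"):
    print("render new checkout")   # stand-in for the new code path
else:
    print("render old checkout")   # stand-in for the existing code path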

These things all increase the reliability of the system. Because releases are so similar to each other, it's very easy to test changes. Roll your updates out to a subset of your systems -- transparently running two or more versions in production at once -- and if your monitoring systems don't fire, your users don't complain, and your system performance still looks fine, you roll out everywhere else. If something goes wrong, you roll back to your previous version that's identical but doesn't have feature X yet.
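
The roll-out-to-a-subset loop in the previous paragraph, expressed as code, looks roughly like this. deploy(), healthy(), and rollback() are hypothetical stand-ins for whatever your deploy tooling and monitoring actually provide.

code:

# Sketch of a canary rollout: push the new version to a small batch of hosts,
# let the monitoring soak, and either continue or roll everything back.
# deploy()/healthy()/rollback() are hypothetical stand-ins, not a real API.
import time


def deploy(host, version):
    print(f"deploying {version} to {host}")   # e.g. trigger your CM/CD tooling


def healthy(host):
    return True                               # e.g. query error rates and alerts


def rollback(host):
    print(f"rolling back {host}")             # redeploy the previous artifact


def canary_rollout(hosts, version, batch_size=2, soak_seconds=300):
    deployed = []
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            deploy(host, version)
            deployed.append(host)
        time.sleep(soak_seconds)              # give metrics and alerts time to fire
        if not all(healthy(h) for h in deployed):
            for host in deployed:             # identical artifact minus the new
                rollback(host)                # change, so rolling back is cheap
            raise RuntimeError(f"{version} failed canary on batch {batch}")
    return deployed


canary_rollout(["web01", "web02", "web03", "web04"], "build-1234", soak_seconds=1)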

Because each individual change is now so low-risk, changes become routine. Instead of deploying a new version once every six months, you're deploying twice a week, then once a day, then five times a day, then you deploy different services sixty times a day. Your features aren't sitting there collecting dust anymore. They're in the hands of users, who are actually extracting value from them and making you money.

This way of doing things requires a lot of changes from the infrastructure side. If you're programmatically building and deploying your software from an automated system, you can't sit around for two weeks while some IT paper-pusher approves your VMs. You need a certain amount of capacity, and you need some automated way to manage those systems. Furthermore, you can't have each app owner worrying about dumb bookkeeping things like lists of IP addresses. So, you adopt infrastructure management systems that abstract that away from you. You put your VMs on systems that allow you to add capacity elastically, rather than making people juggle VMs between datastores they don't care about on SAN hardware they don't care about. You automatically assign them IPs out of pools instead of accounting for them in a spreadsheet somewhere. You let them assign firewall rules programmatically, instead of waiting for a security team to get to the VM. These things are all characteristics of an Infrastructure-as-a-Service platform like OpenStack or its cloud superiors like Amazon Web Services, Google Compute Engine or Microsoft Azure -- you trade some of your infrastructure flexibility for the ability to standardize and automate huge parts of your platform, enabling your service delivery to work in ways it literally couldn't before.
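
As a hedged, concrete example of that self-service workflow, here's roughly what it looks like against an OpenStack cloud using the openstacksdk client. The cloud name, image/flavor/network names, and the single HTTPS rule are placeholders for whatever your deployment actually defines.

code:

# Sketch of programmatic self-service: create a firewall rule (security group)
# and a VM without tickets or spreadsheets. All names here are placeholders.
import openstack

conn = openstack.connect(cloud="mycloud")   # credentials come from clouds.yaml

# Security groups are the programmatic firewall rules owned by the app team.
sg = conn.network.create_security_group(name="web-tier")
conn.network.create_security_group_rule(
    security_group_id=sg.id,
    direction="ingress",
    protocol="tcp",
    port_range_min=443,
    port_range_max=443,
    remote_ip_prefix="0.0.0.0/0",
)

# The instance gets an address out of the network's allocation pool
# automatically; nobody maintains a spreadsheet of IPs.
server = conn.create_server(
    name="web-01",
    image="centos-7",
    flavor="m1.small",
    network="app-net",
    security_groups=[sg.name],
    wait=True,
)
print(server.id, server.status)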

Maybe you don't even care about VMs. Maybe you have an application with some kind of runtime, like Ruby or Python or Node.js, and it requires a backend database that someone has to run and back up and monitor, and who wants to manage all that poo poo? Enter Platform-as-a-Service products like Cloud Foundry or Heroku or Amazon Elastic Beanstalk, which allow you to just take some kind of application package and upload it somewhere and it worries about how to run the thing.

Vulture Culture fucked around with this message at 06:19 on Jun 12, 2015

evol262
Nov 30, 2010
#!/usr/bin/perl

I wasn't at summit (going to linuxcon/kvmforum/containercon instead :suicide:, but I'm gonna be in Seattle instead of Phoenix for a week in the middle of August, so that's a plus), but I already want to watch this based on the abstract. I'll report back tomorrow when I'm sober and can digest the talk fully.

I complain a lot on SH/SC about big vendors purposely crippling neutron and swift so they can push their own poo poo that the reference implementation could easily do if it weren't designed by 20 committees and/or if we (dev) were better at politicking.

But I also kind of think openstack is adrift. Not necessarily bad, but why is nova doing ha? How is that :cloud:?

Someone should be pushing back against kubernetes really hard unless you need mesos, and any container which needs its own special networking poo poo, privileged containers, containers that need config management, containers that have "apt-get update" or "yum upgrade" anywhere, and all this other poo poo is business chasing hype instead of selling real solutions for real problems (actual anonymous vms are the solution to all these problems, ala openstack or aws or whatever; containers have their place).

I feel like openstack was the hype and got pulled in 60 directions to do everything without a clear goal, and now that other poo poo is drinking its hype milkshake, it's floundering trying to set out real use cases that are not "the future of virtualization" and limiting scope to poo poo that's useful and desirable. Like cleaning up the poo poo instead of adding a new incubator project every 6 months until it's ABCXYZaaS.

I'm guessing the talk covers the same ground, but I'll respond tomorrow...

Mr Shiny Pants
Nov 12, 2012

madsushi posted:

So the web client guys have moved on to containers, got it.

Nice :)

loving webclient :(

Vulture Culture posted:

Here's everything you need to know in a few paragraphs.

snip..

This is great, thanks.

Mr Shiny Pants fucked around with this message at 10:00 on Jun 12, 2015

bull3964
Nov 18, 2000

DO YOU HEAR THAT? THAT'S THE SOUND OF ME PATTING MYSELF ON THE BACK.


Vulture Culture's description could have been a word-for-word account of how things have evolved where I am over the past 7 years.

In my previous job I was mostly sysadmin, but I also handled the code deployments for a SaaS company.

The brother of one of my coworkers there poached me (and his brother) to work where I am now, at another SaaS. My main responsibility at that point in time was to manage code deployments. However, the sysadmin they had at the time was mostly a Linux person and the company was moving in the direction of .NET, so the number of Windows servers kept growing and I was taking over their care and feeding as well. When I started there, we would do a handful of builds to QA a week and we would do a release to production once every 1.5 to 2 months. Those releases were done starting at 8pm and we would take the sites down completely and not expose them to the world again until primary QA was complete. Those releases were hell. DB updates were a chore since applying 8 weeks worth of changes never worked as smoothly as incremental changes several times a week. Also, invariably, some DB script or config change would get lost to time and have to be chased down.

Those were the dark times. We often wouldn't wrap up and take the sites live again until 1-3am and even then there were usually issues that would have to be corrected the next day. Everyone would be burnt out. No one felt good about releases.

Over the years we grew. Many, many more projects were coming down the pipe. Much more dev was happening. Builds to QA were happening on a daily basis and patches were going into QA sometimes several times a day. We started moving away from our monolithic code base for our shared platform in favor of pluggable MVC sites. Aligning all customers to get new features on the same schedule every 6-8 weeks became untenable from a business perspective and a testing perspective. So, we started supplementing the big feature releases with weekly SP releases comprised mostly of minor enhancements and bug fixes. Like filling an ice cube tray, though, the amount of stuff that wouldn't manage to fit in the main release was starting to spill out into these weekly SP releases, so they started to grow slightly more complex. At some point we realized that we didn't have a clear line anymore between a "major take the sites down and stay all night" release and a quick afternoon push that was done while sites were live. So, at that point, we decided to nix the huge site-down releases in favor of the smaller releases more frequently.

Over time, afternoon releases turned into morning releases so people were fresher and had time to solve problems during the day, and the weekly schedule backed off a bit to every two weeks so QA would have more time to validate changes. But I think we've struck a good balance now.

Operationally, my team and I work side by side with the developers. They talk to us directly when they have a new deployment that has to happen, and we try to provide them with every bit of support we can for them to test their deployment instructions themselves. That part is still a work in progress right now but improving. We also do everything we can to expose production-level errors and metrics to them, so the team involved with something having issues can jump right in and start taking a look at what's blowing up without having to wait for info to percolate down to them (New Relic is amazing).

On the not so great side, my team is still responsible for office level desktop support and 'muggle' software like Exchange as well as managing the internal office infrastructure. Those things tend to be time sinks. We have, at least, recognized that we need to change that up a bit and are starting to put together plans for dedicated internal help desk staff to deal with routine "My Outlook stopped working!" issues.

Docjowles
Apr 9, 2009

bull3964 posted:

When I started there, we would do a handful of builds to QA a week and we would do a release to production once every 1.5 to 2 months. Those releases were done starting at 8pm and we would take the sites down completely and not expose them to the world again until primary QA was complete. Those releases were hell. DB updates were a chore since applying 8 weeks worth of changes never worked as smoothly as incremental changes several times a week. Also, invariably, some DB script or config change would get lost to time and have to be chased down.

Those were the dark times. We often wouldn't wrap up and take the sites live again until 1-3am and even then there were usually issues that would have to be corrected the next day. Everyone would be burnt out. No one felt good about releases.

You're giving me flashbacks to my first job out of college. It was a SaaS startup in Boston (although I don't think that term existed or was popular yet) in the mid-2000s. I was junior sysadmin/QA/helldesk bitch, but I ended up working very closely with the developers to manage the test environment as well as their build/CI server (CruiseControl, IIRC). The company actually had a lot of the DevOps culture stuff down; dev and ops had a really outstanding relationship. Everyone was very friendly, and communicated well on the status of things.

But the development model was entirely oldschool waterfall, and holy hell did those quarterly release days suck. We'd come in on Saturday morning, take the site down for maintenance, and often be there 12 hours working non-stop to deploy, upgrade, test and hotfix the inevitable show stopping bugs. There was always, always some major screwup with the database migration, as you said. To their credit, the devs were there, too. So it wasn't quite the same "throw a live grenade over the wall" scenario. But that was small comfort when I'd lose a whole weekend to release day bullshit.

I am so, so glad the industry is moving away from that style of work. Agile and DevOps have been a godsend :angel:

bull3964
Nov 18, 2000

DO YOU HEAR THAT? THAT'S THE SOUND OF ME PATTING MYSELF ON THE BACK.


Yeah, we always had dev around too during the big releases, so we were never at the live grenade stage of things. Though sometimes having dev around was a curse too, since rollback was never ever an option because we had the personnel on hand to 'fix' the issue. Never mind that they were sleep deprived and wrung out and not thinking clearly anymore.

I would say that there was a 50% chance that the last minute fix would either need to be fixed or rolled back the next day.

Docjowles
Apr 9, 2009

Also, you can bet your rear end I do my best during interviews for Ops jobs now to make sure I don't get back into that situation. "What does your plan/build/test/deploy pipeline look like? How often do you release new code to production? What sort of config management tools are you using? Do those same tools configure the test environments? What's the relationship like between dev and ops? What sort of procedures do you go through during and after a major site outage?"

If the answer to any of these is "uh... huh huh huh... uh... :beavis:" that's a giant red flag that they're stuck in an old, lovely way of operating. Unless the job is specifically to help them get out of that, which could be a lot of fun if they are actually serious about it.

That's actually an angle I hadn't really thought about til now. Not only are companies that do giant waterfall releases a couple times a year taking forever to deliver that value to users. They're also driving away talented candidates who have no desire to spend 16 hour days firefighting after every deploy. Or carry a pager 24/7 because the code is a steaming pile that falls over at the drop of a hat, and that's somehow Ops' problem.

evol262
Nov 30, 2010
#!/usr/bin/perl

So, I watched this. He started off really well. The point of "plugins" is "we don't like your implementation". The point of contrib is "we don't want to write or maintain this". That stuff shouldn't be in the repo. There is a lot of deprecated poo poo that we could/should pull out (though pulling support in the real world with real customers is always tricky, and users who have been there since Essex and have upgraded through may actually have a lot of legacy poo poo in their config files).

He's advocating basically the same bullshit Vulture Culture and I were talking about recently. "I've never had to manage anything in a production environment, so as long as I remove this, and it passes my test suite on my dev box/jenkins builder, it must be fine!"

Real customers in the real world have environments that, while not necessarily "legacy", are not rebuilt from scratch every time they upgrade to a new release. VMs and services in your own environment probably do not get routinely obsoleted by new trends with users pushed off, like AWS. You set it up, it works, it's crucial for your business, and you probably don't have the money, infrastructure, time, or willpower to build an identical environment to current best practice and swap over.

In reality, dropping a bunch of poo poo from Nova would cause a massive headache for a lot of companies. And probably Docjowles.

Replacing Nova is an option. But the end of the talk just turned into :jerkbag: bullshit. "Oh, I've been at multiple startups, and I only want to write code that's exciting. gently caress your business drivers! HackerNews thinks that the cloud is now re-implementing jumpstart+FLARs, deploying zones on bare metal! We should be doing that!" Never mind that there are already projects which are intended for deploying containers on bare VMs or metal, like Atomic and Kubernetes and CoreOS. And we don't need to be and shouldn't be pursuing that. He's advocating exactly what I feel is the problem with Openstack. We're just chasing bullshit all the time.

If you want to write a clean API to make it easy to deploy, then do it. That's great. That's the takeaway. But you can't keep running around like a dog chasing a rabbit towards whatever the trend has been for the last 3 weeks and hope to establish any kind of real product as a standard if you do that.

I rate his talk 1/10, but it's a solid 10/10 for pandering to the crowd and what they think would be super cool at this very moment.

DevNull
Apr 4, 2007

And sometimes is seen a strange spot in the sky
A human being that was given to fly

evol262 posted:

We're just chasing bullshit all the time.

This pretty much sums up a lot of people at any company. They want to chase the hottest trends to try and get a promotion. They end up just creating a mess for someone else to clean up, since they end up leaving after a year or two. Sadly, they get promoted in the process and the people that clean up don't.

Docjowles
Apr 9, 2009

evol262 posted:

Real customers in the real world have environments that, while not necessarily "legacy", are not rebuilt from scratch every time they upgrade to a new release. VMs and services in your own environment probably do not get routinely obsoleted by new trends with users pushed off, like AWS. You set it up, it works, it's crucial for your business, and you probably don't have the money, infrastructure, time, or willpower to build an identical environment to current best practice and swap over.

Heh, this is actually pretty much how we upgraded from Grizzly to Icehouse, since AFAIK an in-place upgrade was not supported at all at the time. We had some spare machines and built up a couple of new control and compute nodes from scratch running Icehouse. Then we slowly tore down VMs on the old blade chassis and rebuilt them on the new ones. Thankfully we automate almost the entire build process, so this wasn't a completely insane amount of work. Once a whole blade was vacated, we rebuilt it on Icehouse too and added it to the compute pool. Eventually, everything was migrated over in that fashion.

It loving sucked. When/if we decide to upgrade again, I really hope the in-place stuff works! I'm sure there will be some configs that need tweaking, but I have zero desire to go through that exercise again.

bull3964
Nov 18, 2000

DO YOU HEAR THAT? THAT'S THE SOUND OF ME PATTING MYSELF ON THE BACK.


We had the same problem with our API team for a bit. They would decide to refactor something on a whim:

Team: "If we do this, the security model is so much more flexible than before, we can do this that and the other thing now!"

Me: "Yes, but you didn't test every edge case and you broke interfaces with such and such because you neglected to inform them of your changes. You also checked this in halfway through this testing cycle and QA has no requirements or defects to associate with to create a test plan and sign off on the implementation."

VP of Software development: "Knock this poo poo off. I'm fine with you guys exploring new tech, but there's no business driver for that flexibility and you burned dev, test, and now troubleshooting hours chasing goals we don't have right now. Revert your changes NOW."

Fortunately, they are better now and we haven't had an incident like this in a while. Some people just want to chase new and shiny over business needs and need to get themselves a bit more grounded in reality.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
Still on Essex. If it ain't broke...

evol262
Nov 30, 2010
#!/usr/bin/perl
Openstack has a really, really thorough test suite, so there aren't a lot of edge cases of breakage in current code from changes.

The problem is kind of that, by necessity, that thorough test suite runs in clean machines (or VMs, for TripleO) managed by Jenkins, so "legacy" configuration file changes aren't tested.

But in a general sense, breakage isn't the issue.

The issue is that openstack pretty much needs a couple of full-time guys to get it all running in the beginning (and maybe to keep it running), because a lot of the options aren't intuitive, there's three million config files, 20 different services, and 20+ logs. Not all of these config files are well documented (I mean, it's all documented somewhere online, but that doesn't help you when you're looking at nova.conf and it's 2400 lines with blank lines removed). How the gently caress do you have any idea how to even begin configuring that without a great installer? Which we don't have, by the way.

So you change a value somewhere and something breaks. There were 14 different API calls and 9 messages going through the message bus just to start a VM last time I looked at it. That's an abstracted architecture that makes it easy for dev. It's a total loving nightmare for admins.

It all needs to be simpler. It needs to stabilize enough that real documentation about how to make it work can come out without bringing an openstack expert on site for a week to get you up and running. Containers succeed because they're easy. Mesos/hadoop/kubernetes/openstack/etc flop (relatively -- all are obviously successful at a certain scale, and everyone wants to be CERN/Twitter/whomever) because they're absurdly complex and designed to solve complex problems, but how to make it work for normal-scale businesses gets totally missed somewhere, and that target gets left further behind the more we all chase integration with zookeeper or spark or whatever's hip in 6 months.


Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

evol262 posted:

Openstack has a really, really thorough test suite, so there aren't a lot of edge cases of breakage in current code from changes.

The problem is kind of that, by necessity, that thorough test suite runs in clean machines (or VMs, for TripleO) managed by Jenkins, so "legacy" configuration file changes aren't tested.

But in a general sense, breakage isn't the issue.

The issue is that openstack pretty much needs a couple of full-time guys to get it all running in the beginning (and maybe to keep it running), because a lot of the options aren't intuitive, there's three million config files, 20 different services, and 20+ logs. Not all of these config files are well documented (I mean, it's all documented somewhere online, but that doesn't help you when you're looking at nova.conf and it's 2400 lines with blank lines removed). How the gently caress do you have any idea how to even begin configuring that without a great installer? Which we don't have, by the way.

So you change a value somewhere and something breaks. There were 14 different API calls and 9 messages going through the message bus just to start a VM last time I looked at it. That's an abstracted architecture that makes it easy for dev. It's a total loving nightmare for admins.

It all needs to be simpler. It needs to stabilize enough that real documentation about how to make it work can come out without bringing an openstack expert on site for a week to get you up and running. Containers succeed because they're easy. Mesos/hadoop/kubernetes/openstack/etc flop (relatively -- all are obviously successful at a certain scale, and everyone wants to be CERN/Twitter/whomever) because they're absurdly complex and designed to solve complex problems, but how to make it work for normal-scale businesses gets totally missed somewhere, and that target gets left further behind the more we all chase integration with zookeeper or spark or whatever's hip in 6 months.
I fought with random OpenStack issues where random commands would fail on random nodes all the time, for several months, until I found a spot buried deep in the docs at the very end of their RabbitMQ HA guide that says "oh by the way, only these four services support RabbitMQ in HA active/active configs" :bang:

Now I've got some hosed up Frankenstein monstrosity that's kinda-sorta-active-active because I didn't plan on motherfucking DRBD being in the equation when I partitioned the disks on the cloud controller nodes.

Whoever is responsible for organizing their documentation needs to be burned alive between Winterfell and The Wall.

Vulture Culture fucked around with this message at 23:18 on Jun 12, 2015
