parid
Mar 18, 2004
June was a bad month for my VMware clusters. We had a series (at least 3) of network outages that prevented many of my hosts from talking to their storage (all NFS). Every time this happens, I have to take time to get vCenter back up before we can dig into fixing the rest of the VMs. In one case, that was an extra hour of delay.

I'd love to fix the root of the problem (the network), but that's in another team's hands. What I do have control over is how we have vCenter implemented. Right now the vCenter/SSO server is physical; I had plans to virtualize it very soon. The SQL server for it is already a VM.

Making vCenter a VM is fine for many failure scenarios, but the kind of disruptive network failure we're seeing more of would be challenging to deal with. I would like to find a way to make it multi-site, or at least have some kind of cold standby if the app in the primary datacenter fails.

What is everyone doing to ensure the availability of their vCenter? It looks like VMware is dropping vCenter Server Heartbeat soon. I see vCenter now supports SQL clustering and Microsoft Cluster Services. Is this additional headache worth considering?

mAlfunkti0n
May 19, 2004
Fallen Rib

parid posted:

June was a bad month for my VMware clusters. We had a series (at least 3) of network outages that prevented many of my hosts from talking to their storage (all NFS). Every time this happens, I have to take time to get vCenter back up before we can dig into fixing the rest of the VMs. In one case, that was an extra hour of delay.

I'd love to fix the root of the problem (the network), but that's in another team's hands. What I do have control over is how we have vCenter implemented. Right now the vCenter/SSO server is physical; I had plans to virtualize it very soon. The SQL server for it is already a VM.

Making vCenter a VM is fine for many failure scenarios, but the kind of disruptive network failure we're seeing more of would be challenging to deal with. I would like to find a way to make it multi-site, or at least have some kind of cold standby if the app in the primary datacenter fails.

What is everyone doing to ensure the availability of their vCenter? It looks like VMware is dropping vCenter Server Heartbeat soon. I see vCenter now supports SQL clustering and Microsoft Cluster Services. Is this additional headache worth considering?

Ignore me .. I can't read apparently :(

mAlfunkti0n fucked around with this message at 17:49 on Jul 2, 2014

Pile Of Garbage
May 28, 2007



Welcome back Dilbert As gently caress.

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

I have a contractor that is looking to provide hardware/software for an industrial control system. They're picking VMware as the platform, but with standalone free ESXi hosts and no support contracts. My suspicion is that they have a support contract on their test lab, re-create issues encountered there, and then pass the results down to their customers. Is there any possible way this is not violating the terms of their VMware contract? They'd essentially be reselling their support contract to their customers, and it's suspicious as hell to me.

mayodreams
Jul 4, 2003


Hello darkness,
my old friend

parid posted:

June was a bad month for my VMware clusters. We had a series (at least 3) of network outages that prevented many of my hosts from talking to their storage (all NFS). Every time this happens, I have to take time to get vCenter back up before we can dig into fixing the rest of the VMs. In one case, that was an extra hour of delay.

I'd love to fix the root of the problem (the network), but that's in another team's hands. What I do have control over is how we have vCenter implemented. Right now the vCenter/SSO server is physical; I had plans to virtualize it very soon. The SQL server for it is already a VM.

Making vCenter a VM is fine for many failure scenarios, but the kind of disruptive network failure we're seeing more of would be challenging to deal with. I would like to find a way to make it multi-site, or at least have some kind of cold standby if the app in the primary datacenter fails.

What is everyone doing to ensure the availability of their vCenter? It looks like VMware is dropping vCenter Server Heartbeat soon. I see vCenter now supports SQL clustering and Microsoft Cluster Services. Is this additional headache worth considering?

We just migrated our vCenter from a VM to a physical due to performance issues with our growing environment. Something to consider if you are looking to go VM.

mAlfunkti0n
May 19, 2004
Fallen Rib

mayodreams posted:

We just migrated our vCenter from a VM to a physical due to performance issues with our growing environment. Something to consider if you are looking to go VM.

What kind of issues were you seeing?

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

If you are expecting any kind of resource contention on your cluster, then you should be setting up resource pools with CPU/memory/disk shares set higher for the super-critical stuff like DCs and your vCenter VM. A minimum of two pools, probably three: one high-priority pool for the really important stuff, one with a normal amount of shares for the regular stuff, and a low-priority one for dev servers so memory/CPU pressure gets applied there first.
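If you'd rather script that layout than click through the client, here is a rough pyVmomi sketch of the same high/normal/low split. The vCenter address, credentials, cluster name, and pool names are placeholders for illustration only (nothing from this thread), and note that disk shares live on VMs/datastores rather than on the pool, so only CPU and memory shares are set here.

code:
# Rough sketch: create high/normal/low resource pools on a cluster with pyVmomi.
# 'vcenter.example.com' and 'Prod-Cluster' are placeholders, not real systems.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def make_alloc(level):
    """Build a CPU/memory allocation that only sets the share level."""
    alloc = vim.ResourceAllocationInfo()
    alloc.shares = vim.SharesInfo(level=level)   # 'high', 'normal', or 'low'
    alloc.reservation = 0
    alloc.expandableReservation = True
    alloc.limit = -1                             # unlimited
    return alloc

def make_pool_spec(level):
    spec = vim.ResourceConfigSpec()
    spec.cpuAllocation = make_alloc(level)
    spec.memoryAllocation = make_alloc(level)
    return spec

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Prod-Cluster")
    root = cluster.resourcePool
    # One pool per priority tier, as described above.
    for name, level in [("Critical", "high"), ("Normal", "normal"), ("Dev", "low")]:
        root.CreateResourcePool(name=name, spec=make_pool_spec(level))
finally:
    Disconnect(si)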

mayodreams
Jul 4, 2003


Hello darkness,
my old friend

mAlfunkti0n posted:

What kind of issues were you seeing?

A lot of lag, slow console response, a lot of load on the host it was residing on. We have multiple datacenters at different locations, and it was becoming difficult to admin them. Moving to a physical with more RAM and procs, and with fast local storage, helped out a lot. We are considering moving the DB to a physical too at some point. This is a 2008 R2 server with vCenter installed, not the appliance. We have around 400 or so VMs for reference.

mAlfunkti0n
May 19, 2004
Fallen Rib

mayodreams posted:

A lot of lag, slow console response, a lot of load on the host it was residing on. We have multiple datacenters at different locations, and it was becoming difficult to admin them. Moving to a physical with more RAM and procs, and with fast local storage, helped out a lot. We are considering moving the DB to a physical too at some point. This is a 2008 R2 server with vCenter installed, not the appliance. We have around 400 or so VMs for reference.

I understand your reason for moving to a physical, but I don't know why it would cause alarm for someone considering moving to a VM. The same would be the case if your physical server hosting vCenter had resource contention issues caused by an application running on it. We have close to 1000 VMs on 30 hosts and vCenter runs just fine, but it stays in a small cluster which only handles management duties (vCenter, SSO, DB, vCOps, etc.).

luminalflux
May 27, 2005



I just applied for the beta. Is there a good guide to setting up a smallish VMware cluster on a single host for testing purposes?
And is there a small form-factor machine (Mac mini or Intel NUC sized) I can easily set up a lab on?

Dilbert As FUCK
Sep 8, 2007

by Cowcaster
Pillbug

BangersInMyKnickers posted:

I have a contractor that is looking to provide hardware/software for an industrial control system. They're picking VMware as the platform, but with standalone free ESXi hosts and no support contracts. My suspicion is that they have a support contract on their test lab, re-create issues encountered there, and then pass the results down to their customers. Is there any possible way this is not violating the terms of their VMware contract? They'd essentially be reselling their support contract to their customers, and it's suspicious as hell to me.

I don't think there is anything concrete stating that you can't do that, but it's sketchy as hell and odd.

Is the $500 Essentials kit too much to ask for?

luminalflux posted:

I just applied for the beta. Is there a good guide to setting up a smallish VMware cluster on a single host for testing purposes?
And is there a small form-factor machine (Mac mini or Intel NUC sized) I can easily set up a lab on?


When you get accepted, they have a special forum with how-tos and some stuff to test out.

Dilbert As FUCK fucked around with this message at 19:09 on Jul 2, 2014

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Dilbert As gently caress posted:

I don't think there is anything concrete stating that you can't do that, but it's sketchy as hell and odd.

Is the $500 Essentials kit too much to ask for?

Apparently: yes. And I'm not thrilled with the idea of using these guys as a middleman if we encounter an issue that is clearly in VMware's court. I don't care how much "validation" they do in their lab.

evol262
Nov 30, 2010
#!/usr/bin/perl

Dilbert As gently caress posted:

I don't think there is anything concrete stating that you can't do that, but it's sketchy as hell and odd.

IANAL, but this is against the terms of every support agreement I've ever seen, including ours. Their license and support agreement specifically bans 'X'-aaS and transferral of the license, which includes "reselling their support".

If they can reproduce the issue on their own hardware, they can certainly open a support case. But you can bet that they'll get their license terminated if VMware finds out they're doing that.

Moey
Oct 22, 2010

I LIKE TO MOVE IT

mayodreams posted:

A lot of lag, slow console response, a lot of load on the host it was residing on. We have multiple datacenters at different locations, and it was becoming difficult to admin them. Moving to a physical with more RAM and procs, and with fast local storage, helped out a lot. We are considering moving the DB to a physical too at some point. This is a 2008 R2 server with vCenter installed, not the appliance. We have around 400 or so VMs for reference.

Eh, I am running about the same sized environment with no issues. Virtualized vCenter and SQL.

Erwin
Feb 17, 2006

mayodreams posted:

A lot of lag, slow console response, a lot of load on the host it was residing on. We have multiple datacenters at different locations, and it was becoming difficult to admin them. Moving to a physical with more RAM and procs, and with fast local storage, helped out a lot. We are considering moving the DB to a physical too at some point. This is a 2008 R2 server with vCenter installed, not the appliance. We have around 400 or so VMs for reference.

Are you sure the original vCenter VM hadn't been upgraded a couple of times, say 4.1 to 5 to 5.1 to 5.5, and the new physical one was built from scratch?

Nitr0
Aug 17, 2005

IT'S FREE REAL ESTATE

mayodreams posted:

A lot of lag, slow console response, a lot of load on the host it was residing on. We have multiple datacenters at different locations, and it was becoming difficult to admin them. Moving to a physical with more RAM and procs, and with fast local storage, helped out a lot. We are considering moving the DB to a physical too at some point. This is a 2008 R2 server with vCenter installed, not the appliance. We have around 400 or so VMs for reference.

Your problems sound like they're storage related.

Docjowles
Apr 9, 2009

Nitr0 posted:

Your problems sound like they're storage related.

The Something Awful Forums > Discussion > Serious Hardware / Software Crap > Your problems sound like they're storage related.

Nice, you've condensed half of the posts in SH/SC into one thread.

luminalflux
May 27, 2005



Docjowles posted:

The Something Awful Forums > Discussion > Serious Hardware / Software Crap > Your problems sound like they're storage related.

Nice, you've condensed half of the posts in SH/SC into one thread.

The other half is "blame the network"

mAlfunkti0n
May 19, 2004
Fallen Rib

Nitr0 posted:

Your problems sound like they're storage related.

Not sure if you're just being silly or what.

Either way, he/she stated the host the VM resided on was quite busy, and in those cases slowness can come from many places. Was the server busy waiting on CPU scheduling? Was memory ballooning a problem? More often than not, though, I hear "it must be a network problem!" rather than storage getting the blame.

luminalflux posted:

The other half is "blame the network"

Yup. I feel so bad for our network team. Even the team I am part of sends some completely stupid things to them. We are a Cisco shop and have CDP enabled, yet the guys frequently send requests over with no switch/port info, AND IT'S SO EASY TO GET...

Honestly I don't know how they survive sometimes.
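Case in point, here is a rough pyVmomi sketch that pulls the CDP-advertised switch and port for every physical NIC on every host, so there's no excuse for leaving it out of a ticket. The host/vCenter address and credentials are placeholders, and it assumes CDP (not LLDP) is what the switches speak; none of that comes from this thread.

code:
# Rough sketch: print CDP switch/port info for each pNIC on every ESXi host.
# 'vcenter.example.com' and the credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.HostSystem], True)
    for host in view.view:
        net_sys = host.configManager.networkSystem
        # QueryNetworkHint returns what the upstream switch advertises (CDP here).
        for hint in net_sys.QueryNetworkHint():
            cdp = hint.connectedSwitchPort
            if cdp:
                print(f"{host.name} {hint.device}: switch={cdp.devId} port={cdp.portId}")
            else:
                print(f"{host.name} {hint.device}: no CDP info seen")
finally:
    Disconnect(si)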

Nitr0
Aug 17, 2005

IT'S FREE REAL ESTATE
vCenter takes so little RAM and CPU that if you are running into CPU contention and RAM ballooning, then you've got way bigger problems.

Storage! Storage! Storage!

mAlfunkti0n
May 19, 2004
Fallen Rib

Nitr0 posted:

vCenter takes so little RAM and CPU that if you are running into CPU contention and RAM ballooning, then you've got way bigger problems.

Storage! Storage! Storage!

How big of an environment do you manage? I've seen CPU and RAM usage go plenty high numerous times. Also, "you have way bigger problems": sure, they just might, which is one of the things I pointed out in my earlier post.

Nitr0
Aug 17, 2005

IT'S FREE REAL ESTATE
1000 VMs, 40+ hosts

mAlfunkti0n
May 19, 2004
Fallen Rib
We're around similar numbers then. I suppose if things are run right (is there such a place?) vCenter won't go nuts from time to time. Sadly, that's not a place I have ever worked at. :(

YOLOsubmarine
Oct 19, 2004

When asked which Pokemon he evolved into, Kamara pauses.

"Motherfucking, what's that big dragon shit? That orange motherfucker. Charizard."

If your vCenter VM is performing badly, it's because you don't have enough host or storage resources, not because vCenter runs badly as a VM. Capacity planning is important!

mayodreams
Jul 4, 2003


Hello darkness,
my old friend
I don't know the full history, but I know it was upgraded from 4.1 -> 5.1 -> 5.5.

We only have 6 good hosts (DL380 G8) and 2 older G6s we are looking to replace with G8s. We have a ton of aging hardware and ballooning needs for servers and QA testing, so we are overwhelming our infrastructure faster than we can build it. Storage isn't bad, but we are using a Nexenta-based system from Pogo rather than a NetApp or similar. Still 4Gb fiber x2 to the hosts, though. It was previously iSCSI on gigabit, which was really, REALLY loving ugly.

tl;dr: the company doesn't want to really commit to upgrading ancient hardware and software, which is evidenced by our continued deployment of new desktops with XP and Novell authentication with eDirectory on NetWare. I wish I were kidding. :suicide:

Nitr0
Aug 17, 2005

IT'S FREE REAL ESTATE
lol, I went and looked that up, and it's basically a whitebox system. I configured one and by default it stuck in a bunch of slow 7200 RPM disks.

So depending on how much money your company spent, your storage probably sucks.

WHO'S GONNA DISAGREE WITH ME NOW, HUH?! THAT'S RIGHT.

Nitr0
Aug 17, 2005

IT'S FREE REAL ESTATE
Run a benchmark of that thing, I want to see!

Sickening
Jul 16, 2007

Black summer was the best summer.

mayodreams posted:

I don't know the full history, but I know it was upgraded from 4.1 -> 5.1 -> 5.5.

We only have 6 good hosts (DL380 G8) and 2 older G6s we are looking to replace with G8s. We have a ton of aging hardware and ballooning needs for servers and QA testing, so we are overwhelming our infrastructure faster than we can build it. Storage isn't bad, but we are using a Nexenta-based system from Pogo rather than a NetApp or similar. Still 4Gb fiber x2 to the hosts, though. It was previously iSCSI on gigabit, which was really, REALLY loving ugly.

tl;dr: the company doesn't want to really commit to upgrading ancient hardware and software, which is evidenced by our continued deployment of new desktops with XP and Novell authentication with eDirectory on NetWare. I wish I were kidding. :suicide:

In case anybody is wondering, one of these is probably the storage being discussed.

http://www.pogostorage.com/products/nexenta_appliances

8 hosts with %mystery% number of VMs is probably not a good sign that Nitr0 is wrong.

mAlfunkti0n
May 19, 2004
Fallen Rib

Nitr0 posted:


WHO'S GONNA DISAGREE WITH ME NOW, HUH?! THAT'S RIGHT.

I'LL ARGUE WITH YOU JUST TO ARGUE! :smuggo:

But anyways, yes, if it's just a bank of 7200s... I pity him.

mayodreams, tell me you do follow the rightsizing concept... right? At least get the stuff you can fix out of the way.

mAlfunkti0n fucked around with this message at 13:36 on Jul 3, 2014

mayodreams
Jul 4, 2003


Hello darkness,
my old friend
I think the VM storage pool is 10k drives in RAID 10, but I know the mass storage is 7200 RPM drives in RAID 6. We are screaming for more hosts, but yeah, storage is probably a problem for us. We have about 275 Windows (2003-2012 R2) and about 100 Linux (RHEL 4/5/6) VMs. We try to balance loads across the hosts manually, but haven't done it with any VMware tools.

Sickening
Jul 16, 2007

Black summer was the best summer.

mayodreams posted:

I think the VM storage pool is 10k drives in RAID 10, but I know the mass storage is 7200 RPM drives in RAID 6. We are screaming for more hosts, but yeah, storage is probably a problem for us. We have about 275 Windows (2003-2012 R2) and about 100 Linux (RHEL 4/5/6) VMs. We try to balance loads across the hosts manually, but haven't done it with any VMware tools.

That is a poo poo ton of VMs.

So when you said your storage was good you meant....?

Sickening fucked around with this message at 19:30 on Jul 3, 2014

skipdogg
Nov 29, 2004
Resident SRT-4 Expert

Holy poo poo. We're nowhere near capacity, but in our California location I'm running 151 VMs across 12 hosts, separated into 2 clusters, against a VNX5500 with a poo poo load of disk. A full shelf of SSD, 15K shelves for VMs, and NL-SAS for file storage. Somewhere in the neighborhood of 55TB total.

You poor bastard

mAlfunkti0n
May 19, 2004
Fallen Rib

mayodreams posted:

I think the VM storage pool is 10k drives in RAID 10, but I know the mass storage is 7200 RPM drives in RAID 6. We are screaming for more hosts, but yeah, storage is probably a problem for us. We have about 275 Windows (2003-2012 R2) and about 100 Linux (RHEL 4/5/6) VMs. We try to balance loads across the hosts manually, but haven't done it with any VMware tools.

:suicide:

Figuring even just a single CPU per VM... it still hurts. I'd say turn DRS on, but honestly I don't know that it would actually do anything unless you turned it to the most aggressive setting, and even then the overhead from vMotions probably kills any benefit that would be there.
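If anyone does want to try DRS at its most aggressive, here is a rough pyVmomi sketch of flipping one cluster to fully automated with the migration threshold maxed out. The vCenter address, credentials, and 'Prod-Cluster' name are placeholders, and whether the vMotion overhead is worth it is exactly the open question above.

code:
# Rough sketch: enable DRS, fully automated, most aggressive migration threshold.
# 'vcenter.example.com' and 'Prod-Cluster' are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Prod-Cluster")

    drs = vim.cluster.DrsConfigInfo()
    drs.enabled = True
    drs.defaultVmBehavior = vim.cluster.DrsConfigInfo.DrsBehavior.fullyAutomated
    drs.vmotionRate = 1   # 1 = most aggressive, 5 = most conservative

    spec = vim.cluster.ConfigSpecEx(drsConfig=drs)
    WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))
finally:
    Disconnect(si)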

Nitr0
Aug 17, 2005

IT'S FREE REAL ESTATE
"but I know the mass storage is 7200k drives in a raid6."

Oh man oh man oh man. No wonder VM blames storage whenever someone phones in with vcenter problems.

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

RAID 6 is fine for bulk storage so long as you know what you're doing. A good storage controller handles the parity calcs, and you get more capacity for a bunch of data that will get written and probably never accessed again, because nobody cleans up their poo poo ever.

I'd much rather throw some SSD cache into the array to speed things up if I need it than run the risk of corrupt data on rebuild and having to restore who knows what out of the backup system.

Dilbert As FUCK
Sep 8, 2007

by Cowcaster
Pillbug
SSD caching is good for papering over low backend IOPS, but cache exhaustion and cold-data access problems are real and can cause some WTF moments if the cache isn't sized properly.

Don't get me wrong, I love SSD caching, but I hate it when people go with something like a VNXe 3300 with 200GB of FAST Cache and 8x3TB drives in RAID 6 and then wonder why things hit a wall.

It's also a bit of an annoyance when people don't optimize their 2008 R2 templates for a virtual environment.

Dilbert As FUCK fucked around with this message at 23:25 on Jul 3, 2014

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Well yeah, but if your load is running sustained instead of bursty and not easily satisfied by SSD, then why would you be putting it on SATA instead of SAS in the first place? I'm all for a cost-effective solution, but I've also done the math on the average rate of an unrecoverable read error, and even with SAS drives, once you're on the high-capacity platters you're talking 1-in-10 odds of data corruption on rebuild with a RAID 1 or 10 array. I cannot for the life of me understand why so many people are recommending RAID 10 over RAID 6 when it puts your balls so close to the bandsaw. I would much rather take the double parity with the performance hit, because I can accelerate with SSD or more spindles rather than run the risk of data corruption.
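For the curious, the back-of-envelope math behind that "odds of corruption on rebuild" figure looks roughly like the sketch below. The drive size and the URE rates (10^-15 per bit for enterprise SAS, 10^-14 for consumer SATA) are generic datasheet-style assumptions, not numbers from this thread, so treat the output as illustrative.

code:
# Rough sketch: probability of hitting at least one unrecoverable read error (URE)
# while re-reading a full surviving mirror during a RAID 1/10 rebuild.
# Drive size and URE rates are illustrative assumptions, not measured values.

def rebuild_ure_probability(drive_tb, ure_per_bit):
    bits_read = drive_tb * 1e12 * 8          # the whole surviving copy must be read
    p_ok_per_bit = 1.0 - ure_per_bit
    return 1.0 - p_ok_per_bit ** bits_read   # P(at least one URE)

for label, rate in [("enterprise SAS (1e-15)", 1e-15),
                    ("consumer SATA (1e-14)", 1e-14)]:
    p = rebuild_ure_probability(drive_tb=4, ure_per_bit=rate)
    print(f"4 TB mirror rebuild, {label}: ~{p:.0%} chance of a URE")
# Typical output: ~3% at the SAS rate, ~27% at the SATA rate -- bigger platters
# push the SAS number toward the "1 in 10" ballpark mentioned above.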

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Give NetApp a big sack of money and watch your storage problems disappear: A Life Lesson

adorai
Nov 2, 2002

10/27/04 Never forget
Grimey Drawer

BangersInMyKnickers posted:

RAID 6 is fine for bulk storage so long as you know what you're doing.
RAID 6 is fine in general. We use only RAID 6 (well, RAID-Z2 and RAID-DP) in our environment and generally everything runs quite well. We do follow the "shitload of cache" philosophy, though.

Kachunkachunk
Jun 6, 2011
I'm not sure what to expect with bulk-storage RAID-6 these days. At the moment I'm dealing with a slow vCAC provisioning issue where some NAS has a RAID-6 made up of 'X' number of 7200-RPM drives, and they're complaining that it's taking an hour to do 8 simultaneous clones. Read latency is averaging 100ms, and I'm starting to think this is probably to be expected at this stage.

Networking is 10Gbit end-to-end, but their NFS activity (i.e. from the NAS) is not over jumbo frames. We've tried a single host and 8 separate hosts doing simultaneous clones, and the results are similar enough to say it's probably not a networking bottleneck. During these clones, pings remain timely, but transfers still seem slower than they expect or hope. We never once reached anything close to 10Gbit speeds, and doing line/speed testing is out of the question for now (great). I think I've seen maybe 2.5Gbit/s at best, aggregated over several servers.

The destination storage for the writes (clone/provisioning destinations) is over 8Gbit FC, I believe... but even if it were 4Gbit, it's not hitting line rates, and those are proper 10K+ SAS disks for the destination, with proper multipathing. And there's a ton of cache, so writes usually end up being quicker than reads (unless you fill the cache up completely with clones, but write latency was typically quite decent).

Any thoughts?
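One sanity check worth doing before blaming anything else: a back-of-envelope estimate of what a 7200-RPM RAID-6 source can actually deliver to 8 concurrent clone streams. The sketch below uses generic rule-of-thumb numbers (roughly 75 random read IOPS per 7200-RPM spindle, guessed spindle counts, a guessed I/O size), none of which come from this thread, so it only shows the shape of the math.

code:
# Rough sanity check: can a small 7200-RPM RAID-6 group feed 8 clone streams?
# Spindle count, per-disk IOPS, and I/O size are illustrative guesses, not
# numbers from this environment.

def raid6_read_estimate(spindles, iops_per_disk=75, io_size_kb=64):
    """Aggregate random-read IOPS and MB/s for the whole group (reads can use
    all spindles; RAID-6 parity mainly penalizes writes, not reads)."""
    total_iops = spindles * iops_per_disk
    mb_per_s = total_iops * io_size_kb / 1024
    return total_iops, mb_per_s

for spindles in (6, 12, 24):
    iops, mbps = raid6_read_estimate(spindles)
    per_clone = mbps / 8  # 8 simultaneous clones share the same pool of reads
    print(f"{spindles:2d} spindles: ~{iops} IOPS, ~{mbps:.0f} MB/s total, "
          f"~{per_clone:.0f} MB/s per clone")
# Even the optimistic 24-spindle case is nowhere near 10GbE line rate, and a
# sustained ~100ms read latency suggests the queue depth is already well past
# what the spindles can service comfortably.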
