BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

GrandMaster posted:

I thought the status within vSphere was dependent on the vSwitch having more than 1 NIC connected..

The status is based on being able to verify that you have two or more vmkernels going out over independent paths, both of which have access to all initiator IPs and their LUNs. It's the only way that the software can verify for itself that it has full redundancy. In Moey's case, he does have full redundancy in his configuration but the software just can't tell that because of how it is set up, so you get a report of partial/no redundancy. It's an easy enough fix, though, so long as you aren't already in production.
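
If you want to sanity-check it from the host itself, something like this from the ESXi shell shows the bound vmkernel ports and the per-LUN paths (commands from memory, so double-check the exact syntax on your build):

code:
# vmkernel ports bound to the software iSCSI adapter
esxcli iscsi networkportal list

# every path the host can see -- each LUN should list one path per vmkernel/NIC
esxcli storage core path list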

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

KS posted:

If you're not doing something special on those switches (like a VPC) then that's not really a good idea for iscsi traffic. With that diagram and traffic on a single subnet, are you relying on layer 2 redundancy and spanning tree? It's just not fast enough. Even if you are doing VPC and port-channels, IMO that's great for NFS but you should be relying on MPIO for iscsi.

Proper config would be:
Two distinct networks, one for each switch
Two adapters on the host, one in each network, connected to the appropriate switch
Two interfaces on the array, one in each network, connected to the appropriate switch

It's relying on MPIO in the iSCSI initiators. It's a recommended and supported configuration. There is no spanning tree because there are no loops between switches, just multiple endpoints and interconnected switches. The switch interconnect gives a bit more protection in a double-failure scenario, in the event that a NIC on the host and one on the NAS both fail at the same time. Without it, running two independent switches side-by-side, if the NIC on the left side of the host and the one on the right side of the NAS went out, you would completely lose storage traffic. The interconnect allows the right host NIC to talk to the left NAS NIC in that scenario. Not too likely, but it's easy, works, and is generally a better configuration.

e: It also allows you to transition to NFS with full redundancy much more easily. We learned that the hard way when we had to take everything down to re-wire.

KS
Jun 10, 2003
Outrageous Lumpwad
Yeah, thought about it a bit more and edited out the stp stuff, but it's still wrong. Without doing some sort of fixed path or weighted MPIO (and it'd be manual because nothing in the vmware plugin would detect this), you'd send half of your traffic from initiators on switch A to targets on switch B. You'd see link saturation on the cross-link and probably some out of order packet fun as well. Sure, it'll work, but it's far from optimal.

If you were doing it right to move to NFS, those switches would be in a VPC, iscsi would STILL be in independent VLANs on both switches, and you'd have a VLAN for NFS (or re-use ONE of the iscsi VLANs). ISCSI hosts would then get a link to each switch, and NFS hosts would get a port-channel to both switches. I can post NX-OS configs for this scenario.
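
Something in this spirit, as a rough sketch only (VLAN numbers and ports are made up, and it assumes the vPC domain and peer-link are already set up):

code:
! iSCSI: one VLAN per switch, a plain link to each switch, no port-channel
interface Ethernet1/20
  description esx01-iscsi-a
  switchport mode trunk
  switchport trunk allowed vlan 101

! NFS: one member on each switch, bundled into a vPC port-channel
interface port-channel21
  description esx01-nfs
  switchport mode trunk
  switchport trunk allowed vlan 200
  spanning-tree port type edge trunk
  vpc 21

interface Ethernet1/21
  description esx01-nfs
  switchport mode trunk
  switchport trunk allowed vlan 200
  channel-group 21 mode on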



BangersInMyKnickers posted:

The switch interconnect gives a bit more protection in a double-failure scenario, in the event that a NIC on the host and one on the NAS both fail at the same time.
Not really, because you'd be sending 75% of your traffic (half of the normal and all of the failed) over the ISL in that scenario, and you'd kill it under any kind of load. The correct fix for this is multiple NICs in the array.

edit: the good news is, if you're really running this way you could gain a fair bit of performance when you fix it. I'd suggest monitoring traffic volume on your ISL. You CAN do a layer-2 redundant setup on one subnet, but it involves VPC capable switches and port-channels to the hosts. I don't really see any indication that that's going on in that diagram.

KS fucked around with this message at 07:17 on Feb 1, 2013

Powdered Toast Man
Jan 25, 2005

TOAST-A-RIFIC!!!
I'm having some difficulty conceptually with how a hypervisor (specifically ESXi 5.1) works.

So, on a certain level, I understand that the hypervisor takes whatever resources are available to it and divides them between the virtual machines it is hosting. At the most basic level, this apparently means actual CPU cycles, which is why VMware measures "used performance" on a VM in MHz. You can also give guests more than one virtual CPU, and this is where it gets a bit weird for me.

I've been told that giving a VM more cores doesn't necessarily make it faster. This seems counter-intuitive, especially with applications that are multithreaded and benefit from running on a physical box with an abundance of cores. So for some tasks, you might have a physical server with 8 cores, but a VM with only 2, and the VM will even perform better. This breaks my brain. What's going on here?

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Powdered Toast Man posted:

I'm having some difficulty conceptually with how a hypervisor (specifically ESXi 5.1) works.

So, on a certain level, I understand that the hypervisor takes whatever resources are available to it and divides them between the virtual machines it is hosting. At the most basic level, this apparently means actual CPU cycles, which is why VMware measures "used performance" on a VM in MHz. You can also give guests more than one virtual CPU, and this is where it gets a bit weird for me.

I've been told that giving a VM more cores doesn't necessarily make it faster. This seems counter-intuitive, especially with applications that are multithreaded and benefit from running on a physical box with an abundance of cores. So for some tasks, you might have a physical server with 8 cores, but a VM with only 2, and the VM will even perform better. This breaks my brain. What's going on here?
Let's clear something up: there is no situation I can think of where a 2-vCPU VM will perform better than an 8-CPU bare metal box with the same hardware underneath. Virtualization's not magic. What you do see all the time are situations where a 2-vCPU virtual machine outperforms an 8-vCPU virtual machine doing the same thing on the same hardware.

This is a really tiny thing, and it's all due to a minor little detail called co-scheduling.

You might have heard the term "SMP" thrown around when talking about multi-CPU systems. SMP stands for Symmetric Multi-Processing, and "Symmetric" is really the key to understanding this. As a physical system with multiple CPUs runs, those CPUs need to be kept in lock-step; this is why all the CPUs in a multi-socket system need to be identical. If one of the CPUs somehow falls out of sync with the others, the entire system will crash and burn. This is, for obvious reasons, something you want to avoid whether you're virtualizing or not.

So in the olden days of virtualization, hypervisors did exactly this: they virtualized SMP by literally running all of the virtual CPUs simultaneously on the same number of physical CPUs. If you had a 4-CPU machine, and you wanted to run a 4-CPU VM on it, the hypervisor would wait until all four CPUs were free before giving your VM its time slice, letting it run, and then resuming business as usual. So, the hypervisor has to context-switch the VMs running on those CPUs, wait until the last one is off and do nothing with those completely free CPUs in the meantime, then run the gigantic monster VM. Nothing else could run while that VM was scheduled to monopolize all the CPUs in the system. SMP virtual machines weren't just slow; they dragged down your entire virtual environment, and not by any small amount.

SMP really sucked in those olden days.

Nowadays, we have something called relaxed co-scheduling. Some smartypants engineers with VMware figured out that you don't really need to keep the CPUs completely in lockstep, you just need to keep them really, really close. This sounds like a really minor distinction, but it has a major architectural implication -- you don't need to run all your vCPUs at once anymore.

So now one vCPU at a time can run, but if it gets too far ahead of the slowest vCPU, it needs to hang back and wait for that one to catch up or bad things can happen. There's a lot more flexibility, but the devil's in the details. If you're running an SMP VM, all of your vCPUs still need something to do. If one vCPU is constantly running ahead of the others, you end up going start-stop-start-stop-start-stop and never really running for more than a small fraction of the time the VM is supposed to be running. If you've dealt with OS internals at all, you know that too many context switches are a bad thing; in addition to the overhead of storing all those registers so you can run something else on the CPU, you're also killing your CPU cache in the process. Most hypervisors now implement some form of relaxed co-scheduling.

Relaxed co-scheduling reduces the overhead of scheduling an SMP VM from really loving bad to minimally tolerable in most cases. In the best case, a CPU-bound SMP VM churning at full tilt will incur almost no SMP overhead at all.

Bottom line: if you're using a properly multithreaded application, and not routinely processing a single-threaded batch job that blasts a single vCPU, you're fine with SMP (within reason). If you're running a small number of single-threaded workloads on a larger number of vCPUs, SMP will hurt your performance. Most operating systems in the Year of Our Lord 2013 support hot-adding CPUs, so as a general rule, start your VMs small and add CPUs as you need them. I still wouldn't recommend doing something like running an 8-vCPU VM on an 8-core box if you value your sanity.
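
If you want to watch this happen on a live host, esxtop shows the co-scheduling penalty directly (counter names from memory, so double-check on your build):

code:
# esxtop, press 'c' for the CPU view
#   %RDY  - time a vCPU was ready to run but couldn't get a physical core
#   %CSTP - co-stop: time vCPUs sat stopped waiting for their siblings to catch up
# a big SMP VM paying the co-scheduling tax shows up as high %CSTP, not just high %RDY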

Vulture Culture fucked around with this message at 08:32 on Feb 1, 2013

Erwin
Feb 17, 2006

Powdered Toast Man posted:

So for some tasks, you might have a physical server with 8 cores, but a VM with only 2, and the VM will even perform better. This breaks my brain. What's going on here?

You're probably thinking of instances where four 2-vCPU VMs will outperform the one 8-core physical box. VMware has some Hadoop benchmark documents that show this (it's a very slight advantage). This is for the same reason that 4 2-core physical boxes would (probably) also outperform that 8-core box on certain tasks.

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

KS posted:

Yeah, thought about it a bit more and edited out the stp stuff, but it's still wrong. Without doing some sort of fixed path or weighted MPIO (and it'd be manual because nothing in the vmware plugin would detect this), you'd send half of your traffic from initiators on switch A to targets on switch B. You'd see link saturation on the cross-link and probably some out of order packet fun as well. Sure, it'll work, but it's far from optimal.

If you were doing it right to move to NFS, those switches would be in a VPC, iscsi would STILL be in independent VLANs on both switches, and you'd have a VLAN for NFS (or re-use ONE of the iscsi VLANs). ISCSI hosts would then get a link to each switch, and NFS hosts would get a port-channel to both switches. I can post NX-OS configs for this scenario.

Not really, because you'd be sending 75% of your traffic (half of the normal and all of the failed) over the ISL in that scenario, and you'd kill it under any kind of load. The correct fix for this is multiple NICs in the array.

edit: the good news is, if you're really running this way you could gain a fair bit of performance when you fix it. I'd suggest monitoring traffic volume on your ISL. You CAN do a layer-2 redundant setup on one subnet, but it involves VPC capable switches and port-channels to the hosts. I don't really see any indication that that's going on in that diagram.

We handled the switch interconnect by bundling two links between them. Then you can do round-robin all day long if you want and you won't have saturation there any more than at any other place on the SAN. There also isn't anything wrong with doing failover with affinity, which would do the job as well, so the switch interconnect only gets used in a failure scenario. That configuration is extremely similar to your suggestion of two parallel switches, just with an added layer of redundancy.
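
(The ISL itself is just a two-member port-channel on each side -- a sketch in the same style as the Nexus configs later in the thread, with made-up port numbers:)

code:
interface port-channel30
  description ISL to switch B
  switchport mode trunk

interface Ethernet1/31
  description ISL member (repeat for the second link)
  switchport mode trunk
  channel-group 30 mode active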

Moey
Oct 22, 2010

I LIKE TO MOVE IT

BangersInMyKnickers posted:

The status is based on being able to verify that you have two or more vmkernels going out over independent paths, both of which have access to all initiator IPs and their LUNs. It's the only way that the software can verify for itself that it has full redundancy. In Moey's case, he does have full redundancy in his configuration but the software just can't tell that because of how it is set up, so you get a report of partial/no redundancy. It's an easy enough fix, though, so long as you aren't already in production.

I am not taking any performance hit on this am I?

The SAN has two controllers with connections from each controller split between the two switches.
The hosts have their iSCSI connections split between NICs.
Each host has two vmkernels for iSCSI bound to their own NIC.
Pathing policy is set to round robin.

The "path" number to each datastore is correct with the number of logical paths (including crossing from switch A to switch B).

Powdered Toast Man
Jan 25, 2005

TOAST-A-RIFIC!!!
So I assume that the latest ESXi is pretty good at co-scheduling...that being said, is it a bad idea to have more than twice as many vCPUs on guests as you have actual logical processors on the host?

Dilbert As FUCK
Sep 8, 2007

by Cowcaster
Pillbug

Powdered Toast Man posted:

So I assume that the latest ESXi is pretty good at co-scheduling...that being said, is it a bad idea to have more than twice as many vCPUs on guests as you have actual logical processors on the host?

No, the over-provisioning of vCPUs is very common; it is fine to have more vCPUs than physical/logical cores, but be aware that too much over-provisioning can cause performance degradation on the host and, in turn, the virtual machines. You'll need to check your CPU ready and the performance of your virtual machines in order to find the load that your host can take.
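
A rough way to collect that over time from the host shell is batch-mode esxtop (flag meanings from memory, so check the man page):

code:
# 30 samples, 15 seconds apart, dumped to CSV for graphing
esxtop -b -d 15 -n 30 > cpu-stats.csv
# then chart the %RDY column for each VM group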

Dilbert As FUCK fucked around with this message at 20:44 on Feb 1, 2013

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Moey posted:

I am not taking any performance hit on this am I?

The SAN has two controllers with connections from each controller split between the two switches.
The hosts have their iSCSI connections split between NICs.
Each host has two vmkernels for iSCSI bound to their own NIC.
Pathing policy is set to round robin.

The "path" number to each datastore is correct with the number of logical paths (including crossing from switch A to switch B).

No, you have full gigabit (I assume) fabric throughout, and so long as you have enough target LUNs (probably 4 or more) the round robin should roughly balance out the traffic across your two paths without having to specify specific paths. If you stick with the /24 subnet mask, you might as well remove the switch interconnect. No traffic would be able to pass over it anyway because your subnet mask will stop it.
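
(For the record, checking or flipping that policy from the ESXi shell looks roughly like this -- the device ID is just a placeholder:)

code:
# confirm the path selection policy per device
esxcli storage nmp device list

# set round robin on a specific LUN if it isn't already
esxcli storage nmp device set --device naa.600xxxx --psp VMW_PSP_RR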

Kachunkachunk
Jun 6, 2011

movax posted:

Giving this a try, but running into some issues. Had to set a different package path or something to get the compat modules to install, and looks like the VMXNET 3 modules aren't working properly in pfSense (won't get DHCP IPs, etc). :(

e: OK, got VMware tools installed, but the WAN VMXNET3 NIC won't get an IP whatsoever, and the LAN NIC won't respond to requests. Maybe I'll just stick with e1000 then.
I personally wouldn't really bother. Use a virtual NIC type that's natively supported by the kernel. Every time you update the system, it'll lose all the NICs (since the open-vm-tools package is no longer loaded or installed, or got hosed off somewhere), forcing you to shut down, insert E1000 NICs, restart, set up networking, get connected to the Internet to redownload or reinstall open-vm-tools, then reinsert your VMXNET NICs.

After doing that several times in the past few years, I decided to stick with E1000s and suck up the additional hit in CPU usage.

I can't really speak for the completely manual VMware Tools (on BSD) install, but if pfSense updates still break poo poo, I'm staying far away.
E1000E is also similar enough if you yearn for the 10-Gbit rates, though in many cases the virtual NICs achieve far greater speeds than their advertised/negotiated rates anyway.

For instance, between two 10Gbit VMXNET3 interfaces, I got 14.5 Gbit/s between two VMs on the same box, and 9.5 Gbit/s over the actual CNAs.
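
(For anyone who wants to try the same measurement, an iperf pair between two test VMs is all it takes -- the address and flags below are just a hypothetical example:)

code:
# on the first VM
iperf -s

# on the second VM: 30-second run, 4 parallel streams (192.168.50.10 is a made-up address)
iperf -c 192.168.50.10 -t 30 -P 4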

Serfer
Mar 10, 2003

The piss tape is real



We had our EMC rep come by and give us a little pep-talk/update on whatever, blah blah blah. She brought along one of their VARs (side note: I loving hate it when companies do this) and they were giving me the whole "SANs are dead, in a couple years nobody will care, it'll just be a big, slow NAS with big flash caches on every host."

Anyway, after talking with them a little more, I didn't realize that they had gotten the application acceleration flash caches working in VMWare, with vMotion also working. But what I would like to know is what kind of real world performance increases have people seen when using it. Which products did you go with and why? I mean, I don't think I would go with EMC, because it locks you into their software, and their flash hardware, but maybe they're the best option?

Powdered Toast Man
Jan 25, 2005

TOAST-A-RIFIC!!!

Corvettefisher posted:

No, the over-provisioning of vCPUs is very common; it is fine to have more vCPUs than physical/logical cores, but be aware that too much over-provisioning can cause performance degradation on the host and, in turn, the virtual machines. You'll need to check your CPU ready and the performance of your virtual machines in order to find the load that your host can take.

I should probably also mention that we're one host failure away from taking down an entire $500M company. Our redundancy is laughable. To be fair, the host servers are good quality hardware (Dell R710), but...it's like this small nagging voice in the back of my head is always telling me how totally hosed we would be if anything ever went wrong. It always cracks me up when people say things "aren't in the budget" but don't consider the possible costs of system failures.

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Serfer posted:

We had our EMC rep come by and give us a little pep-talk/update on whatever, blah blah blah. She brought along one of their VARs (side note: I loving hate it when companies do this) and they were giving me the whole "SANs are dead, in a couple years nobody will care, it'll just be a big, slow NAS with big flash caches on every host."

Anyway, after talking with them a little more, I didn't realize that they had gotten the application acceleration flash caches working in VMWare, with vMotion also working. But what I would like to know is what kind of real world performance increases have people seen when using it. Which products did you go with and why? I mean, I don't think I would go with EMC, because it locks you into their software, and their flash hardware, but maybe they're the best option?

Dell has been brewing up a lot of crazy stuff, and their push to go private is partially due to their desire to do more skunkworks projects and actually produce something of their own instead of slapping their badges on things other people make for them. I have a feeling that the way we think of storage currently is going to change a lot over the next three to five years, and that VAR isn't very far off. If you can put yourself into a holding pattern for that long to see what is going to happen, I would recommend it.

Dilbert As FUCK
Sep 8, 2007

by Cowcaster
Pillbug

Serfer posted:

We had our EMC rep come by and give us a little pep-talk/update on whatever, blah blah blah. She brought along one of their VARs (side note: I loving hate it when companies do this) and they were giving me the whole "SANs are dead, in a couple years nobody will care, it'll just be a big, slow NAS with big flash caches on every host."

Anyway, after talking with them a little more, I didn't realize that they had gotten the application acceleration flash caches working in VMWare, with vMotion also working. But what I would like to know is what kind of real world performance increases have people seen when using it. Which products did you go with and why? I mean, I don't think I would go with EMC, because it locks you into their software, and their flash hardware, but maybe they're the best option?

Some of the storage vendors are doing some pretty cool stuff moving data between local and shared storage. Not to mention flash technology is really changing a lot of things.

Flash is loving amazing for VDI. EMC's FAST (seriously, look this up) is almost a must-have on any VDI deploy due to how VDI environments operate. Fusion-IO is really amazing stuff as well with HA and deploying new desktops.

I have a feeling 'traditional' SANs will still be around for quite some time for enterprises; however, the SMB market and how it addresses storage networks will change more dynamically in the coming years.

Here is a demo of 510 virtual desktops booting with a VNX and FAST.
https://www.youtube.com/watch?v=-qmxMVPqo_o

It's pretty loving cool

Dilbert As FUCK fucked around with this message at 22:52 on Feb 1, 2013

Moey
Oct 22, 2010

I LIKE TO MOVE IT

BangersInMyKnickers posted:

No, you have full gigabit (I assume) fabric throughout, and so long as you have enough target LUNs (probably 4 or more) the round robin should roughly balance out the traffic across your two paths without having to specify specific paths. If you stick with the /24 subnet mask, you might as well remove the switch interconnect. No traffic would be able to pass over it anyway because your subnet mask will stop it.

Full gigabit (I got to implement 10GbE at my old place :().

Maybe I am missing something but why do you say the switch interconnect does nothing? Everything is within the same /24 so I do not see why it wouldn't be routable over it.

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Moey posted:

Full gigabit (I got to implement 10GbE at my old place :().

Maybe I am missing something but why do you say the switch interconnect does nothing? Everything is within the same /24 so I do not see why it wouldn't be routable over it.

Initially you said that while you were using the /24 subnet, your IPs were in the 192.168.x.x range. Is that not actually the case?

Serfer
Mar 10, 2003

The piss tape is real



BangersInMyKnickers posted:

Dell has been brewing up a lot of crazy stuff, and their push to go private is partially due to their desire to do more skunkworks projects and actually produce something of their own instead of slapping their badges on things other people make for them. I have a feeling that the way we think of storage currently is going to change a lot over the next three to five years, and that VAR isn't very far off. If you can put yourself into a holding pattern for that long to see what is going to happen, I would recommend it.

Oh, I agree with them that SAN systems are going to die out, whether it's something like OpenStack that kills them with distributed storage or host-based flash caches. I just want to know if anyone has had any experience with any of the systems: Fusion-IO's turbo system, or EMC's VFCache, or even SanDisk's FlashSoft. I mean, I know it's only been like 4 months since it started working with VMware stuff, but some people are on the bleeding edge.

Moey
Oct 22, 2010

I LIKE TO MOVE IT

BangersInMyKnickers posted:

Initially you said that while you were using the /24 subnet, your IPs were in the 192.168.x.x range. Is that not actually the case?

This is the case. Everything (SAN and hosts) is in the same 192.168.x.x range. So in the picture that I drew, the host has 4 paths to the SAN. If I pull the switch interconnect, that host will only have 2 paths.

Edit: I said "routable" in my last post, that was the wrong term.

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams

BangersInMyKnickers posted:

Dell has been brewing up a lot of crazy stuff and their push to go private in partially due to their desire to do more skunkworks projects and actually produce something of their own instead of slapping their badges on things other people make for them. I have a feeling that the way we think of storage currently is going to change a lot over the next three to five years and that VAR isn't very far off. If you can put yourself in to a holding pattern for that long to see what is going to happen, I would recommend it.

They've bought companies to provide basically the entire IT infrastructure. I wouldn't be surprised if their last acquisition before going private is a business that reuses shipping containers, and then they'll just start drop-shipping data centers a la Project Blackbox.

adorai
Nov 2, 2002

10/27/04 Never forget
Grimey Drawer

Powdered Toast Man posted:

So I assume that the latest ESXi is pretty good at co-scheduling...that being said, is it a bad idea to have more than twice as many vCPUs on guests as you have actual logical processors on the host?
We are somewhere around 5 to 1 on our server farm.

Internet Explorer
Jun 1, 2005





But if you are overcommitted 5 to 1 because each one of your virtual servers has the same number of vCPUs as your processors have cores... that may be a problem.

Powdered Toast Man
Jan 25, 2005

TOAST-A-RIFIC!!!

Internet Explorer posted:

But if you are overcommitted 5 to 1 because each one of your virtual servers has the same number of vCPUs as your processors have cores... that may be a problem.

I was under the impression that as long as you have enough physical memory, things might get really slow if you're heavily overcommitted, but it wouldn't necessarily tank your VMs.

As an example, we have a dual hex core box with 24 logical processors that has about 50 vCPUs on it. Everything runs pretty well. Then again, it has 128GB of RAM.

adorai
Nov 2, 2002

10/27/04 Never forget
Grimey Drawer

Internet Explorer posted:

But if you are overcommitted 5 to 1 because each one of your virtual servers has the same number of vCPUs as your processors have cores... that may be a problem.
We are very conservative and only provision one vCPU per box unless we experience a need for more. We regularly ignore vendor requirements for more than one. We have six 8-core processors in our main server cluster and approximately 200 VMs. Of those 200, roughly 1/4 have 2 vCPUs, and a few have 3 vCPUs. We have 128GB of RAM per processor.

Dilbert As FUCK
Sep 8, 2007

by Cowcaster
Pillbug

Powdered Toast Man posted:

I was under the impression that as long as you have enough physical memory, things might get really slow if you're heavily overcommitted, but it wouldn't necessarily tank your VMs.

As an example, we have a dual hex core box with 24 logical processors that has about 50 vCPUs on it. Everything runs pretty well. Then again, it has 128GB of RAM.

Generally RAM is the factor that limits a box; however, improper vCPU assignment can hinder what your host can acceptably sustain. Monitoring CPU ready and processing time is important; don't assign more than what you need or what's recommended.

MC Fruit Stripe
Nov 26, 2002

around and around we go
VMware lab question. I had mine all set up perfectly, but was in a real deadspot at work with nothing to do over the past few weeks so I just deleted the entire thing and started fresh. Fun stuff.

But now, I've got a few ESXi instances running, all connected through vcenter, and with openfiler playing SAN on the backend. Carved out a 200gb VMs LUN and presented it, connected it up, put my ISOs on a separate datastore, created the VM itself, ran it

Aaaannnddd tiiiimmmeee ssslllooowwweeeeeddd ddoooooowwwwnnnnnnnnnn

I've never seen Windows install this slowly, and I've never had this problem before, so I'm a little :gonk: about it. I'm 20 minutes into a Server 2008 install, and it's at 8% on expanding files (not even installing).

Has anyone seen this before, or have suggestions? It is not a "yo poo poo is slow" problem, it is a "yo poo poo is hosed", but I can't figure out where it's hosed.

----

Problem #2 - I have moved 50gb worth of ISOs and data to a datastore which is hosted on my SAN. The files are there. The files are accessible. But where are the files? Like, I am browsing the hard drive of the lab box, and I swear to everything holy I can not find my god damned datastore. The SAN's vmdk files are a total of like 1gb. What is going on on this lab box? I am in bizarro world.

MC Fruit Stripe fucked around with this message at 07:46 on Feb 2, 2013

evil_bunnY
Apr 2, 2003

Repost from the cisco thread, since half the problem is virtualization-related. Maybe one of you guys has a clue.

I'm once again stumped by my Nexuses' unwillingness to do my bidding.

The problem is that I can't get my port-channels to some new ESXi hosts to get/stay up.

Config: (identical on both switches)
code:
Nex-One# sho run

interface port-channel11
  description esx01-nfs
  switchport mode trunk
  switchport trunk allowed vlan 739
  spanning-tree port type edge trunk
  speed 10000
  vpc 11

interface Ethernet1/11
  description esx01-nfs
  switchport mode trunk
  switchport trunk allowed vlan 739
  channel-group 11 mode active
  
Nex-Two# sho run

interface port-channel11
  description esx01-nfs
  switchport mode trunk
  switchport trunk allowed vlan 739
  spanning-tree port type edge trunk
  speed 10000
  vpc 11
  
interface Ethernet1/11
  description esx01-nfs
  switchport mode trunk
  switchport trunk allowed vlan 739
  channel-group 11 mode active
Port info
code:
Nex-One# show int port-channel 11
port-channel11 is down (No operational members)
 vPC Status: Down, vPC number: 11 [packets forwarded via vPC peer-link]
  Hardware: Port-Channel, address: 547f.eea0.8532 (bia 547f.eea0.8532)
  Description: esx01-nfs
  MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec
  reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA
  Port mode is trunk
  auto-duplex, 10 Gb/s
  Input flow-control is off, output flow-control is off
  Switchport monitor is off 
  EtherType is 0x8100 
  Members in this channel: Eth1/11
  Last clearing of "show interface" counters never
  30 seconds input rate 0 bits/sec, 0 packets/sec
  30 seconds output rate 408 bits/sec, 0 packets/sec
  Load-Interval #2: 5 minute (300 seconds)
    input rate 0 bps, 0 pps; output rate 224 bps, 0 pps
  RX
    0 unicast packets  0 multicast packets  59 broadcast packets
    59 input packets  3776 bytes
    0 jumbo packets  0 storm suppression bytes
    0 runts  0 giants  0 CRC  0 no buffer
    0 input error  0 short frame  0 overrun   0 underrun  0 ignored
    0 watchdog  0 bad etype drop  0 bad proto drop  0 if down drop
    0 input with dribble  0 input discard
    0 Rx pause
  TX
    3035 unicast packets  6470472 multicast packets  36251 broadcast packets
    6509758 output packets  570220794 bytes
    0 jumbo packets
    0 output errors  0 collision  0 deferred  0 late collision
    0 lost carrier  0 no carrier  0 babble 0 output discard
    0 Tx pause
  0 interface resets

Nex-One# sh int ethernet 1/11
Ethernet1/11 is up
  Hardware: 1000/10000 Ethernet, address: 547f.eea0.8532 (bia 547f.eea0.8532)
  Description: esx01-nfs
  MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec
  reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA
  Port mode is trunk
  full-duplex, 10 Gb/s, media type is 10G
  Beacon is turned off
  Input flow-control is off, output flow-control is off
  Rate mode is dedicated
  Switchport monitor is off 
  EtherType is 0x8100 
  Last link flapped 00:11:00
  Last clearing of "show interface" counters never
  30 seconds input rate 0 bits/sec, 0 packets/sec
  30 seconds output rate 328 bits/sec, 0 packets/sec
  Load-Interval #2: 5 minute (300 seconds)
    input rate 0 bps, 0 pps; output rate 224 bps, 0 pps
  RX
    0 unicast packets  0 multicast packets  59 broadcast packets
    59 input packets  3776 bytes
    0 jumbo packets  0 storm suppression bytes
    0 runts  0 giants  0 CRC  0 no buffer
    0 input error  0 short frame  0 overrun   0 underrun  0 ignored
    0 watchdog  0 bad etype drop  0 bad proto drop  0 if down drop
    0 input with dribble  0 input discard
    0 Rx pause
  TX
    3035 unicast packets  6470122 multicast packets  36247 broadcast packets
    6509404 output packets  570186480 bytes
    0 jumbo packets
    0 output errors  0 collision  0 deferred  0 late collision
    0 lost carrier  0 no carrier  0 babble 0 output discard
    0 Tx pause
  8 interface resets
  
Nex-One# show port-channel summary 
Flags:  D - Down        P - Up in port-channel (members)
        I - Individual  H - Hot-standby (LACP only)
        s - Suspended   r - Module-removed
        S - Switched    R - Routed
        U - Up (port-channel)
        M - Not in use. Min-links not met
--------------------------------------------------------------------------------
Group Port-       Type     Protocol  Member Ports
      Channel
--------------------------------------------------------------------------------
1     Po1(SU)     Eth      LACP      Eth1/1(P)    
2     Po2(SU)     Eth      LACP      Eth1/2(P)    
3     Po3(SD)     Eth      LACP      Eth1/3(D)    
4     Po4(SD)     Eth      LACP      Eth1/4(D)    
5     Po5(SD)     Eth      LACP      Eth1/5(D)    
6     Po6(SD)     Eth      LACP      Eth1/6(D)    
7     Po7(SD)     Eth      LACP      Eth1/7(D)    
8     Po8(SD)     Eth      LACP      Eth1/8(D)    
9     Po9(SD)     Eth      LACP      Eth1/9(D)    
10    Po10(SD)    Eth      LACP      Eth1/10(D)   
11    Po11(SD)    Eth      LACP      Eth1/11(I)   
12    Po12(SD)    Eth      LACP      Eth1/12(I)   Eth1/13(I)   
13    Po13(SD)    Eth      NONE      --
14    Po14(SD)    Eth      NONE      --
15    Po15(SD)    Eth      NONE      --
100   Po100(SU)   Eth      LACP      Eth1/30(P)   Eth1/31(P)   
101   Po101(SU)   Eth      LACP      Eth1/29(P)  
I'm only worried about port-channel 11 so far (two links, on port 11 of both switches).

Log file:
code:
Nex-One# 
2013 Feb  4 14:19:33 Nex-One %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet1/12 is down (Link failure)
2013 Feb  4 14:19:33 Nex-One %ETH_PORT_CHANNEL-5-PORT_INDIVIDUAL_DOWN: individual port Ethernet1/12 is down
2013 Feb  4 14:19:33 Nex-One %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet1/11 is down (Link failure)
2013 Feb  4 14:19:33 Nex-One %ETH_PORT_CHANNEL-5-PORT_INDIVIDUAL_DOWN: individual port Ethernet1/11 is down
2013 Feb  4 14:19:34 Nex-One %ETHPORT-5-SPEED: Interface Ethernet1/12, operational speed changed to 10 Gbps
2013 Feb  4 14:19:34 Nex-One %ETHPORT-5-IF_DUPLEX: Interface Ethernet1/12, operational duplex mode changed to Full
2013 Feb  4 14:19:34 Nex-One %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface Ethernet1/12, operational Receive Flow Control state changed to off
2013 Feb  4 14:19:34 Nex-One %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface Ethernet1/12, operational Transmit Flow Control state changed to off
2013 Feb  4 14:19:34 Nex-One %ETHPORT-5-SPEED: Interface port-channel12, operational speed changed to 10 Gbps
2013 Feb  4 14:19:34 Nex-One %ETHPORT-5-IF_DUPLEX: Interface port-channel12, operational duplex mode changed to Full
2013 Feb  4 14:19:34 Nex-One %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface port-channel12, operational Receive Flow Control state changed to off
2013 Feb  4 14:19:34 Nex-One %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface port-channel12, operational Transmit Flow Control state changed to off
2013 Feb  4 14:19:35 Nex-One %ETHPORT-5-SPEED: Interface Ethernet1/11, operational speed changed to 10 Gbps
2013 Feb  4 14:19:35 Nex-One %ETHPORT-5-IF_DUPLEX: Interface Ethernet1/11, operational duplex mode changed to Full
2013 Feb  4 14:19:35 Nex-One %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface Ethernet1/11, operational Receive Flow Control state changed to off
2013 Feb  4 14:19:35 Nex-One %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface Ethernet1/11, operational Transmit Flow Control state changed to off
2013 Feb  4 14:19:35 Nex-One %ETHPORT-5-SPEED: Interface port-channel11, operational speed changed to 10 Gbps
2013 Feb  4 14:19:35 Nex-One %ETHPORT-5-IF_DUPLEX: Interface port-channel11, operational duplex mode changed to Full
2013 Feb  4 14:19:35 Nex-One %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface port-channel11, operational Receive Flow Control state changed to off
2013 Feb  4 14:19:35 Nex-One %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface port-channel11, operational Transmit Flow Control state changed to off
2013 Feb  4 14:19:44 Nex-One %ETH_PORT_CHANNEL-4-PORT_INDIVIDUAL: port Ethernet1/12 is operationally individual
2013 Feb  4 14:19:44 Nex-One %ETHPORT-5-IF_UP: Interface Ethernet1/12 is up in mode trunk
2013 Feb  4 14:19:45 Nex-One %ETH_PORT_CHANNEL-4-PORT_INDIVIDUAL: port Ethernet1/11 is operationally individual
2013 Feb  4 14:19:45 Nex-One %ETHPORT-5-IF_UP: Interface Ethernet1/11 is up in mode trunk
2013 Feb  4 14:21:16 Nex-One %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet1/11 is down (Link failure)
2013 Feb  4 14:21:16 Nex-One %ETH_PORT_CHANNEL-5-PORT_INDIVIDUAL_DOWN: individual port Ethernet1/11 is down
2013 Feb  4 14:21:16 Nex-One %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet1/12 is down (Link failure)
2013 Feb  4 14:21:16 Nex-One %ETH_PORT_CHANNEL-5-PORT_INDIVIDUAL_DOWN: individual port Ethernet1/12 is down
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-SPEED: Interface Ethernet1/11, operational speed changed to 10 Gbps
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-IF_DUPLEX: Interface Ethernet1/11, operational duplex mode changed to Full
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface Ethernet1/11, operational Receive Flow Control state changed to off
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface Ethernet1/11, operational Transmit Flow Control state changed to off
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-SPEED: Interface port-channel11, operational speed changed to 10 Gbps
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-IF_DUPLEX: Interface port-channel11, operational duplex mode changed to Full
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface port-channel11, operational Receive Flow Control state changed to off
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface port-channel11, operational Transmit Flow Control state changed to off
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-SPEED: Interface Ethernet1/12, operational speed changed to 10 Gbps
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-IF_DUPLEX: Interface Ethernet1/12, operational duplex mode changed to Full
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface Ethernet1/12, operational Receive Flow Control state changed to off
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface Ethernet1/12, operational Transmit Flow Control state changed to off
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-SPEED: Interface port-channel12, operational speed changed to 10 Gbps
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-IF_DUPLEX: Interface port-channel12, operational duplex mode changed to Full
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface port-channel12, operational Receive Flow Control state changed to off
2013 Feb  4 14:21:18 Nex-One %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface port-channel12, operational Transmit Flow Control state changed to off
2013 Feb  4 14:21:28 Nex-One %ETH_PORT_CHANNEL-4-PORT_INDIVIDUAL: port Ethernet1/11 is operationally individual
Ideas?

ESXi config (screenshots): vSwitch1, vmnic5, vmnic4, and the NIC teaming policy.

evil_bunnY
Apr 2, 2003

MC Fruit Stripe posted:

I am browsing the hard drive of the lab box, and I swear to everything holy I can not find my god damned datastore. The SAN's vmdk files are a total of like 1gb. What is going on on this lab box? I am in bizarro world.
I don't understand what you're trying to do. If your ISOs are on your SAN, you just connect the ISO from that datastore.

Dilbert As FUCK
Sep 8, 2007

by Cowcaster
Pillbug
On your host you have the NIC teaming/load balancing set to IP hash, right?

Here's the KB on what you are doing: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004048

MC Fruit Stripe posted:

VMware lab question. I had mine all set up perfectly, but was in a real deadspot at work with nothing to do over the past few weeks so I just deleted the entire thing and started fresh. Fun stuff.

But now, I've got a few ESXi instances running, all connected through vcenter, and with openfiler playing SAN on the backend. Carved out a 200gb VMs LUN and presented it, connected it up, put my ISOs on a separate datastore, created the VM itself, ran it

Aaaannnddd tiiiimmmeee ssslllooowwweeeeeddd ddoooooowwwwnnnnnnnnnn

I've never seen Windows install this slowly, and I've never had this problem before, so I'm a little :gonk: about it. I'm 20 minutes into a Server 2008 install, and it's at 8% on expanding files (not even installing).

Has anyone seen this before, or have suggestions? It is not a "yo poo poo is slow" problem, it is a "yo poo poo is hosed", but I can't figure out where it's hosed.

----

Problem #2 - I have moved 50gb worth of ISOs and data to a datastore which is hosted on my SAN. The files are there. The files are accessible. But where are the files? Like, I am browsing the hard drive of the lab box, and I swear to everything holy I can not find my god damned datastore. The SAN's vmdk files are a total of like 1gb. What is going on on this lab box? I am in bizarro world.

First off:
Are you using software iSCSI or hardware?
Version of ESXi?
Is this the only VM on the box, or are there multiple VMs?
If it is an Openfiler SAN, did you install tools and use the proper storage adapter? Thick or thin disks?
NFS or iSCSI? If NFS, I would suggest you assign it 1-2GB of RAM and 2 vCPUs; iSCSI should be fine with 1 vCPU and 1GB of RAM. Run a conary updateall on the Openfiler box.
(You can also try FreeNAS, which has some VAAI/ZFS support.)


What I would try:
Move the ISO to the local datastore and see if the time required to deploy goes down.
Check what the datastore is reporting as used, as well as the network path to the datastore. Check for high latency to each datastore (see the esxtop sketch below).
What are the specs of the VM you are installing? Sometimes I'll assign more resources to a VM to build a template, then trim them down after install/updates.
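
A rough way to check that latency from the ESXi host itself (counter names from memory, so double-check):

code:
# esxtop, press 'd' for disk adapters or 'u' for devices
#   DAVG/cmd - latency coming back from the device/array
#   KAVG/cmd - latency added by the VMkernel
#   GAVG/cmd - what the guest actually sees (roughly DAVG + KAVG)
# sustained GAVG in the tens of milliseconds on the Openfiler LUN points at the storage end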

Problem #2: I have no idea what you are asking here.

Dilbert As FUCK fucked around with this message at 16:41 on Feb 4, 2013

evil_bunnY
Apr 2, 2003

Yeah:


Dilbert As FUCK
Sep 8, 2007

by Cowcaster
Pillbug
For the hell of it, have you tried taking the vSwitch port group off ALL VLANs and specifying 739?

E:
Also, is this 5.0 or 5.1?
And they aren't Broadcom NICs, are they?

Dilbert As FUCK fucked around with this message at 17:06 on Feb 4, 2013

evil_bunnY
Apr 2, 2003

I have not. It's 5.0 and Intel 10GbE NICs.

Powdered Toast Man
Jan 25, 2005

TOAST-A-RIFIC!!!

Corvettefisher posted:

No, the over-provisioning of vCPUs is very common; it is fine to have more vCPUs than physical/logical cores, but be aware that too much over-provisioning can cause performance degradation on the host and, in turn, the virtual machines. You'll need to check your CPU ready and the performance of your virtual machines in order to find the load that your host can take.

Esxtop on the two hosts I was concerned about showed most of the VMs under 1% RDY, although one is hovering at around 5%, which is supposed to be bad, I guess? It's a very high traffic web server.

Dilbert As FUCK
Sep 8, 2007

by Cowcaster
Pillbug

Powdered Toast Man posted:

Esxtop on the two hosts I was concerned about showed most of the VMs under 1% RDY, although one is hovering at around 5%, which is supposed to be bad, I guess? It's a very high traffic web server.

<1% is great, 1-3% is okay, 5+ is where the VM might be experiencing some slowdowns.

How many vCPUs does it have, and how utilized are those cores (reported from within the web server, i.e. Task Manager/top)?

Powdered Toast Man
Jan 25, 2005

TOAST-A-RIFIC!!!
Here's a current view of esxtop on the host in question:

(whited out areas are VM names)



The server has two X5680 processors and 96GB of memory; it's a Dell R710.

Combat Pretzel
Jun 23, 2004

No, seriously... what kurds?!
Here's something I'm really curious about:

If Hyper-V's only intended to run Windows as dom0, and the Windows kernel has awareness of Hyper-V, why the hell isn't the hypervisor integrated into the kernel? You know, similar to KVM, probably able to get some more performance out of it all, because there wouldn't be this hypercall barrier between the hypervisor and the host OS.

Dilbert As FUCK
Sep 8, 2007

by Cowcaster
Pillbug

Powdered Toast Man posted:

Here's a current view of esxtop on the host in question:

(whited out areas are VM names)



The server has two X5680 processors and 96GB of memory; it's a Dell R710.
It really comes down to whether the CPU ready time is noticeably affecting performance for your end users.

Things I would check:
Reduce the number of vCPUs on the VM. It sounds a bit counterintuitive, but it reduces the time the scheduler spends asking "can I run this?" and can reduce CPU ready. Enable Hot Add, which allows non-disruptive assignment of vCPUs.
Enable hyperthreading if you haven't already, as well as VT-d.
Move it to a host with a newer/higher-performing CPU (assuming you set up EVC, this should be non-disruptive).

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

evil_bunnY posted:

Repost from the cisco thread, since half the problem is virtualization-related. Maybe one of you guys has a clue.

I'm once again stumped by my Nexus'es unwillingness to do my bidding.

The problem is that I can't get my port-channels to some new ESXi hosts to get/stay up.

Config: (identical on both switches)
[code]
Nex-One# sho run

interface port-channel11
description esx01-nfs
switchport mode trunk
switchport trunk allowed vlan 739
spanning-tree port type edge trunk
speed 10000
vpc 11

interface Ethernet1/11
description esx01-nfs
switchport mode trunk
switchport trunk allowed vlan 739
channel-group 11 mode active
...


You're using LACP with a standard vSwitch. This won't work.

Change to 'channel-group X mode on'

Also make sure the switch's port-channel load balancing is configured for src-dst-ip, assuming that still exists on a Nexus. src-dst-ip-port (or whatever it's called) may also work.
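
In other words, something like this on the switch side, paired with "Route based on IP hash" in the vSwitch teaming policy (sketch only -- just the channel-group mode changes from the config you posted):

code:
interface Ethernet1/11
  description esx01-nfs
  switchport mode trunk
  switchport trunk allowed vlan 739
  channel-group 11 mode on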

1000101 fucked around with this message at 18:53 on Feb 4, 2013

Viktor
Nov 12, 2005

I'm researching dumping XenServer, as we're up for renewal, and moving over to a VMware vSphere Essentials kit. I wanted to handle the whole data recovery piece in one shot. How is VDP in the Essentials Plus kit? Can I get away with using it with a backup NAS instead of Veeam?
