GrandMaster
Aug 15, 2004
laidback

CommieGIR posted:

Except it's a new issue. ESXi has run fine from both devices for years; something they changed broke it.

Exactly, they shouldn't need high I/O. The tools mount/scratch run from ramdisk and logging goes to a SAN datastore. What else does the card need to be doing?

Also, if the card were actually corrupted, it wouldn't come back after a reboot. I feel bad for the hardware vendors that probably sent out thousands of replacements for cards that weren't even busted.

Anyways, it was fine in 7.0 U1; it was U2 that broke it.


CommieGIR
Aug 22, 2006

The blue glow is a feature, not a bug


Pillbug

GrandMaster posted:

Exactly, they shouldn't need high I/O. The tools mount/scratch run from ramdisk and logging goes to a SAN datastore. What else does the card need to be doing?

Also, if the card were actually corrupted, it wouldn't come back after a reboot. I feel bad for the hardware vendors that probably sent out thousands of replacements for cards that weren't even busted.

Anyways, it was fine in 7.0 U1; it was U2 that broke it.

There was always the possibility of corrupting your SD card; in fact, a lot of hyperconverged stuff uses either mirrored SD cards or SATADOM modules to address this.

Thanks Ants
May 21, 2004

#essereFerrari


SSDs are so cheap now that I'd probably just stick a couple of low-end 512GB disks in RAID1 and boot vSphere off that. I used to mess around with the mirrored SD cards in Dell boxes, but the write speeds got painful sometimes.

Number19
May 14, 2003

HOCKEY OWNS
FUCK YEAH


I’m speccing my new VMware servers with boot SSDs now too. It gives the option to host a few test/junk VMs on local datastores and avoids the SD/USB boot issues that have been popping up. It’s also more in line with VMware’s recommendations, and to be honest the cost is pretty marginal given how marked up the SD cards were anyways.

CommieGIR
Aug 22, 2006

The blue glow is a feature, not a bug


Pillbug
I use two of these in a mirror

https://www.ebay.com/itm/164875884280?hash=item26635e56f8:g:V2kAAOSwDFRgppEV

Pile Of Garbage
May 28, 2007



I recall commissioning some new IBM HS22 blades maybe a decade ago now, running whatever was the latest ESXi at the time, and they had an issue where their DAS would just go offline, rendering the host unmanageable (VMs would keep running but you couldn't manage or migrate them). Back then the fix we settled on was to just order some of the special IBM SD cards to use as the boot device on the blades instead of the DAS.

Weird that it sounds like the reverse is now the issue lol.

Pikehead
Dec 3, 2006

Looking for WMDs, PM if you have A+ grade stuff
Fun Shoe

GrandMaster posted:

Exactly, they shouldn't need high I/O. The tools mount/scratch run from ramdisk and logging goes to a SAN datastore. What else does the card need to be doing?

Also, if the card were actually corrupted, it wouldn't come back after a reboot. I feel bad for the hardware vendors that probably sent out thousands of replacements for cards that weren't even busted.

Anyways, it was fine in 7.0 U1; it was U2 that broke it.

Beats me what's doing it, but there have been issues going all the way back to 6.0 where VMware has blamed IO on crappy SD cards:

https://kb.vmware.com/s/article/2149257

My organisation's solution to this, and to the upcoming deprecation in 7.x, has been to start moving to boot-from-SAN, which also means we can finally start automating ESXi host builds.
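
For anyone curious, the scripted-install side of ESXi is just a kickstart file. A rough sketch of what one can look like - placeholder values, not our actual config, so double-check the options against the install docs for your version:

code:
# Hypothetical ks.cfg for a scripted ESXi install (values are placeholders)
vmaccepteula
# Root password in plain text here for brevity; an encrypted hash is also supported
rootpw VMware1!
# Install to the first remote (SAN) LUN the installer sees, overwriting any existing VMFS
install --firstdisk=remote --overwritevmfs
# Simple DHCP networking on the first NIC
network --bootproto=dhcp --device=vmnic0
reboot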

Zorak of Michigan
Jun 10, 2006

Pikehead posted:

Beats me what's doing it, but there have been issues going all the way back to 6.0 where VMware has blamed IO on crappy SD cards:

https://kb.vmware.com/s/article/2149257

My organisation's solution to this, and to the upcoming deprecation in 7.x, has been to start moving to boot-from-SAN, which also means we can finally start automating ESXi host builds.

I've given thought to that but our main data stores are NFS, so I'd be adding a whole new dependency to our environment. How's the transition worked for you?

Pikehead
Dec 3, 2006

Looking for WMDs, PM if you have A+ grade stuff
Fun Shoe

Zorak of Michigan posted:

I've given thought to that but our main data stores are NFS, so I'd be adding a whole new dependency to our environment. How's the transition worked for you?

Yeah, if you're on NFS then that's a big issue - all our mainline storage is Fibre Channel, so boot-from-SAN involves some fiddling up front but then a fairly smooth experience.

Like everything else it's ground down into "whenever someone has spare time", i.e. very slowly.

It's a hard requirement for us though when we finally make it to 7.x, due to all the problems inherent with SD cards (the various corruption issues that keep occurring, the lack of a persistent log location, the absolute crap that is the Cisco UCS SD controller/cards). The log location at least has a workaround in the meantime - see the sketch at the end of this post. It'll also be nice for faster booting, patching and installing - 99% of the time it doesn't matter, but when I've got to patch an entire site's worth of hosts I hate having to watch that progress bar crawl from left to right.

It'll all be even better when (fingers-crossed) we can automate this stuff away using Autodeploy.
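
For the log location bit, one interim fix is just pointing syslog at a datastore. A sketch, assuming a datastore named datastore1 - adjust the path for your environment:

code:
# Send ESXi syslog output to a persistent directory on a datastore (path is an example)
esxcli system syslog config set --logdir=/vmfs/volumes/datastore1/logs/esx01
# Reload syslog so the change takes effect
esxcli system syslog reload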

diremonk
Jun 17, 2008

Hoping someone has an idea on an issue I'm having with one of my VMs. We took a power hit yesterday that took down my rack after the UPS ran down. I'm able to get two of my VMs up in ESXi but the third is giving an error of:

code:
Failed - An error occurred while creating temporary file for /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_5_4_VM/Zabbix_5_4_VM.vmx: The file already exists

Errors
An error occurred while creating temporary file for /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_5_4_VM/Zabbix_5_4_VM.vmx: The file already exists
Cannot open the configuration file /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_5_4_VM/Zabbix_5_4_VM.vmx.
Failed to start the virtual machine (error -18).
I googled the error and looked at a couple of the suggestions. I've downloaded the vmx file and it looks fine, no corruption in it. I've tried un-registering the VM and re-registering it, no luck. I tried editing the

Any ideas? I'd rather not nuke this vm since I have my monitoring system on it for my broadcast systems.

CommieGIR
Aug 22, 2006

The blue glow is a feature, not a bug


Pillbug

diremonk posted:

Hoping someone has an idea on an issue I'm having with one of my VMs. We took a power hit yesterday that took down my rack after the UPS ran down. I'm able to get two of my VMs up in ESXi but the third is giving an error of:

code:
Failed - An error occurred while creating temporary file for /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_5_4_VM/Zabbix_5_4_VM.vmx: The file already exists

Errors
An error occurred while creating temporary file for /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_5_4_VM/Zabbix_5_4_VM.vmx: The file already exists
Cannot open the configuration file /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_5_4_VM/Zabbix_5_4_VM.vmx.
Failed to start the virtual machine (error -18).
I googled the error and looked at a couple of the suggestions. I've downloaded the vmx file and it looks fine, no corruption in it. I've tried un-registering the VM and re-registering it, no luck. I tried editing the

Any ideas? I'd rather not nuke this vm since I have my monitoring system on it for my broadcast systems.

Vdiskmanager might be the key to this:

https://stackoverflow.com/questions/46193933/how-to-repair-vmx-file-corrupted-at-vmware
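
Worth noting that vmware-vdiskmanager is a Workstation/Fusion tool and it only repairs the virtual disk, not the .vmx. On the ESXi host itself the rough equivalent (as far as I know) is vmkfstools - a sketch, assuming the disk is named after the VM:

code:
# Check the virtual disk for consistency errors (run from the ESXi shell)
vmkfstools -x check /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_5_4_VM/Zabbix_5_4_VM.vmdk
# Attempt a repair if the check reports problems
vmkfstools -x repair /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_5_4_VM/Zabbix_5_4_VM.vmdk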

Pikehead
Dec 3, 2006

Looking for WMDs, PM if you have A+ grade stuff
Fun Shoe

diremonk posted:

Hoping someone has an idea on an issue I'm having with one of my VMs. We took a power hit yesterday that took down my rack after the UPS ran down. I'm able to get two of my VMs up in ESXi but the third is giving an error of:

code:
Failed - An error occurred while creating temporary file for /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_5_4_VM/Zabbix_5_4_VM.vmx: The file already exists

Errors
An error occurred while creating temporary file for /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_5_4_VM/Zabbix_5_4_VM.vmx: The file already exists
Cannot open the configuration file /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_5_4_VM/Zabbix_5_4_VM.vmx.
Failed to start the virtual machine (error -18).
I googled the error and looked at a couple of the suggestions. I've downloaded the vmx file and it looks fine, no corruption in it. I've tried un-registering the VM and re-registering it, no luck. I tried editing the

Any ideas? I'd rather not nuke this vm since I have my monitoring system on it for my broadcast systems.

Hmm, are you able to post the directory listing for the virtual machine?

There's possibly a lock file there or one of the ~ files that is still around (but shouldn't be).
Also potentially vmx corruption.
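
If you can get into the ESXi shell, something like this usually shows whether a stale lock or leftover temp file is the culprit (paths taken from your error message):

code:
# List everything in the VM directory, including any leftover lock/temp files
ls -la /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_5_4_VM/
# Dump VMFS metadata for the .vmx - the owner field shows which host holds a lock, if any
vmkfstools -D /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_5_4_VM/Zabbix_5_4_VM.vmx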

diremonk
Jun 17, 2008

Here are the results of an ls -a on the host. Strange that it shows there is no .vmx~ file. Sorry it's kind of small; I remoted into my work desktop to grab the screenshot.

Pile Of Garbage
May 28, 2007



diremonk posted:

Here are the results of an ls -a on the host. Strange that it shows there is no .vmx~ file. Sorry it's kind of small; I remoted into my work desktop to grab the screenshot.



Maybe the issue is with the permissions on the file. Check with ls -alF, not sure what the correct permissions are meant to be though.

Arishtat
Jan 2, 2011

Pile Of Garbage posted:

Maybe the issue is with the permissions on the file. Check with ls -alF, not sure what the correct permissions are meant to be though.

Worst-case scenario, if the VMX is hosed you can create a new shell VM and attach the existing VMDK to it. For safety's sake I’d actually copy the VMDK to the new VM's folder so you have the untouched original as a backup until you get everything working again.
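
The copy itself is best done with vmkfstools from the ESXi shell so the descriptor and flat extent stay together - a sketch, assuming the disk is named after the VM and using a hypothetical Zabbix_rebuild folder for the new shell VM:

code:
# Make a folder for the replacement shell VM (example name)
mkdir /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_rebuild
# Clone the disk rather than cp'ing it, so the descriptor and -flat extent are copied as a pair
vmkfstools -i /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_5_4_VM/Zabbix_5_4_VM.vmdk \
  /vmfs/volumes/5bc4f406-17fa0226-2f08-c81f66ce42df/Zabbix_rebuild/Zabbix_5_4_VM.vmdk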

Pikehead
Dec 3, 2006

Looking for WMDs, PM if you have A+ grade stuff
Fun Shoe

diremonk posted:

Here are the results of an ls -a on the host. Strange that it shows there is no .vmx~ file. Sorry it's kind of small; I remoted into my work desktop to grab the screenshot.



What's in the /var/log/vmkernel.log when you try and power it on?
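
Easiest way to catch it is to follow the log while you hit power-on, something like:

code:
# Watch the kernel log and filter for the VM name while attempting the power-on
tail -f /var/log/vmkernel.log | grep -i zabbix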

Saukkis
May 16, 2003

Unless I'm on the inside curve pointing straight at oncoming traffic the high beams stay on and I laugh at your puny protest flashes.
I am Most Important Man. Most Important Man in the World.

diremonk posted:

Here are the results of an ls -a on the host. Strange that it shows there is no .vmx~ file. Sorry it's kind of small; I remoted into my work desktop to grab the screenshot.



I might try copying or moving all the files to another directory; it sounds like there is something screwy with the directory.

diremonk
Jun 17, 2008

OK, I moved it to another directory and it starts booting, but then it starts having issues with the file system.

Maybe I should just nuke it and start over, then do a proper snapshot and backup once I have it all working.

Nitrousoxide
May 30, 2011

do not buy a oneplus phone



diremonk posted:

OK, I moved it to another directory and it starts booting, but then it starts having issues with the file system.

Maybe I should just nuke it and start over, then do a proper snapshot and backup once I have it all working.

It sounds like it would probably take less time to do the latter at this point.

Pile Of Garbage
May 28, 2007



diremonk posted:

OK, I moved it to another directory and it starts booting, but then it starts having issues with the file system.

Maybe I should just nuke it and start over, then do a proper snapshot and backup once I have it all working.

This sounds like multiple different issues possibly related to the storage. If you're also having issues with other VMs on the same storage then you may have a problem.

Otherwise I'd recommend what Arishtat suggested:

Arishtat posted:

Worst-case scenario, if the VMX is hosed you can create a new shell VM and attach the existing VMDK to it. For safety's sake I’d actually copy the VMDK to the new VM's folder so you have the untouched original as a backup until you get everything working again.

Hughmoris
Apr 21, 2007
Let's go to the abyss!
*Disregard. This was more specific to cloud tech than virtualization. Asked in COC.

Hughmoris fucked around with this message at 13:11 on Sep 13, 2021

Thanks Ants
May 21, 2004

#essereFerrari


VMware are discontinuing support for running the hypervisor from only SD cards and USB sticks. You'll need good local disk as well, at which point the SD card adds no value.

https://blogs.vmware.com/vsphere/2021/09/esxi-7-boot-media-consideration-vmware-technical-guidance.html

Thanks Ants fucked around with this message at 12:39 on Oct 4, 2021

Pile Of Garbage
May 28, 2007



Thanks Ants posted:

VMware are discontinuing support for running the hypervisor from only SD cards and USB sticks. You'll need good local disk as well, at which point the SD card adds no value.

https://blogs.vmware.com/vsphere/2021/09/esxi-7-boot-media-consideration-vmware-technical-guidance.html

I assume most servers these days come with M.2 NVMe SSD sockets, so it should be simple enough to use those instead whilst still avoiding the need for a SATA/SAS drive.

SlowBloke
Aug 14, 2017

Thanks Ants posted:

VMware are discontinuing support for running the hypervisor from only SD cards and USB sticks. You'll need good local disk as well, at which point the SD card adds no value.

https://blogs.vmware.com/vsphere/2021/09/esxi-7-boot-media-consideration-vmware-technical-guidance.html

Well, the writing was on the wall; hopefully they won't start requiring stupidly big local disks to run.

Bob Morales
Aug 18, 2006


Just wear the fucking mask, Bob

I don't care how many people I probably infected with COVID-19 while refusing to wear a mask, my comfort is far more important than the health and safety of everyone around me!

Gotta use that 500GB SATA drive Dell won't let you remove from the config somehow!

BlankSystemDaemon
Mar 13, 2009



Pile Of Garbage posted:

I assume most servers these days come with M.2 NVMe SSD sockets so should be simple enough to use those instead whilst still avoiding the need for a SATA/SAS drive.
You'll be hard-pressed to find server boards without any SATADOM ports, and two of them is by far the norm, for the exact reason that it's the easiest way to add two SSDs in a mirror for ESXi or another type 1 hypervisor to boot from.
ESXi is designed to load its OS into memory and run it entirely from there, only writing to the disks when configuration is changed and saved, so there's hardly any point in using NVMe.
The marginal boot time gains aren't going to mean anything, because any company who's serious about nines will be doing live migration.

SlowBloke posted:

Well, the writing was on the wall; hopefully they won't start requiring stupidly big local disks to run.
You can't live-migrate VMs that are using local storage, which is one of the big selling points of virtualization - so I don't think there's any need to worry about that.

Bob Morales
Aug 18, 2006


Just wear the fucking mask, Bob

I don't care how many people I probably infected with COVID-19 while refusing to wear a mask, my comfort is far more important than the health and safety of everyone around me!

BlankSystemDaemon posted:

You'll be hard-pressed to find server boards without any SATADOM ports, and two of them is by far the norm, for the exact reason that it's the easiest way to add two SSDs in a mirror for ESXi or another type 1 hypervisor to boot from.
ESXi is designed to load its OS into memory and run it entirely from there, only writing to the disks when configuration is changed and saved, so there's hardly any point in using NVMe.
The marginal boot time gains aren't going to mean anything, because any company who's serious about nines will be doing live migration.


See the Dell BOSS card

Pile Of Garbage
May 28, 2007



BlankSystemDaemon posted:

You'll be hard-pressed to find server boards without any SATADOM ports, and two of them is by far the norm, for the exact reason that it's the easiest way to add two SSDs in a mirror for ESXi or another type 1 hypervisor to boot from.
ESXi is designed to load its OS into memory and run it entirely from there, only writing to the disks when configuration is changed and saved, so there's hardly any point in using NVMe.
The marginal boot time gains aren't going to mean anything, because any company who's serious about nines will be doing live migration.

Not sure if it's an issue any more, but I remember back on 5.5 and maybe 6.0 that if the ESXi root file system broke, it also broke poo poo like vMotion.

SlowBloke
Aug 14, 2017

BlankSystemDaemon posted:

You can't live-migrate VMs that are using local storage, which is one of the big selling points of virtualization - so I don't think there's any need to worry about that.

That hasn't been true since 5.1 tho; it just takes an assload of time if you don't have quick local disks and 10G links on the management interfaces.

gallop w/a boner
Aug 16, 2002

Hell Gem
Can anyone help me understand something around CPU Ready and NUMA home node percentage?

We have Cisco B200 blades running ESXi 6.7. These have 2 x Xeon E5-2660 v3 processors and 384GB of RAM, so vSphere sees each host as having 40 logical CPUs.
Each blade hosts 6 x Citrix/RDS VMs. Each Citrix VM has 6 x vCPUs and 40GB of memory. Each blade also has 1 x SIEM monitoring VM with 2 x vCPUs.
Therefore the VMs on each ESXi host have a total of 38 vCPUs assigned. The Citrix VMs have 1 virtual socket with 6 cores, as per the NUMA recommendations.
We have Opvizor monitoring software. During busy periods we see VMs with high CPU Ready and also low NUMA home node percentages. See image below.

We don't understand 1) why we see high CPU Ready even though we have not oversubscribed vCPUs to pCPUs, and 2) why we see poor NUMA performance even though we have followed the NUMA guidelines.

[post attachment: Opvizor screenshot showing the CPU Ready and NUMA home node metrics]

Internet Explorer
Jun 1, 2005





I feel like the more I learn about vNUMA, the less I understand it. That being said, I generally don't factor in logical cores when doing NUMA planning. Going that route, you'd have 2 x 10 physical cores to work with. Given that constraint, I think you might find things running better with 4 vCPUs. Now, that just may not be enough for the users on those servers; in that case, I would say it's time to move off hosts with CPUs from 2014.

Definitely appreciate all the information you provided, though. I've done a lot of Citrix performance tuning over my life and I feel like I am still constantly trying to get engineers to appreciate vNUMA in their designs.

It's very clear, though, that your end users are having a rough time.

Cyks
Mar 17, 2008

The trenches of IT can scar a muppet for life
I'm excited for a follow-up, as I only started learning about both of those concepts last week while trying to get a better grasp of metrics. Which means my input isn't very valuable.

With that said, the CPU Ready average seems fine from what I've read (aim for less than 5% per vCPU). The NUMA metrics are obviously problematic; are you positive that memory is spread out equally between both processors? Is the SIEM server hogging too many resources?

Internet Explorer
Jun 1, 2005





I don't think 40 GB of memory assigned will cause a problem with vNUMA for those hosts. The rule of thumb that I use for myself is total memory / total processors before I start getting worried about memory.

I think CPU Ready Average % already accounts for each vCPU, and while I agree VMware recommends staying under 5%, I would feel much better if I weren't close to 5%.

For me, one of the biggest metrics I take into account when I am looking at things is Co-Stop. It can sometimes be hard to pin down the cause, but to me Co-Stop should be as close to 0 as it can be. It can be vCPU overcommit, it can be storage issues, it can be snapshot issues.

I'm curious, how many users do you have on each of those Citrix hosts? Are you doing any in-guest / user session monitoring?
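
If you want to sanity-check Opvizor against the host itself, esxtop shows the same counters - roughly this, though the field names are from memory so double-check them:

code:
# Interactive: press 'c' for the CPU view and watch %RDY and %CSTP per VM,
# then 'm' for the memory view and enable the NUMA fields (NHN = home node, N%L = % of memory that is local)
esxtop
# Batch mode: capture ~5 minutes of samples at 15-second intervals to a CSV for offline analysis
esxtop -b -d 15 -n 20 > /tmp/esxtop-capture.csv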

gallop w/a boner
Aug 16, 2002

Hell Gem

Internet Explorer posted:

I'm curious, how many users do you have on each of those Citrix hosts? Are you doing any in-guest / user session monitoring?

We have an average of 6 users per Citrix VM. It's a professional services org and users will typically run several fairly intensive apps at once (e.g. Office, Teams, Acrobat, CRM, Billing).

We have some in-house UX monitoring stuff that launches Outlook within the Citrix desktop and starts a timer, and that seems to support the theory that the NUMA problems are having a real-world impact.

George RR Fartin
Apr 16, 2003




I'm fairly new to this, but the last couple of weeks futzing around with Docker and Vagrant on my Debian box have been really fun and enlightening. I figured I'd poke in here and see if I'm wildly off base with my approach and to get a bit of perspective. My experience so far is a raindrop in a lake, so I figured I'd bounce this off people who were working with full buckets of water.

In short, I figured I'd try to write something to deploy and provision a media server. The majority of initial setup was done from this guide:

https://www.linuxserver.io/blog/2017-06-24-the-perfect-media-server-2017

...and I figured that if I wanted to futz around further, the best thing to do would be to use a Vagrant instance. My biggest hangup has been that the host machine is configured to use mergerfs to pool drives and snapraid to handle the parity drive - and I can't for the life of me get the vagrant instance to consistently find and handle the drives. My solution so far is this:

code:
if not File.exists?(storage)
  vb.customize ['createhd', '--filename', storage, '--variant', 'Fixed', '--size', 5 * 1024]
  vb.customize ['storageattach', :id, '--storagectl', 'SATA Controller', '--port', 2, '--device', 0, '--type', 'hdd', '--medium', parity1]
  vb.customize ['storageattach', :id, '--storagectl', 'SATA Controller', '--port', 3, '--device', 0, '--type', 'hdd', '--medium', storage]
end
...and it works if I do it from scratch without an existing instance, but any subsequent attempt to bring vagrant up causes a variety of errors related to the drives having already been created, and not being mountable at that point.

The whole reason for this rigamarole is to give me a test bed for something I want to deploy on the bare host system, so the workaround for mounting drives is exclusively a Vagrant thing. It's taking up a bunch of time, and ultimately I'm not even going to use it since the goal is to just get this working outside a virtual machine.

Is there a better way to approach this? I tend to get tunnel-vision in terms of solutions, so I don't want to be working towards a goal that can, at best, be "serviceable."
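
(For what it's worth, the re-up errors are presumably VirtualBox still having the old media registered. Clearing them out between attempts looks roughly like this - treat it as a sketch, since the UUIDs/paths come from the list command:)

code:
# Show every hard disk VirtualBox currently has registered, including ones left behind by destroyed instances
VBoxManage list hdds
# Unregister (and delete) a stale disk by UUID or path so the next vagrant up can recreate it
VBoxManage closemedium disk <uuid-or-path> --delete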

TheFace
Oct 4, 2004

Fuck anyone that doesn't wanna be this beautiful

gallop w/a boner posted:

Can anyone help me understand something around CPU Ready and NUMA home node percentage?

We have Cisco B200 blades running ESXi 6.7. These have 2 x Xeon E5-2660 v3 processors and 384GB of RAM, so vSphere sees each host as having 40 logical CPUs.
Each blade hosts 6 x Citrix/RDS VMs. Each Citrix VM has 6 x vCPUs and 40GB of memory. Each blade also has 1 x SIEM monitoring VM with 2 x vCPUs.
Therefore the VMs on each ESXi host have a total of 38 vCPUs assigned. The Citrix VMs have 1 virtual socket with 6 cores, as per the NUMA recommendations.
We have Opvizor monitoring software. During busy periods we see VMs with high CPU Ready and also low NUMA home node percentages. See image below.

We don't understand 1) why we see high CPU Ready even though we have not oversubscribed vCPUs to pCPUs, and 2) why we see poor NUMA performance even though we have followed the NUMA guidelines.


You are oversubscribed (the 2660 v3 is 10 cores per socket; never count hyperthreading when considering vCPU-to-pCPU ratios, or NUMA), just not by a lot (1.9:1). I don't know if your monitoring software is accounting for it, but CPU Ready as shown in esxtop is a total across all vCPUs, so 6% CPU Ready for a six-vCPU VM would be 1% per vCPU. That said, the tolerance to CPU Ready and its impact on performance can vary depending on the application, and in this case user experience is the gauge, so if you're experiencing performance issues during these times of "high" CPU Ready then it doesn't matter what the measurement is - it's too high for your environment.

Having been the admin for many Citrix-on-vSphere environments, I tend to get a better end-user experience scaling out instead of up. As someone else suggested, you might want to consider adding more VMs but scaling each down to 4 vCPUs. You'll get fewer users per VM, but you'll have more VMs to distribute that load, and the ESXi CPU scheduler will likely have an easier time not only keeping up but also keeping the VMs within their NUMA bounds.
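
For converting the raw counter: the ready summation number that vCenter and most monitoring tools expose is milliseconds of ready time per sample interval, so the percentage works out as ready_ms / (interval_s * 1000) * 100 - as I understand it, the real-time charts use a 20-second interval. Quick sanity check:

code:
# 800 ms of ready time in a 20 s real-time sample:
awk 'BEGIN { printf "%.1f%%\n", 800 / (20 * 1000) * 100 }'   # prints 4.0%, summed across all vCPUs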

battlepigeon
Aug 3, 2008

So the latest vCenter 7.0 Update 3 patch broke the ability to use SMB backups of the vCenter config and database.

How do you even gently caress that up, VMware?

Pikehead
Dec 3, 2006

Looking for WMDs, PM if you have A+ grade stuff
Fun Shoe

battlepigeon posted:

So the latest vCenter 7.0 Update 3 patch broke the ability to use SMB backups of the vCenter config and database.

How do you even gently caress that up, VMware?

The latest ESXi patch causes purple screens when powering on thin-provisioned virtual machines, so I'd say it's just VMware being VMware - https://kb.vmware.com/s/article/86100

Wibla
Feb 16, 2011

I'll just stick to proxmox for now :v:


Number19
May 14, 2003

HOCKEY OWNS
FUCK YEAH


Pikehead posted:

The latest ESXi patch causes purple screens when powering on thin-provisioned virtual machines, so I'd say it's just VMware being VMware - https://kb.vmware.com/s/article/86100

Lol how does something like that make it to production?

I’m glad Nimble advises you to make thick eager-zeroed VMs, so I won’t have to deal with this.
