Less Fat Luke
May 23, 2003

Exciting Lemon
Thanks. It did sound too good to be true :-)

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
The lines between type 1 and type 2 hypervisors have been increasingly blurred over the years, to the point where it's almost an irrelevant distinction in modern times (or, at least, not a granular enough one to be useful).

TheFace
Oct 4, 2004

Fuck anyone that doesn't wanna be this beautiful

Alfajor posted:

I've been working on setting up syslogs from a VMware 6.5 environment, and having a bitch of a time with the ESX part (vCenter is going well).
In the ESX hosts, I set the variable "Syslog.global.logHost" to point to my syslog server, enable the firewall rule on the host, and messages pour out. However, the bulk of the messages are "debug", and I couldn't find a way to get them to step down to "info" or even "warning".
So I gave up, and reset "Syslog.global.logHost" to blank, disabled the firewall rule for syslog on the host, and even did a "esxcli system syslog reload"... and messages are still flowing to my syslog endpoint, even 2 hours later.

Now I'm assuming that I'm missing something obvious, and the same solution to make syslogs stop going out might be the same as changing the effective logging levels.
Any ideas?

I didn't see this answered, but there's a config.xml file for hostd, and vpxd.cfg (also xml file), and you can control the log level there. You're probably set to verbose. For putting them back local, did you change Syslog.global.LogDir at all when you pointed to the syslog server?
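
A rough sketch of how to confirm what level a config file is actually set to before changing anything. It assumes the hostd file lives at /etc/vmware/hostd/config.xml and keeps the level in a single-line <level>...</level> element; the function name show_log_level is made up for illustration.

```shell
# Hypothetical helper: print every <level>...</level> value found in an XML config file.
# Assumes each <level> element sits on a single line.
show_log_level() {
    sed -n 's:.*<level>\([^<]*\)</level>.*:\1:p' "$1"
}

# Example on a host (the path is an assumption):
#   show_log_level /etc/vmware/hostd/config.xml
```

If that prints "verbose" or "trivia", that would be consistent with the flood of debug messages described above.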

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Happiness Commando posted:

I did a firmware update on one of my Dell servers. vMotion to the host now takes about half an hour per VM. Away from it takes the normal one or two minutes :(

VMware support was useless, unsurprisingly. I should have checked with Dell first. I will probably do that Monday...

What NICs are you using? If you say X710 you win the poo poo NIC Lottery.

Moey
Oct 22, 2010

I LIKE TO MOVE IT

BangersInMyKnickers posted:

What NICs are you using? If you say X710 you win the poo poo NIC Lottery.

I have a bunch of QLogic FastLinQ 44164 NICs (4x10gbe) in our newer Dell R640 servers, no issues yet (crosses fingers). Running ESXi 6.7

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

I had to dump the X710s for the QLogics, which were much better. Had to max out the ring buffer size (to 4096 or something like that) to get consistent 10gig throughput without drops under vMotion load though; it would float around 4-5gig otherwise.

https://communities.vmware.com/thread/393600
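
For reference, a minimal sketch of the ring buffer change, assuming ESXi 6.5 or later where the "esxcli network nic ring" namespace is available. The helper name ring_cmds and the vmnic names are placeholders, and the helper only prints the commands so nothing on the host changes until they are run by hand.

```shell
# Hypothetical helper: print (not run) the commands to inspect and raise a NIC's RX ring size.
ring_cmds() {
    nic="$1"
    size="${2:-4096}"   # 4096 is the value mentioned above; check the preset maximum first
    echo "esxcli network nic ring preset get -n $nic"
    echo "esxcli network nic ring current get -n $nic"
    echo "esxcli network nic ring current set -n $nic -r $size"
}

# Example: ring_cmds vmnic0
```

The first printed command shows the maximum the driver supports, which is worth checking before setting a size the NIC may reject.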

Happiness Commando
Feb 1, 2002
$$ joy at gunpoint $$

BangersInMyKnickers posted:

What NICs are you using? If you say X710 you win the poo poo NIC Lottery.

These NICs were Broadcom 5720 and 5719. My 10 GbE are X550.

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Check the VMware compatibility guide and try to figure out what driver/rev is being invoked. I believe there is a command in the command line tool set to parse it.

https://www.vmware.com/resources/co...r&sortOrder=Asc

https://www.vmware.com/resources/co...r&sortOrder=Asc

https://www.vmware.com/resources/co...r&sortOrder=Asc

You want to make sure your firmware rev matches the vib/driver version and if you don't have the correct one to get it pushed through update manager. As a general rule, you should be trying to use the most current native driver available to match your firmware rev. If the native driver isn't functioning correctly, you may need to fall back to the emulated vmklinux driver. Generally the native driver will supersede the vmklinux driver but not always, and in that case you will need to manually disable or uninstall the offending vib that you do not want to use. Be aware that this going sideways may result in your management interface going dark and needing to recover with a local console or your DRAC.
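
A sketch of pulling the driver and firmware revisions to compare against the compatibility guide. It assumes "esxcli network nic get -n <nic>" prints "Driver:", "Firmware Version:" and "Version:" lines under a "Driver Info:" heading; the helper name nic_driver_info is hypothetical.

```shell
# Hypothetical helper: keep only the driver name, firmware version and driver version
# from `esxcli network nic get -n <nic>` output read on stdin.
nic_driver_info() {
    grep -E '^[[:space:]]*(Driver|Firmware Version|Version):' | sed 's/^[[:space:]]*//'
}

# Example on a host (vmnic0 is a placeholder):
#   esxcli network nic get -n vmnic0 | nic_driver_info
#   esxcli software vib list | grep -i <driver name>
```

The second command in the example lists the installed VIBs, which is one way to spot a native and a vmklinux driver package sitting side by side.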

Once you get through all that, try adjusting the ring buffer sizes from the link I posted above. Also I think you already did this but a 9k+ MTU is pretty much a requirement to get 10gig line speed on any card on the market that isn't the X710+ (which can saturate a 10gig interface with 64-byte frames, but is so wildly unstable that you cannot use it in production).

BangersInMyKnickers fucked around with this message at 05:04 on Aug 15, 2018

Alfajor
Jun 10, 2005

The delicious snack cake.

TheFace posted:

I didn't see this answered, but there's a config.xml file for hostd, and vpxd.cfg (also xml file), and you can control the log level there. You're probably set to verbose. For putting them back local, did you change Syslog.global.LogDir at all when you pointed to the syslog server?

I did mess with config.xml and vpxd.cfg, and didn't see any change in what was pouring into syslog.
The .logDir variable is empty, not populated. I do wonder if that needs to be set so that it can be "shipped" from there... but I don't know if that's really a thing.

At any rate, I ended up disabling everything, and after a few hours, the messages just stopped coming in to GrayLog. :iiam:
Which was a decision, because even with all the debug data (300k a day per host!), I couldn't really find any useful metrics to dashboard. Just grabbing events from vCenter is enough apparently, so wooohooo, I've moved on from this mysterious syslog gray area.

TheFace
Oct 4, 2004

Fuck anyone that doesn't wanna be this beautiful

Alfajor posted:

I did mess with config.xml and vpxd.cfg, and didn't see any change in what was pouring into syslog.
The .logDir variable is empty, not populated. I do wonder if that needs to be set so that it can be "shipped" from there... but I don't know if that's really a thing.

At any rate, I ended up disabling everything, and after a few hours, the messages just stopped coming in to GrayLog. :iiam:
Which was a decision, because even with all the debug data (300k a day per host!), I couldn't really find any useful metrics to dashboard. Just grabbing events from vCenter is enough apparently, so wooohooo, I've moved on from this mysterious syslog gray area.

You'll probably want to get logs at least dumping locally or to a datastore if you haven't already. It's the first thing VMware asks for in pretty much every support case.

nulldev1ce
Aug 16, 2002
Shiny Globule
Has anyone here experienced the following?

Environment: Dell PowerEdge T630 running ESXi 6.5, which is installed to a dual SD card rig, and the RAID'd disks are used for VM storage only. A few weeks back I bought an Essentials license. Spooled up several VMs, set up vCenter, etc. Everything was normal and sane and happy.

At some point I needed to physically move the server. I cleanly shut down all guests, then VMware itself. Unplugged, moved, replugged, powered on. Server came up fine, but all but two VMs were missing from the list. I went to the licensing tab, and the Essentials license was gone. Instead it showed a key with all zeroes and a message "evaluation license expired." I scratched my head, reapplied the license key, it recognized it, as it had before. I registered the missing VMs, and all seemed well again.

Today, someone rebooted the host (for a dumb reason, irrelevant) and it happened again -- came up OK, but the real license was lost, replaced by an expired eval license, and all but two VMs deregistered. Also, the two NFS datastores I'd set up yesterday were gone (not inactive/grayed out, but rather, missing altogether.) Once again I reapplied the Essentials license key, which it again recognized, re-registered the missing VMs, and redefined the NFS datastores.

Neither I nor the consultant I'm working with have ever seen this happen before, and I'm not finding anything helpful on Google. It's not like it's losing the whole config on reboot -- I can still log in with the same password, the first two VMs are still there, etc. -- it just seems to lose its memory of having a legit license. (Although, I don't think the NFS datastores should be affected by licensing, so maybe that's unrelated.) Whiskey tango foxtrot?

anthonypants
May 6, 2007

by Nyc_Tattoo
Dinosaur Gum
Does ESXi say anything about having trouble writing to the SD cards, or can you plug a LiveCD/LiveUSB into the server and see if you can write to the SD cards?

evol262
Nov 30, 2010
#!/usr/bin/perl
Your cards are dead. Lots of SD cards (and SSDs) go to read-only once new blocks can't safely be written.

All the information you're talking about is only living in memory. It's probably desperately trying to log this fact and failing

evol262 fucked around with this message at 14:20 on Aug 18, 2018

Thanks Ants
May 21, 2004

#essereFerrari


Does your iDRAC say anything about the SD card health?

Wibla
Feb 16, 2011

evol262 posted:

Your cards are dead. Lots of SD cards (and SSDs) go to read-only once new blocks can't safely be written.

All the information you're talking about is only living in memory. It's probably desperately trying to log this fact and failing

Seems like it wouldn't be very hard to move to M.2 SSDs for this kind of poo poo.

evol262
Nov 30, 2010
#!/usr/bin/perl

Wibla posted:

Seems like it wouldn't be very hard to move to M.2 SSDs for this kind of poo poo.

I mean, those do the same thing, unless the basic nature of writing to flash has changed in the past few years (I haven't kept up on it)

M.2 SSDs solve the problem in the same way that SD and internal USB ports used to, except faster and with wear leveling (which most SD/USB devices don't have, since there's not enough IO to worry about it). They'll still eventually go read-only, though I've never seen even a first gen SLC SSD poo poo out this way. One of the earlier MLC vendors known for gaming equipment cheaped out on controllers and had this happen a lot. I don't remember the vendor, but someone here will

Plus this hardware probably doesn't have NVMe.

Basically, the recommendation for systems on SD/CF/USB/whatever flash without wear leveling is and has been to ship logs, run as much as possible on a ramdisk, and generally avoid high volume writes. Not everyone knows or follows that, though.

mewse
May 2, 2006

evol262 posted:

M.2 SSDs solve the problem in the same way that SD and internal USB ports used to, except faster and with wear leveling (which most SD/USB devices don't have, since there's not enough IO to worry about it). They'll still eventually go read-only, though I've never seen even a first gen SLC SSD poo poo out this way. One of the earlier MLC vendors known for gaming equipment cheaped out on controllers and had this happen a lot. I don't remember the vendor, but someone here will

Probably OCZ, they had a reputation for being crap

adorai
Nov 2, 2002

10/27/04 Never forget
Grimey Drawer

evol262 posted:

I don't remember the vendor, but someone here will
Pretty sure it was the early sandforce chipset ssds.

nulldev1ce
Aug 16, 2002
Shiny Globule

anthonypants posted:

Does ESXi say anything about having trouble writing to the SD cards, or can you plug a LiveCD/LiveUSB into the server and see if you can write to the SD cards?

evol262 posted:

Your cards are dead. Lots of SD cards (and SSDs) go to read-only once new blocks can't safely be written.

All the information you're talking about is only living in memory. It's probably desperately trying to log this fact and failing

The cards are < 6 months old, but I will check when I'm onsite next (VPN is down, for ISP reasons.) Thanks for the suggestions, will report back what if anything I find.

Pile Of Garbage
May 28, 2007



Pretty sure your hypervisor boot volume shouldn't be seeing significant amounts of write. You probably just got lovely flash.

Edit: I recently saw 1/5 failure rates on cards for Cisco UCS blades the year before last so don't exclude hardware failure.

nulldev1ce
Aug 16, 2002
Shiny Globule
K, the drac shows the following for "Removable flash media":

Internal Dual SD Module:
Redundancy Status = Full

Internal SD Module Status:
vFlash = Absent
IDSDM SD1 = Good
IDSDM SD2 = Good

I can ssh into the host and cd / and touch foo successfully, and I can do "logger HELLO" and it appears at the end of syslog.log. That means I can write to the boot volume, yeah? (Of course, I'm terrified to reboot the host to find out, since I'm in the midst of a P2V migration that took 13 hours to complete.)

I did this:

[root@vh1:/var/log] grep -i error * | grep 2018-08-17 > ~/mylogs.txt
[root@vh1:/var/log] cd ~/
[root@vh1:~] less mylogs.txt
WARNING: terminal is not fully functional
esxcli-software.log:2018-08-17T08:51:13Z esxcli-software: esxupdate: 65827: HostImage: ERROR: No host acceptance level is configured
esxupdate.log:2018-08-17T08:52:44Z esxupdate: HostImage: INFO: Installer <class 'vmware.esximage.Installer.BootBankInstaller.BootBankInstaller'> was not initiated - reason: altbootbank is invalid: Error in loading boot.cfg from bootbank /bootbank: Error parsing bootbank boot.cfg file /bootbank/boot.cfg: [Errno 2] No such file or directory: '/bootbank/boot.cfg'^@
esxupdate.log:2018-08-17T08:52:59Z esxupdate: 67464: HostImage: INFO: Installer <class 'vmware.esximage.Installer.BootBankInstaller.BootBankInstaller'> was not initiated - reason: altbootbank is invalid: Error in loading boot.cfg from bootbank /bootbank: Error parsing bootbank boot.cfg file /bootbank/boot.cfg: [Errno 2] No such file or directory: '/bootbank/boot.cfg'^@
iofiltervpd.log:2018-08-17T09:53:05Z iofiltervpd[66601]: run:159:SSL Connection error 30 : SSL_ERROR_SSL error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown
jumpstart-stdout.log:2018-08-17T08:51:16.447Z| Method invocation failed: iodm->start() failed: error while executing the cli
jumpstart-stdout.log:2018-08-17T08:51:40.813Z| execution of 'system coredump partition set --enable=true --smart' failed : Unable to smart activate a dump partition. Error was: No suitable diagnostic partitions found..
jumpstart-stdout.log:2018-08-17T08:51:40.813Z| Method invocation failed: dump-partition->start() failed: error while executing the cli
jumpstart-stdout.log:2018-08-17T08:52:41.811Z| Method invocation failed: vmci->start() failed: error while executing the cli
jumpstart-stdout.log:2018-08-17T08:52:43.919Z| Method invocation failed: vmswapcleanup->start() failed: error while executing the cli
rabbitmqproxy.log:2018-08-17T08:52:54Z rabbitmqproxy: 2018-08-17T08:52:54.347Z error -[62C51B0] [Originator@6876 sub=Default] [configInfo.cpp 203 rmp::ConfigInfo::load], Configuration file /etc/vmware/rabbitmqproxy/config.xml contains no broker
rabbitmqproxy.log:2018-08-17T08:52:54Z rabbitmqproxy: 2018-08-17T08:52:54.402Z error -[5F101B0] [Originator@6876 sub=Default] [configInfo.cpp 203 rmp::ConfigInfo::load], Configuration file /etc/vmware/rabbitmqproxy/config.xml contains no broker
syslog.log:2018-08-17T08:51:16Z jumpstart[65874]: Method invocation failed: iodm->start() failed: error while executing the cli
syslog.log:2018-08-17T08:51:40Z jumpstart[65902]: execution of 'system coredump partition set --enable=true --smart' failed : Unable to smart activate a dump partition. Error was: No suitable diagnostic partitions found..
syslog.log:2018-08-17T08:51:40Z jumpstart[65874]: Method invocation failed: dump-partition->start() failed: error while executing the cli
syslog.log:2018-08-17T08:52:41Z jumpstart[65874]: Method invocation failed: vmci->start() failed: error while executing the cli
syslog.log:2018-08-17T08:52:41Z jumpstart[66364]: Error: More than one exception specification for tardisk /tardisks/vsan.v00
syslog.log:2018-08-17T08:52:41Z jumpstart[66364]: Error: Ignoring /etc/vmware/secpolicy/tardisks/vsan
syslog.log:2018-08-17T08:52:42Z jumpstart[66369]: Error: More than one exception specification for tardisk /tardisks/vsan.v00
syslog.log:2018-08-17T08:52:42Z jumpstart[66369]: Error: Ignoring /etc/vmware/secpolicy/tardisks/vsan
syslog.log:2018-08-17T08:52:43Z jumpstart[65874]: Method invocation failed: vmswapcleanup->start() failed: error while executing the cli
syslog.log:2018-08-17T08:52:54Z slpd[67269]: test - LOG_ERROR
syslog.log:2018-08-17T09:13:06Z ImageConfigManager: 2018-08-17 09:13:06,063 [MainProcess INFO 'HostImage' MainThread] Installer <class 'vmware.esximage.Installer.BootBankInstaller.BootBankInstaller'> was not initiated - reason: altbootbank is invalid: Error in loading boot.cfg from bootbank /bootbank: Error parsing bootbank boot.cfg file /bootbank/boot.cfg: [Errno 2] No such file or directory: '/bootbank/boot.cfg'^@
syslog.log:2018-08-17T09:13:06Z ImageConfigManager: 2018-08-17 09:13:06,223 [MainProcess INFO 'HostImage' MainThread] Installer <class 'vmware.esximage.Installer.BootBankInstaller.BootBankInstaller'> was not initiated - reason: altbootbank is invalid: Error in loading boot.cfg from bootbank /bootbank: Error parsing bootbank boot.cfg file /bootbank/boot.cfg: [Errno 2] No such file or directory: '/bootbank/boot.cfg'^@
[snipped repeats]
syslog.log:2018-08-17T09:58:54Z ImageConfigManager: 2018-08-17 09:58:54,317 [MainProcess INFO 'HostImage' MainThread] Installer <class 'vmware.esximage.Installer.BootBankInstaller.BootBankInstaller'> was not initiated - reason: altbootbank is invalid: Error in loading boot.cfg from bootbank /bootbank: Error parsing bootbank boot.cfg file /bootbank/boot.cfg: [Errno 2] No such file or directory: '/bootbank/boot.cfg'^@
usb.log:2018-08-17T08:52:49Z usbarb[66643]: USBArbRuleStore: Error in '/etc/vmware/usbarb.rules' at line 1:0, '[' or '{' expected near end of file.
usb.log:2018-08-17T08:52:49Z usbarb[66643]: SOCKET connect failed, error 2: No such file or directory
vitd.log:2018-08-17T08:47:29Z vitd[155459]: VITD: Thread-0xf79dca0dc0 Calling _exit() due to error encountered above: No such file or directory
vmkdevmgr.log:2018-08-17T08:51:30Z vmkdevmgr[65985]: Failed to set driver for 0x76e14303447bcfe3: Unable to complete Sysinfo operation. Please see the VMkernel log file for more details.: Sysinfo error: Not foundSee VMkernel log for details.
vmkdevmgr.log:2018-08-17T08:51:30Z vmkdevmgr[65985]: Error binding driver vmkernel for bus=logical addr=pci#s00000008.00#0 id=com.vmware.HBAPort
vmkdevmgr.log:2018-08-17T08:51:30Z vmkdevmgr[65985]: Failed to set driver for 0x67a4303447bd15b: Unable to complete Sysinfo operation. Please see the VMkernel log file for more details.: Sysinfo error: Not foundSee VMkernel log for details.
vmkdevmgr.log:2018-08-17T08:51:30Z vmkdevmgr[65985]: Error binding driver vmkernel for bus=logical addr=pci#p0000:00:11.4#0 id=com.vmware.HBAPort
vmkdevmgr.log:2018-08-17T08:51:30Z vmkdevmgr[65985]: Failed to set driver for 0x30134303447bd497: Unable to complete Sysinfo operation. Please see the VMkernel log file for more details.: Sysinfo error: Not foundSee VMkernel log for details.
[snipped repeats]
vmkdevmgr.log:2018-08-17T08:51:31Z vmkdevmgr[65985]: Error binding driver vmkernel for bus=logical addr=pci#p0000:00:1f.2#0 id=com.vmware.HBAPort
vpxa.log:2018-08-17T08:52:56.462Z error vpxa[97C0480] [Originator@6876 sub=Default] Failed to set the current working directory: /var/log/vmware/vpx, N7Vmacore15SystemExceptionE(No such file or directory)
vpxa.log:2018-08-17T08:52:56.794Z error vpxa[97C0480] [Originator@6876 sub=Default] [VimXml] Error fetching /sdk/vimServiceVersions.xml: 503 (Service Unavailable)
vpxa.log:2018-08-17T08:52:56.794Z error vpxa[97C0480] [Originator@6876 sub=Default] Remote does not report support for WSDL namespace vim25
vvold.log:2018-08-17T08:52:52.234Z error vvold[5243850] [Originator@6876 sub=Default] VVold SI:main, no VVol config available, exiting

I'm too ignorant to know whether any of these are a smoking gun, or just normal error output. What should I check next?
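
One way to make a dump like the grep output above easier to triage is to count the hits per log file first. This is only a sketch: count_by_file is a made-up name, and it assumes awk and sort are available in the host's shell.

```shell
# Hypothetical helper: read `grep` output ("file:matched line") on stdin and
# print "<count> <file>" per log file, busiest file first.
count_by_file() {
    awk -F: '{ n[$1]++ } END { for (f in n) print n[f], f }' | sort -rn
}

# Example, reusing the same filter as above:
#   grep -i error /var/log/*.log | grep 2018-08-17 | count_by_file
```

Files that dominate the count (here the repeated boot.cfg messages would show up under syslog.log) are a reasonable place to start reading.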

evol262
Nov 30, 2010
#!/usr/bin/perl
Unknown whether this is installed as embedded or installable. Embedded only tries to flush to disk once per hour (even if you touch a file). Installable is heavy enough on disk writes that it's not recommended for flash.

Either way, you should look at the DRAC to see if there are storage errors. If not, it's hard to say what's happening, but this is so symptomatic of a flash failure that it's the only guess I've got.
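
Since the log above complains that /bootbank/boot.cfg is missing, a quick sanity check of the bootbank is one place to start. The sketch below assumes the standard /bootbank path; check_bootbank is a hypothetical name.

```shell
# Hypothetical helper: report whether a bootbank directory looks usable,
# i.e. boot.cfg exists and is not empty. Defaults to /bootbank.
check_bootbank() {
    dir="${1:-/bootbank}"
    if [ -s "$dir/boot.cfg" ]; then
        echo "ok: $dir/boot.cfg present"
    else
        echo "missing: $dir/boot.cfg"
        return 1
    fi
}

# On the host, roughly:
#   ls -l /bootbank            # see where the link actually points
#   check_bootbank
#   /sbin/auto-backup.sh       # force a config save instead of waiting for the hourly flush
#   ls -l /bootbank/state.tgz  # the timestamp should move if the save reached the boot media
```

If the forced save does not update state.tgz, that would fit the picture of config (license, registered VMs, NFS mounts) living only in memory and vanishing on reboot.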

Happiness Commando
Feb 1, 2002
$$ joy at gunpoint $$

We had SD card errors even though DRAC reported everything was fine. A couple weeks later the controller on the motherboard blew up and we went from having a degraded but functioning host that couldn't write files to storage to having an ESXi host that wouldn't boot

Happiness Commando fucked around with this message at 21:00 on Aug 20, 2018

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

cheese-cube posted:

Pretty sure your hypervisor boot volume shouldn't be seeing significant amounts of write. You probably just got lovely flash.

Edit: I recently saw 1/5 failure rates on cards for Cisco UCS blades the year before last so don't exclude hardware failure.

Log writing will go to flash (rather than RAM) if the device is large enough, and that can cause write wear pretty quickly. Make sure you point your logging at the datastore.
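
A minimal sketch of pointing the logs somewhere other than the boot media, using the esxcli syslog namespace. The datastore path and collector address are placeholders, and set_log_targets plus the ESXCLI override are made up for illustration so the commands can be previewed first.

```shell
# Hypothetical helper: point ESXi syslog at a datastore directory and a remote collector.
# ESXCLI can be overridden (e.g. ESXCLI=echo) to preview the commands without running them.
ESXCLI="${ESXCLI:-esxcli}"

set_log_targets() {
    logdir="$1"    # e.g. a folder on a VMFS datastore (placeholder path)
    loghost="$2"   # e.g. udp://collector:514 (placeholder address)
    $ESXCLI system syslog config set --logdir="$logdir" --loghost="$loghost"
    $ESXCLI system syslog reload
}

# Preview only:
#   ESXCLI=echo; set_log_targets /vmfs/volumes/datastore1/logs udp://192.0.2.10:514
```

Running "esxcli system syslog config get" afterwards shows what the host thinks the current log directory and remote host are.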

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Keep in mind that the dual SD module thing isn't some RAID controller with parity and patrol reads, just a simple failover scheme in the event that one of them completely craps itself. If one is corrupting itself or failing in a way that doesn't render the system unbootable, it may not detect the failure mode.

usb.log:2018-08-17T08:52:49Z usbarb[66643]: USBArbRuleStore: Error in '/etc/vmware/usbarb.rules' at line 1:0, '[' or '{' expected near end of file.
usb.log:2018-08-17T08:52:49Z usbarb[66643]: SOCKET connect failed, error 2: No such file or directory

vpxa.log:2018-08-17T08:52:56.462Z error vpxa[97C0480] [Originator@6876 sub=Default] Failed to set the current working directory: /var/log/vmware/vpx, N7Vmacore15SystemExceptionE(No such file or directory)

That might be an indication of corruption/write failure. Try swapping the SD cards around so the other one becomes primary and see if it behaves better.

Potato Salad
Oct 23, 2014

nobody cares


I spent a little money and a decent amount of time moving to SD cards only to move back to local sata ssds.

It's when you need to shut down in a hurry that SD booting into ramdisk gets fucky, and that kind of situation has bad overlap with other problems -- often the kind isolating you from your custom scratch / trace dump

Potato Salad fucked around with this message at 23:36 on Aug 20, 2018

Potato Salad
Oct 23, 2014

nobody cares


Writing vsan trace to SD cards at shutdown is rear end. If you have a customer who wants to jump on the bizarrely-popular vsan train, do not loving let them use anything but decent ssds for the boot partition, whether full 2.5" or m.2 boot drives.

H2SO4
Sep 11, 2001

put your money in a log cabin


Buglord
Since we're in the middle of SD card chat, is it typical for ESX boot to be slow as poo poo when using SD cards? I'm also a little worried because I bought Samsung microSD cards that came with the adapters because I couldn't find any actual full size SD cards at Fry's. Before I bring this box to my colo lab should I swap them for something else or is there not much of a difference? This is for an R720 with the dual SD module, and I loaded the latest 6.5 Dell image.

Moey
Oct 22, 2010

I LIKE TO MOVE IT
How slow? I can time a R630 or R640 tomorrow, but I would say once ESXi starts booting, it's about a minute?

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
ESXi doesn't read or write the media once it's running, so if this is for a colo use case instead of something you're turning on and off five times a day, does it actually matter?

H2SO4
Sep 11, 2001

put your money in a log cabin


Buglord
I just timed it and I'm probably just being impatient. ESX took 2:33 to come up, 40 seconds in the initial kernel load and 1:53 on the grey/yellow screen.

k-uno
Jun 20, 2004
I'm completely new to VMs, and I have a question about setting them up with socket affinity. I'm a physics prof and want to set up a two-socket workstation for research. The code I run (mostly Mathematica) is very CPU and memory bandwidth intensive, parallelizes very well up to a point, but has core count limitations (won't scale above 16 threads per instance). With one CPU in the system this is no big deal, but in a two socket machine it runs like poo poo by default since it doesn't know how to intelligently distribute data between the memory banks of the two sockets, and the bandwidth loss and latency increase that results from that kills performance. I want this to be a local system since while I have access to a cluster, I do a lot of restarting and tinkering with the code as I'm running it, and the hassle of submitting things into a queue gets really annoying in that case.

I had posted about this in the Intel thread, and it was recommended that I find ways to manually set socket affinity at the OS level, so specific threads are pinned to specific cores. However, it occurred to me that an even better solution could be to build a two socket machine (with at least 16 cores/socket), and run VMs; I would like to run two copies of Windows 10 simultaneously, and set them up so that each copy only accesses one socket and one memory bank. In principle this could get around core count limitations and NUMA issues, which could be a big performance boost if the overhead for running VMs isn't too bad. And in the relatively rare situations where I need communication between that many processes, or more than one bank's worth of RAM, I could just run a single instance with all the resources. However, I've never set up VMs before so I don't really know what I'm doing. Is what I want to do easy to accomplish? If so, what software should I use? VMware or Hyper-V?

Potato Salad
Oct 23, 2014

nobody cares


Processor affinity is a thing for sure.

You're in a "ooh I can do this? I could try this!" exploratory place right now. I have worked with many professors--this is good and normal :D

However, do make a list of pros/cons between making an ESXi two-VM system, where access would have to be remote and administration just a little more complicated, versus getting two separate win10 machines, keeping things simpler, and maybe even splurging for an i9-7960X Skylake X and overclocking the everliving hell out of it.


And out of those two, use VMware

Potato Salad
Oct 23, 2014

nobody cares


Intel's X CPUs are basically Xeons with no ECC support (no difference), smaller L3 caches (not too consequential if you're constantly loading/unloading huge vectors from main memory anyway), and include overclocking for CPU and RAM. These have worked miracles for *some* chem simulation, but everyone (rightly) reaches for Xeons first without testing out the "I need to decode two streams of 8K video live for my Plex player while playing 80 Skyrim mods" basement dweller black magic option for their specific small sim use case.

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Vulture Culture posted:

ESXi doesn't read or write the media once it's running, so if this is for a colo use case instead of something you're turning on and off five times a day, does it actually matter?

This is not true, it will opportunistically write logs to SD if it is large enough (I think 8gb+ is the delineation, but it's been a while).

k-uno posted:

I'm completely new to VMs, and I have a question about setting them up with socket affinity. I'm a physics prof and want to set up a two-socket workstation for research. The code I run (mostly Mathematica) is very CPU and memory bandwidth intensive, parallelizes very well up to a point, but has core count limitations (won't scale above 16 threads per instance). With one CPU in the system this is no big deal, but in a two socket machine it runs like poo poo by default since it doesn't know how to intelligently distribute data between the memory banks of the two sockets, and the bandwidth loss and latency increase that results from that kills performance. I want this to be a local system since while I have access to a cluster, I do a lot of restarting and tinkering with the code as I'm running it, and the hassle of submitting things into a queue gets really annoying in that case.

I had posted about this in the Intel thread, and it was recommended that I find ways to manually set socket affinity at the OS level, so specific threads are pinned to specific cores. However, it occurred to me that an even better solution could be to build a two socket machine (with at least 16 cores/socket), and run VMs; I would like to run two copies of Windows 10 simultaneously, and set them up so that each copy only accesses one socket and one memory bank. In principle this could get around core count limitations and NUMA issues, which could be a big performance boost if the overhead for running VMs isn't too bad. And in the relatively rare situations where I need communication between that many processes, or more than one bank's worth of RAM, I could just run a single instance with all the resources. However, I've never set up VMs before so I don't really know what I'm doing. Is what I want to do easy to accomplish? If so, what software should I use? VMware or Hyper-V?

I would go one of two routes:

1) Assuming your software can only saturate a single 16c socket, create two 16 vCPU VMs on the host and let them run normally. The hypervisor resource manager will keep them from crossing NUMA boundaries, assuming the boundaries are exposed by the host hardware, as it knows this incurs a penalty and will try to avoid it.

2) Make a single 32c VM on the host and go into the advanced processor settings, where you can define virtual cores per socket instead of standard vCPUs. Then the guest OS scheduler will keep the worker threads for each process assigned to their corresponding virtual socket, and the hypervisor will coordinate that with its NUMA awareness.

If you plan on saturating all cores and this is the only workload on the box, I would strongly advise running a bare metal install and not bothering with virtualization. You're just adding complexity and overhead with very few of the benefits of virtualization.

Potato Salad posted:

Intel's X CPUs are basically Xeons with no ECC support (no difference), smaller L3 caches (not too consequential if you're constantly loading/unloading huge vectors from main memory anyway), and include overclocking for CPU and RAM. These have worked miracles for *some* chem simulation, but everyone (rightly) reaches for Xeons first without testing out the "I need to decode two streams of 8K video live for my Plex player while playing 80 Skyrim mods" basement dweller black magic option for their specific small sim use case.

If he's doing research work, ECC would be advisable. Hell, if you're core constrained I'd be looking at Ryzen or Threadripper, getting twice the cores for your buck and spinning up more instances of the process.

BangersInMyKnickers fucked around with this message at 14:32 on Aug 21, 2018

Moey
Oct 22, 2010

I LIKE TO MOVE IT

BangersInMyKnickers posted:

This is not true, it will opportunistically write logs to SD if it is large enough (I think 8gb+ is the delineation, but it's been a while).

Syslog server is the solution here.

Wicaeed
Feb 8, 2005
Has anyone taken a crack at writing a PowerCLI script to fully Provision a new VM from a template?

I had one at a previous job but I've since lost it, but I remember it wasn't as simple as calling New-VM and then calling it a day.

Even something as simple as targeting a single datastore for a new VM was complicated because a Datastore & Datastore-Cluster are two distinct objects that need to be evaluated when choosing the datastore.

BallerBallerDillz
Jun 11, 2009

Cock, Rules, Everything, Around, Me
Scratchmo

Moey posted:

Syslog server Elastic stack is the solution here.

TheFace
Oct 4, 2004

Fuck anyone that doesn't wanna be this beautiful

Wicaeed posted:

Has anyone taken a crack at writing a PowerCLI script to fully Provision a new VM from a template?

I had one at a previous job but I've since lost it, but I remember it wasn't as simple as calling New-VM and then calling it a day.

Even something as simple as targeting a single datastore for a new VM was complicated because a Datastore & Datastore-Cluster are two distinct objects that need to be evaluated when choosing the datastore.

Should just be:

New-VM -Template 'TemplateName' -Name 'NewVMName' -ResourcePool 'ClusterName' -Datastore 'DatastoreName'

adorai
Nov 2, 2002

10/27/04 Never forget
Grimey Drawer
it's ELK stack. Elastic is just one component.
