1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

adorai posted:

I thought TPS was gone now.

Still there, just unlikely to get invoked.
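
(TPS being Transparent Page Sharing: still in the product, but recent ESXi builds salt it so inter-VM sharing is off by default, which is why it rarely kicks in.) If you want to check where a host stands, here's a minimal PowerCLI sketch; the host name is hypothetical:

    # minimal PowerCLI sketch -- host name is hypothetical
    # Mem.ShareForceSalting: 2 (the default) limits TPS to within a single VM,
    # 0 re-enables classic inter-VM page sharing
    Connect-VIServer vcenter.example.com
    Get-AdvancedSetting -Entity (Get-VMHost 'esx01.example.com') -Name 'Mem.ShareForceSalting'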

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

Wicaeed posted:

Well I'd love to read the article, but the VMware KB site is (yet again) down.

Maybe that's the problem with the VMware KB, they run on vmxnet3 NICs :colbert:

It may actually just be that you've gotten the cookie of death from their website. I find that sometimes it'll tell me it's down, and clearing my cache or switching to a different browser sorts it out.

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

Wicaeed posted:

School me on the VMware SDN options.

We've got teams (DevOps/QA) that love AWS. I'll admit it's fairly flexible to scale up new resources & then scale them down again when no longer in use, but I have some misgivings about the cost associated with running a large number of resource-hungry VMs on AWS. I myself work on the Datacenter Ops side, which includes fairly heavy VMware/UCS administration. I love our platform, but it does lack flexibility right now since our DNS/DHCP servers are statically configured (no Dynamic DNS, no static DHCP bindings), and our network (VLANs/IP subnets) is completely statically configured as well.

One of our leadership's goals for my team is to provide faster turnaround for new environments for both our QA & Dev departments. I've done the best I can from my end (templating VMs that hadn't been templated before, creating new customization profiles for departments, etc), however now I'm coming to a point at which I feel like I can't really do much more without new tools from VMware. At the same time, our Dev team is looking into tools like Packer & Terraform to allow us to rapidly provision (and de-provision) pre-configured environments for Dev/QA/Stage/etc, and eventually even our Prod environments. This is a fairly massive effort that is still ongoing, as it requires a rewrite of our entire stack, so at least I've got time on my hands.

AWS provides some really flexible options for creating new networks/IP ranges & making sure they remain secure and segmented between Prod and Stage environments. I've quickly found out that VMware's vanilla offerings for vCenter don't really allow for this, and to get that functionality you need to start looking into vCloud or even NSX.

I don't even have a VCP yet (still working on it) and haven't done the HOL yet for either vCloud or NSX, but if I wanted to start looking at our options which should I look at first?

Also making it more complicated is our Cisco UCS. I know it's powerful, and I love how easy it is working with the platform, but I doubt we're doing even 15% of what UCS can do. I imagine that since SDN involves creating & deleting VLANs/Subnets from the network, there's some integration that I need between Cisco & VMware too.

NSX can provide you overlays via VXLAN, load balancers, firewalls, and routers through a combination of kernel modules and virtual appliances. It integrates with a couple of the cloud management platforms out there.

I've got a number of customers that basically just "rubber stamp" out copies of their network for testing and QA, and even to eventually roll them into production.

If you don't need to provision a lot of network services then you can probably just look at something like Embotics or some other cloud management tool.
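
On the provisioning-speed side, a lot of the faster-turnaround ask is scriptable with plain vCenter templates and customization specs before any SDN enters the picture. A rough PowerCLI sketch; every name here is hypothetical:

    # rough PowerCLI sketch -- template, spec, cluster, and datastore names are all hypothetical
    Connect-VIServer vcenter.example.com
    New-VM -Name 'qa-web-01' `
           -Template (Get-Template 'rhel7-base') `
           -OSCustomizationSpec (Get-OSCustomizationSpec 'qa-dept') `
           -ResourcePool (Get-Cluster 'QA-Cluster') `
           -Datastore (Get-Datastore 'qa-ds-01')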

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!
I'd ask if they see themselves needing to frequently provision networks, or if you can just hand them a pool of networks that they have free rein to select IPs from and provision into. Depending on the applications and what the developers need, you probably don't need to provision on-demand networks, but if you have a use case that requires it then NSX would be worth looking into.

VMware actually very recently dropped the pricing on the bits of NSX that would provide this capability: https://www.vmware.com/products/nsx/compare.html (it's in the lowest tier). I don't think they've done a good job communicating this though; I just spent the better part of 20 minutes trying to figure out if I could even tell you that. Basically, the edition you need is going to be ~$2,000 per socket list (as opposed to the previous $6,000 a socket list). Adorai, it may be worth revisiting whether the lowest tier addresses your problems.

This could allow you the flexibility to provide 'VPC-like' networking, floating IPs, etc. In fact you could potentially couple it with VIO (VMware Integrated OpenStack) and have a lot of the same APIs/interfaces.

If they're asking "why VMware" then the answer is pretty much "my job as Wicaeed is to support the poo poo you build and I need to understand the infrastructure to do it!"

I would basically sit down and go over a few scenarios that will be common for them. Good odds that you can make Terraform work for a lot of it using the default VMware networking, even if they're doing all the fancy poo poo HashiCorp is selling.
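
The pool-of-networks idea from above is also easy to pre-provision without NSX: stamp out a block of VLAN-backed portgroups on the vDS and let the dev teams (or Terraform) pick from them. A hedged PowerCLI sketch; the switch name and VLAN range are hypothetical:

    # hedged PowerCLI sketch -- vDS name and VLAN range are hypothetical
    Connect-VIServer vcenter.example.com
    $vds = Get-VDSwitch 'dvSwitch-Dev'
    # pre-create a pool of VLAN-backed portgroups for dev/QA to consume
    101..110 | ForEach-Object {
        New-VDPortgroup -VDSwitch $vds -Name ("dev-net-vlan{0}" -f $_) -VlanId $_
    }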

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

mayodreams posted:

I'll share my NTP story that I use as an example of what not to do for interviews.

We had a number of small banks as clients, and this particular one was a newer addition that we were adding some new servers for as a play to get more business. So I was sent out to the main branch to update the firmware of the ESXi hosts we had put into production like a month earlier. You might be thinking, 'Hey mayodreams, why would you do that if they just went into production?', and you would be right.

I got to the site around noon, and I told the lady overseeing my access to the room it should take about 60-90 minutes for me to update the servers and things should be good to go. There were 3 servers, so I start moving VMs around so I could power one down and start the process. First server goes great, and I'm shuffling VMs to work on the second one. Then something happens and I'm locked out of the admin server I was using. I try the VMware console and it won't authenticate with my AD creds, and since we didn't believe in documentation, I can't find a local admin credential to use.

I am starting to get worried. My boss and his boss (the director) were literally upstairs negotiating a 7 figure contract with this bank, and things are starting to go sideways during the day.

Then I realize the problem: NTP was not configured on the last host I moved the domain controller to, and both DCs are like 5 hours off from the time of the other workstations and servers. Panic is setting in. How long before workstations start locking because the time is off? I could have had a major impact at this client in the middle of the day. I am furiously calling and pinging/emailing my colleagues to find the local admin creds. I finally get them, but not before the lady comes back and asks what is taking so long. I am trying to keep my cool, which I do, and I tell her that some of the patches are taking a little longer, but I am working on it. I get the creds to both the DCs and log in to update the time once I've fixed NTP on ESXi, and things are good.

My boss comes down after their pitch and tells me not to worry or freak out over this. I am astonished.

A couple of days later during the weekly call, I say that as a result of the issue I had, we REALLY need to get checklists for builds so things like this don't get missed and cause issues for our clients. I then get chewed out for 45 minutes by my boss with the rest of the team on the call.

Two days later I was fired on day 90 of my 90 day probationary period because 'we are professionals, so we don't need checklists' and 'documentation is pointless because it is out of date as soon as it is written'.

And that ended the worst experience of my professional career.

You basically pointed out a major leadership failure, so he threw you under the bus. It sounds like it's a good thing though, because that sounds like a terrible place to work!
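
For anyone who wants to avoid the same trap, the actual fix is a few lines of PowerCLI per host; host and NTP server names here are hypothetical:

    # hedged PowerCLI sketch: configure and start NTP on an ESXi host -- names are hypothetical
    Connect-VIServer vcenter.example.com
    $vmhost = Get-VMHost 'esx01.example.com'
    Add-VMHostNtpServer -VMHost $vmhost -NtpServer 'pool.ntp.org'
    # set ntpd to start with the host, then start it now
    $ntpd = Get-VMHostService -VMHost $vmhost | Where-Object { $_.Key -eq 'ntpd' }
    Set-VMHostService -HostService $ntpd -Policy On
    Start-VMHostService -HostService $ntpd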

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

Wicaeed posted:

Anyone here actually do Netflow/sFlow logging across an entire VDS or Cisco UCS deployment?

Trying to evaluate what options I have if I want to look into doing something like that, without either:

A) Giving Solarwinds my business
B) Paying an arm and a leg

I have a feeling it's either A or B

You probably want to look at vRealize Network Insight. I want to say it's ~$1,500 per socket, and it will give you total visibility into what's going on in the VDS and the UCS.

edit: this came from the Arkin acquisition.

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

BangersInMyKnickers posted:

I've been working with VMware support for a few months now on a really annoying issue with their LACP implementation. The first part is that vdSwitches with LACP default to long timeouts (90s), which is not exactly desirable for failed link detection on the storage fabric. The switches default to short timeouts (3s) and I'm forced to just live with the mismatch. You can manually configure each host to run the vdSwitch with short timeouts over SSH, but there doesn't seem to be a way to change the vdSwitch definition, so it resets back to long on reboot. This ends up manifesting itself at points in the night when disk traffic is extremely light and the host goes a few seconds without sending disk requests. The only thing moving at that point is the LACP PDUs, but the host is sending them out so slowly that the switch thinks the link died, so it bounces it, the host freaks out and re-establishes it, and your logs get flooded with nasty-looking stuff.

That issue seems to be exacerbated by problem 2: their LACP load-balancing implementation doesn't work. All outbound traffic from a single IP homes to a single link on the aggregate, despite going to different target IPs and ports, which their balancing algorithm should be dividing up. I get balanced traffic back in from the storage appliance IPs, but one of the two outbound links is completely dead, 0 packets/s, and no amount of tweaking will get it to pass traffic. It fails over just fine, but the balancing is non-existent.

The CSR is hinting that 6.5 will fix the long timeout issue on the vdSwitches, but no word on the balancing algorithm. How the hell does this poo poo get past QA?

That's been a big problem as of late.

That said, are you using iSCSI? If so, stop using LACP and swap to MPIO. Also make sure your switch and hosts agree on your load-balancing hash. Lots of physical switches default to src-mac and need to be manually changed.

Also, if you're going to multiple switches, make sure they support MLAG or vPC or something equivalent and that it's properly configured. You should be able to adjust the timers on the switch side, and link failures should still be detected instantly.
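
On the timeout half of that, the per-host change described above can at least be scripted so it's painless to re-apply after a reboot. A hedged sketch via PowerCLI's esxcli wrapper; the host/vDS names are hypothetical and the argument keys can vary by ESXi build, so confirm them with the .Help() call first:

    # hedged sketch -- host/vDS names hypothetical; confirm argument names with:
    #   $esxcli.network.vswitch.dvs.vmware.lacp.timeout.set.Help()
    Connect-VIServer vcenter.example.com
    $esxcli = Get-EsxCli -VMHost (Get-VMHost 'esx01.example.com') -V2
    $esxcli.network.vswitch.dvs.vmware.lacp.config.get.Invoke()   # current LACP state
    # assumed convention: timeout 1 = short/fast, 0 = long/slow
    $esxcli.network.vswitch.dvs.vmware.lacp.timeout.set.Invoke(@{ vds = 'dvSwitch-Storage'; lagid = 0; timeout = 1 })

As the post says, this lives on the host rather than in the vDS definition, so it has to be re-run after each reboot.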

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!
The distributed vSwitch can honor/set 802.1p values. Turn on Network I/O Control, I believe, and you should see it in the dvSwitch settings somewhere.
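
If you'd rather script it, here's a hedged sketch of the API call I believe sits behind that toggle; the switch name is hypothetical:

    # hedged sketch: enable Network I/O Control on a vDS -- switch name is hypothetical
    Connect-VIServer vcenter.example.com
    $vds = Get-VDSwitch 'dvSwitch-Prod'
    # EnableNetworkResourceManagement is the vSphere API method behind the UI checkbox
    $vds.ExtensionData.EnableNetworkResourceManagement($true)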

What's the Arista sending you the traffic with? How is the machine freaking out?

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

MC Fruit Stripe posted:

It has been so long since I've done it that I can't even think of the answer clearly - what functions do I lose if I have ESXi installed with access to only local storage? We're giving up what, vMotion, HA, DRS? (e: FT...)

HA and HA-related things.

You can do svmotion and regular vMotion without shared storage now, as of 5.5 if I recall.
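
A hedged PowerCLI sketch of that shared-nothing move, changing host and datastore in one shot; all names are hypothetical:

    # hedged sketch: shared-nothing vMotion (compute + storage together) -- names hypothetical
    Connect-VIServer vcenter.example.com
    Move-VM -VM (Get-VM 'app-01') `
            -Destination (Get-VMHost 'esx02.example.com') `
            -Datastore (Get-Datastore 'local-ds-esx02')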

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

H2SO4 posted:

Keep in mind that (at least the last time I researched this) distributed virtual switches do not like HA stuff like VRRP/HSRP. It's something to do with the fact that they don't have a CAM table but instead use VM metadata to decide where traffic for a given MAC address goes. If anyone else knows differently I'd be interested to hear, since the only other sort of workaround for this behavior I'm aware of is putting everything in promiscuous mode.

Enable forged transmits and MAC changes and you should be fine. Basically the only checking VMware does is to make sure the MAC the VM is using matches what's in the VMX, to prevent a few different types of attacks.

On a distributed vSwitch you can do this on a per-VM basis, so you can select each VM in the VRRP group and enable it just for them.
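
The portgroup-level version is a one-liner in PowerCLI if you'd rather keep the whole VRRP portgroup consistent; the portgroup name is hypothetical:

    # hedged sketch: allow forged transmits and MAC changes on a portgroup
    # portgroup name is hypothetical
    Connect-VIServer vcenter.example.com
    Get-VDPortgroup 'pg-vrrp-members' |
        Get-VDSecurityPolicy |
        Set-VDSecurityPolicy -ForgedTransmits $true -MacChanges $true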

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

Potato Salad posted:

LACP exists to cause you more pain, suffering, and production losses than the time it takes to set up channels/aggregation by hand, every single time.

It is worth taking a moment to just write some logic to orchestrate static aggregation with whatever tools you use to manage your switches and your compute.

What are people actually getting wrong? There's not a whole lot to set up on LACP beyond timers (for which most platforms only have one option) and whether or not the interfaces are going to actively send LACP PDUs. I've probably seen more people get static link aggregation wrong, where maybe one side has the wrong load-distribution algorithm set. I think the dvSwitch itself supports something like 26 different options, not all of which exist on all switching platforms.

That said, I almost never bother with link aggregation to hypervisors anymore. 10 gig is cheap, and source-based load distribution doesn't require upstream switch configuration.
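
For reference, flipping a portgroup to that source-port-based policy is trivial; a hedged PowerCLI sketch with a hypothetical portgroup name:

    # hedged sketch: route based on originating virtual port -- no switch-side config needed
    # portgroup name is hypothetical
    Connect-VIServer vcenter.example.com
    Get-VDPortgroup 'pg-vm-traffic' |
        Get-VDUplinkTeamingPolicy |
        Set-VDUplinkTeamingPolicy -LoadBalancingPolicy LoadBalanceSrcId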

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

SlowBloke posted:

NSX-V is their out-of-the-box solution, which might be a bit overkill if the VCSA was too much overhead for you.

NSX-V is EOL; you should be looking at NSX-T now, which is a complete rewrite and much, much better.

Given the use case though, just install a router VM.

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

Zorak of Michigan posted:

God save us all, but OVM can actually solve licensing problems with Oracle RDBMS. It's one of the few on-prem solutions for which they allow sub-capacity licensing. I've pondered it but concluded that the extra time and energy spent dealing with it would exceed the cost of just eating full-capacity licensing.

Just build a separate vSphere/whatever hypervisor cluster off to the side and license those sockets. You don't need a separate vCenter or an air gap, despite what sales says, and if you push back they'll eventually cave.

They also only license for allocated capacity with cloud providers, so that's an option as well.
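
Carving out that dedicated cluster is quick; a hedged PowerCLI sketch where every name is hypothetical (hosts may need to be in maintenance mode to move):

    # hedged sketch: a dedicated, fully licensed cluster for Oracle workloads -- names hypothetical
    Connect-VIServer vcenter.example.com
    $cluster = New-Cluster -Location (Get-Datacenter 'DC1') -Name 'Oracle-Licensed' -DrsEnabled
    # move the licensed hosts in; Oracle VMs then only ever run on these sockets
    Get-VMHost 'esx-ora-01.example.com','esx-ora-02.example.com' |
        Move-VMHost -Destination $cluster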
