YOLOsubmarine
Oct 19, 2004

When asked which Pokemon he evolved into, Kamara pauses.

"Motherfucking, what's that big dragon shit? That orange motherfucker. Charizard."

Internet Explorer posted:

Wicaeed, check this article to make sure your memory settings are appropriate.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2021302

Also, I would do 2 vCPUs instead of 3. Although it's definitely better than it used to be, I don't see the need for odd vCPU assignments.

Good advice here; the extra vCPUs may ultimately cause more harm by driving up co-stop values.

Additionally, you may want to check the statistics collection level and the retention settings for Tasks and Events to make sure they aren't set too high. Also, since you moved your DB a while back, you should verify that your rollup jobs got recreated.
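
If you want to eyeball those collection levels without clicking around the client, a minimal pyVmomi sketch will dump them (hostname and credentials are placeholders, and it skips cert validation, so lab use only):

code:
import ssl
from pyVim.connect import SmartConnect, Disconnect

# Skip certificate validation -- fine for a lab, not for anything else.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
try:
    perf_mgr = si.RetrieveContent().perfManager
    for interval in perf_mgr.historicalInterval:
        print("{}: sample every {}s, keep {}s, level {}, enabled={}".format(
            interval.name, interval.samplingPeriod, interval.length,
            interval.level, interval.enabled))
finally:
    Disconnect(si)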


Hadlock
Nov 9, 2004

How is RNG handled by Hyper-V?

Supposedly, starting with Windows 8 and Server 2012 guests, they have access to the RNG, but the Microsoft crypto service does not have access to that or to the host's network activity. It's all virtual anyway.

Is RNG in VMs just a total poo poo show or what?

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

evol262 posted:

I'm gonna disagree with this 100%. Improved collaboration, breaking down the walls between armed camps of internal stakeholders who pass the buck back and forth between systems/network/database/dev whenever anything goes wrong, a shorter development pipeline, continuous integration to feed automated builds/deployments from that pipeline, reproducible deployments, infrastructure as code, and all the rest of the stuff that makes "devops" tick as a movement has nothing to do with product support. Moreover, it's that the best people to support a product are everyone involved, all at once, so the problems can quickly be found, fixed, and a new build pushed out instead of waiting 4 days for ops to escalate that ticket to dev who punts it to network.

Bare-metal backups don't fit infrastructure as code. PXE-booting a Foreman discovery image (or whatever) or installing from a Glance image via Ironic or something, yes. Cattle don't need backups.

This has what to do with cattle? And why do you have supervisor processes instead of swarming service discovery, and shooting nodes that fall out of quorum or the swarm?

The nice process improvements that come with CI and infra as code et al are all there to support the notion that the people who develop the code must have skin in the game of the operations around that code, not simply throw it over a wall

It then turns out that this gets really annoying without all those things, so we adopt them as a way to scratch our itch

Devops is an attitude change and the rest is just gravy



Re supervisor processes: what do you think actually boots poo poo out of quorum? If you have cattle and a cow acts stupid, you shoot it and get a new cow

For us it was so fast that it meant we didn't address the underlying cause

evol262
Nov 30, 2010
#!/usr/bin/perl

Malcolm XML posted:

The nice process improvements that come with CI and infra as code et al are all there to support the notion that the people who develop the code must have skin in the game of the operations around that code, not simply throw it over a wall
Still disagreeing with this. It's not about dev having "skin in the game". They've almost always had "skin in the game", in the sense that development was still available for escalations from support at many shops under traditional development models, etc.

The process improvements that come with CI and all the rest are to support knowledge transfer both ways. Dev gets to work with the same configurations ops uses in production instead of hacking together their own environments with basically no support and hoping it matches up. Ops has reasonable certainty (through CI) that builds actually work on those environments.

It's really, really easy from an admin/ops perspective to say "well, the problem was always dev and their BS, blah blah", but the problem was every team involved. It is an attitude change. For the org. Not just some parts of it.

Malcolm XML posted:

Re supervisor processes: what do you think actually boots poo poo out of quorum? If you have cattle and a cow acts stupid, you shoot it and get a new cow

For us it was so fast that it meant we didn't address the underlying cause
This is calling effect cause. In general, I think of supervisor processes as supervisord, monit, systemd, etc. Something had a problem, it falls out of quorum, and the supervisor restarts it. But the supervisor is broadly ignorant of whether it's in quorum or not. It just makes sure the process is running, handles daemonization, capturing stdout/stderr/etc. In that sense, your process crashes, systemd (or supervisord or whatever) restarts it, it re-joins the cluster or that microservice works again or whatever. But if that's what you mean, we're talking about two different things.

Cluster managers can handle killing poo poo that's out of quorum. But the idea is rather that your application is resilient and transparently handles failures. So some nodes fell out of quorum. So what? They're not in quorum (or the load balancer can't reach it, or whatever), so they're not an active part of your system anyway. Shooting the nodes is another level of management on top of all of that, and that's something that'll almost never mask races, because it's probably a homegrown system which checks nodes running from that image which aren't part of the active load balancer or redis sentinel or some other way to monitor, then terminates them with nova-cli or aws/ec2-cli or some other process. But you can just stop that and go look at the misbehaving nodes if you really care. Without something hand-holding the processes on the cattle (like supervisord on a crash-prone app), it's really hard to mask a race.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

evol262 posted:

Cluster managers can handle killing poo poo that's out of quorum.
Unless the cluster manager is in-process like basically every distributed system ever written

Vulture Culture fucked around with this message at 18:03 on Jun 27, 2015

UberJumper
May 20, 2007
woop
Are there any decent alternatives to vCenter out there for managing the free ESXi hypervisor? We currently have the free ESXi hypervisor on 4 machines that are mostly used for things like build agents and testing.

After talking to a VMware sales rep, we were told that vCenter cannot manage the free ESXi hosts, and we were quoted around $5k just to bring everything under vCenter, plus even more per server. $5k is already too much for us since we're a relatively small startup.

I am mostly interested in tools for making and managing VM templates. Currently I have been exporting machines to OVF and deploying the OVF templates through ovftool. Unfortunately it's unbearably slow, since I can't store OVF templates on the actual datastore and am instead forced to store them on a network share.

I tried to get Veeam's management tools going, but unfortunately they don't seem to support the free hypervisor.

Any suggestions or recommendations? We don't care about data redundancy or backup solutions; we mostly just want to be able to deploy VM templates in a timely fashion.
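
Not a fix for the datastore limitation, but if you end up scripting the ovftool runs anyway, a rough Python wrapper looks something like this (every path, name, and credential below is a placeholder):

code:
# Sketch: deploy an OVF template to a standalone ESXi host with ovftool.
import subprocess

def deploy_ovf(ovf_path, vm_name, datastore, esxi_host, user, password):
    cmd = [
        "ovftool",
        "--acceptAllEulas",
        "--name={}".format(vm_name),
        "--datastore={}".format(datastore),
        "--diskMode=thin",
        ovf_path,
        "vi://{}:{}@{}".format(user, password, esxi_host),
    ]
    # ovftool streams progress to stdout; check=True raises if it exits non-zero.
    subprocess.run(cmd, check=True)

deploy_ovf(r"\\fileserver\templates\build-agent.ovf", "build-agent-05",
           "datastore1", "esxi01.example.com", "root", "password")

It won't make the network share any faster, but it at least makes the deployments repeatable enough to kick off from a build job.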

evol262
Nov 30, 2010
#!/usr/bin/perl

Vulture Culture posted:

Unless the cluster manager is in-process like basically every distributed system ever written

STONITH is handled from the part of the cluster that's still in quorum through mechanisms which aren't dependent on any part of the out-of-quorum system (ideally -- IPMI, power management, killing it from the hypervisor, etc).

If your whole cluster shits the bed, yes. If it's just a few nodes, it doesn't matter.

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

evol262 posted:

Still disagreeing with this. It's not about dev having "skin in the game". They've almost always had "skin in the game", in the sense that development was still available for escalations from support at many shops under traditional development models, etc.

The process improvements that come with CI and all the rest are to support knowledge transfer both ways. Dev gets to work with the same configurations ops uses in production instead of hacking together their own environments with basically no support and hoping it matches up. Ops has reasonable certainty (through CI) that builds actually work on those environments.

It's really, really easy from an admin/ops perspective to say "well, the problem was always dev and their BS, blah blah", but the problem was every team involved. It is an attitude change. For the org. Not just some parts of it.

This is calling effect cause. In general, I think of supervisor processes as supervisord, monit, systemd, etc. Something had a problem, it falls out of quorum, and the supervisor restarts it. But the supervisor is broadly ignorant of whether it's in quorum or not. It just makes sure the process is running, handles daemonization, capturing stdout/stderr/etc. In that sense, your process crashes, systemd (or supervisord or whatever) restarts it, it re-joins the cluster or that microservice works again or whatever. But if that's what you mean, we're talking about two different things.

Cluster managers can handle killing poo poo that's out of quorum. But the idea is rather that your application is resilient and transparently handles failures. So some nodes fell out of quorum. So what? They're not in quorum (or the load balancer can't reach it, or whatever), so they're not an active part of your system anyway. Shooting the nodes is another level of management on top of all of that, and that's something that'll almost never mask races, because it's probably a homegrown system which checks nodes running from that image which aren't part of the active load balancer or redis sentinel or some other way to monitor, then terminates them with nova-cli or aws/ec2-cli or some other process. But you can just stop that and go look at the misbehaving nodes if you really care. Without something hand-holding the processes on the cattle (like supervisord on a crash-prone app), it's really hard to mask a race.

The core issue here is devs who did not have enough skin in the game to care about having a short feedback loop between their users'* needs and themselves

(* users = whoever uses your product: actual customers, other product teams, other devs, etc)

Before adopting the attitude that the flaws of a product and its operations have consequences for everyone, not just ops, there's no stimulus for change to improve ops

Malcolm XML
Aug 8, 2009

I always knew it would end like this.
Our app was so effective at transparently handling failures that it ended up masking and recovering from a transient failure caused by a race condition, to the point where we didn't bother investigating it. When we finally got around to it, we discovered just how deep an issue it was.

That's the flip side to ultra-fast reboot and recovery: if it's cheap enough, it allows you to ignore core app issues.

(and it can only really be done with cattle-class servers/VMs/nodes)

Malcolm XML
Aug 8, 2009

I always knew it would end like this.
And by race I mean a race condition internal to each node that caused them to lock up after a while, taking them out of quorum long enough that an external process rebooted them.


Anyway, the point is that resilience is good and pets make that hard, b/c it's hard to shoot your pets when they get rabies.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

evol262 posted:

STONITH is handled from the part of the cluster that's still in quorum through mechanisms which aren't dependent on any part of the out-of-quorum system (ideally -- IPMI, power management, killing it from the hypervisor, etc).

If your whole cluster shits the bed, yes. If it's just a few nodes, it doesn't matter.
STONITH tends to be implemented on shared-something architectures that lock a resource -- active/passive clusters sharing out a SAN filesystem, for example. A simple way to detect that a node needs to be killed is to, as an elected master, try to claim a resource and fail to get a lock. In a shared-nothing system using Paxos or Raft or whatever where you're solely reliant on cluster heartbeats to determine state, you can't really tell whether the missing node still thinks it's a master or not. It would be really dumb to send IPMI power controls to your cluster nodes every time you have a process crash that causes a node to stop checking into cluster membership just because you have no idea whether or not it's in a good state.
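
As a toy illustration of that lock-claim check, with a file lock on shared storage standing in for whatever resource the cluster actually arbitrates (the path is made up):

code:
import fcntl

def claim_master_lock(path="/shared/cluster/master.lock"):
    # Non-blocking claim: if another node still holds the lock, we find out
    # immediately instead of hanging.
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f  # keep the handle open to hold the lock
    except BlockingIOError:
        f.close()
        return None

lock = claim_master_lock()
if lock is None:
    print("claim failed -- the old master may still hold it; that's your fencing signal")
else:
    print("claim succeeded -- safe to act as master")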

So, it's a very uncommon pattern in shared-nothing architectures designed for cloud deployment, like most modern distributed systems. You almost never see it with something like Elasticsearch, Cassandra, Riak, Ceph, Gluster, etc. The simplest (and least error-prone, and often safest) kind of fencing is "shut the gently caress down if it looks even a little bit like something is wrong."

Even in the best case, STONITH comes with a lot of external dependencies that aren't validated by the cluster manager, and it's much easier to destabilize a working system with STONITH than it is to properly handle your edge cases from a software perspective. Even VMware vSphere makes the nodes responsible for shooting themselves with Isolation Response because STONITH is such an aggravating, error-prone thing to rely on.

Vulture Culture fucked around with this message at 06:51 on Jun 29, 2015

bigmandan
Sep 11, 2001

lol internet
College Slice
Has anyone here experienced an issue with the vSphere web client (5.5) where entering text can sometimes result in non-printable characters being saved in strings and VM file names (specifically the DELETE character)? The issue only seems to happen in Chrome as far as I can tell.

evol262
Nov 30, 2010
#!/usr/bin/perl

Vulture Culture posted:

Even in the best case, STONITH comes with a lot of external dependencies that aren't validated by the cluster manager, and it's much easier to destabilize a working system with STONITH than it is to properly handle your edge cases from a software perspective. Even VMware vSphere makes the nodes responsible for shooting themselves with Isolation Response because STONITH is such an aggravating, error-prone thing to rely on.
We've all seen network problems where nodes alternate rejoining the cluster then shooting another node in the head until an admin manually intervenes. But STONITH having a lot of external dependencies isn't even the point, and I'm not advocating STONITH in 2015. That said, I don't agree about Isolation Response being different. It's not shooting the "other node", but STONITH is fencing, and isolation response is self-fencing. I used STONITH as a catchall for self-fencing, geo-fencing, external-fencing, poison pills, real STONITH, and every other recovery mechanism.

Other systems have done this for years, as has VMware, but sitting there and doing nothing (or killing the process) is self-fencing. Gluster fences. Ceph fences with eviction and MDS blacklisting. Cassandra fences itself and waits for an admin with nodetool. Elasticsearch just goes split-brain and makes the ops people :suicide:

Vulture Culture posted:

STONITH tends to be implemented on shared-something architectures that lock a resource -- active/passive clusters sharing out a SAN filesystem, for example. A simple way to detect that a node needs to be killed is to, as an elected master, try to claim a resource and fail to get a lock. In a shared-nothing system using Paxos or Raft or whatever where you're solely reliant on cluster heartbeats to determine state, you can't really tell whether the missing node still thinks it's a master or not. It would be really dumb to send IPMI power controls to your cluster nodes every time you have a process crash that causes a node to stop checking into cluster membership just because you have no idea whether or not it's in a good state.

So, it's a very uncommon pattern in shared-nothing architectures designed for cloud deployment, like most modern distributed systems. You almost never see it with something like Elasticsearch, Cassandra, Riak, Ceph, Gluster, etc. The simplest (and least error-prone, and often safest) kind of fencing is "shut the gently caress down if it looks even a little bit like something is wrong."
This was basically my entire point. It would be really dumb to terminate nodes because they stop checking into the cluster. Very few "cloud" applications have any kind of a cluster or process manager that'll automatically shoot anything (and the IaaS poo poo doesn't do it by itself). Nodes fall out of the swarm, and either never reconnect or do reconnect and successfully self-heal (or fail to self-heal). Most of these systems don't start back up after they decide self-healing has failed. They sit and wait for intervention (or a bullet). There aren't a ton of race conditions that are gonna be masked by any kind of supervisory process. It'll fail, and it'll just sit there until someone notices. Where does a process supervisor fit in? Where does it hide races?

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

evol262 posted:

I used STONITH as a catchall for self-fencing, geo-fencing, external-fencing, poison pills, real STONITH, and every other recovery mechanism.
I see "DevOps" being used to mean "tool engineer," and boy oh boy, everyone's sick of hearing me go down this road again. Fencing is a quadrilateral and STONITH is a rhombus. Fencing is good. Self-fencing is good. STONITH is not a synonym for either of these things.

evol262 posted:

There aren't a ton of race conditions that are gonna be masked by any kind of supervisory process. It'll fail, and it'll just sit there until someone notices. Where does a process supervisor fit in? Where does it hide races?
If you're writing really robust, generalizable code, I agree. But I get the very clear impression that you've never worked on code for a startup. You're not going to spend days or even hours working on robust reconnection logic for a service when you might toss out that message broker, or that database, or that method of handling sharding, or that language, or that entire product a week or a month down the road in response to some unanticipated business or technical need. You throw your hands up, you say "gently caress it," and let the service start over. Same thing if your services start out of order, or whatever the "race" (probably not a real race) situation is. Duct tape is the greatest invention in the history of mankind.

I've done the same thing. Service starts and a dependent upstream service isn't running yet? Exit and try again.

Vulture Culture fucked around with this message at 18:15 on Jun 30, 2015

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Is there a good way to figure out how much overhead is being consumed by the iSCSI or NFS software initiator on a 5.5 host? I'm averaging around 4k iops on NFS with bursts upwards of 6-7k across 3 hosts in the cluster. I will be doing a hardware refresh in the next few months and am wondering if I am getting anywhere near the point where I should be messing around with hardware initiators or iSCSI offload with fancy emulex/qlogic cards instead of the more typical broadcom/intel stuff. I was expecting an NFS and iSCSI object under the System performance counters but nothing there jumps out at me.

bull3964
Nov 18, 2000

DO YOU HEAR THAT? THAT'S THE SOUND OF ME PATTING MYSELF ON THE BACK.


I can't answer your question, but I have done 80,000 IOPS from a single host (actually single VM on a host) before with the software iSCSI initiator without issue.

KS
Jun 10, 2003
Outrageous Lumpwad
Gave up on some pretty expensive QLE8242 hardware initiators because they would not work reliably. Most notably they could not survive a SAN controller failure.

Just use the software initiator. The community is so much bigger, bugs are fixed much faster, and it doesn't limit performance at all.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Some hardware initiators (e.g. Broadcom) also don't support jumbo frames, which is pretty hilarious when you get up to that 80,000 IOPS mark.

mayodreams
Jul 4, 2003


Hello darkness,
my old friend

bigmandan posted:

Has anyone here experienced an issue with the vSphere web client (5.5) where entering text can sometimes result in non-printable characters being saved in strings and VM file names (specifically the DELETE character)? The issue only seems to happen in Chrome as far as I can tell.

I've found Chrome (my primary browser) to be very lovely for the vCenter web client in 5.x. I've been using Firefox with good results. The vCenter 6.0 web client does work better with Chrome, though.

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Vulture Culture posted:

Some hardware initiators (e.g. Broadcom) also don't support jumbo frames, which is pretty hilarious when you get up to that 80,000 IOPS mark.

Yeah, I'm definitely paying the extra $50 for the x710 interfaces over the Broadcom-Whatevers.

Wicaeed
Feb 8, 2005

NippleFloss posted:

Good advice here; the extra vCPUs may ultimately cause more harm by driving up co-stop values.

Additionally, you may want to check the statistics collection level and the retention settings for Tasks and Events to make sure they aren't set too high. Also, since you moved your DB a while back, you should verify that your rollup jobs got recreated.

:woop:

Turned out that one of the JVM heap sizes was set to about 1/4 of what is recommended for a deployment of our size.

I wouldn't call it "snappy" but it's now about twice as fast as before :)

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams
So silly DevOps question: does the methodology have any value outside of delivering an in-house software product to "customers" (be those internal, external, whatever)? I think there's some value in automating the creation of new infrastructure and having more easily replaced cattle and fewer hand-curated pets, but at the end of the day I don't interface with any devs whatsoever; I'm only deploying black boxes that I can't really look inside.

evol262
Nov 30, 2010
#!/usr/bin/perl

Vulture Culture posted:

If you're writing really robust, generalizable code, I agree. But I get the very clear impression that you've never worked on code for a startup. You're not going to spend days or even hours working on robust reconnection logic for a service when you might toss out that message broker, or that database, or that method of handling sharding, or that language, or that entire product a week or a month down the road in response to some unanticipated business or technical need. You throw your hands up, you say "gently caress it," and let the service start over. Same thing if your services start out of order, or whatever the "race" (probably not a real race) situation is. Duct tape is the greatest invention in the history of mankind.
There's enough duct tape all over the place in non-startup orgs, and I've encountered at least 4 races in shipping code in RHEL between udev, multipath, and various services. It's not just a startup problem.

But I'm arguing for the opposite, really. Service starts up and can't connect to the metadata server or redis or whatever? Loop a couple of times until it can, restart it, whatever. Great.

I wouldn't spend days or hours on robust reconnection logic, either. And I probably wouldn't spend any time on self-healing. So why am I reconnecting? Once it's up (the first time), register to consul. Writing a health check for your app is trivial, and is already part of the deployment. If it dies, do I trust that a restart from systemd or supervisord or whatever won't gently caress things up? Probably not. It's :smithcloud:. Build applications that can take failures. So it dies. The health check fails. Consul doesn't direct anything there anymore. Or whatever your preferred service discovery method is, and those are stupid easy to switch out. You're probably going to be registering to some kind of load balancer or dynamic cluster or swarm anyway, so you can replace consul with mesos or kubernetes or your poison.
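
For anyone who hasn't looked at it, registering a service plus its health check against the local Consul agent really is about this much code (the service name, port, and /health endpoint are invented):

code:
import requests

registration = {
    "Name": "billing-api",
    "ID": "billing-api-1",
    "Port": 8080,
    "Check": {
        "HTTP": "http://localhost:8080/health",
        "Interval": "10s",
        "Timeout": "2s",
    },
}

# The local agent listens on 8500 by default and gossips the registration out.
resp = requests.put("http://127.0.0.1:8500/v1/agent/service/register", json=registration)
resp.raise_for_status()

Once the check goes critical, Consul stops handing that node out; deregistering cleanly is one more PUT to /v1/agent/service/deregister/billing-api-1, or you just let the node die.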

From there, you can spin up new nodes automatically (if there aren't enough for whatever your service definition is), and figure out what's dead, so you can either let some reaper come through and clean them up after they've been in a failed state for 24 hours (or 3 hours, or whatever your budget can handle for wasting CPU time on a busted node) or send dev/ops to go poke at them to see if there's a specific cause.

Dev/ops time is expensive. CPU time is cheap. What are your odds of a service that failed (segfault, dump core, stack trace, whatever) catastrophically enough that it needs to be restarted after running properly for 20 minutes or 20 hours coming back up properly on a dirty machine versus just starting from a blank slate? What are the chances it could harm data you actually care about (in your db or kv store or whatever) versus a clean machine? It's a bad tradeoff.

Where's the case for restarting services that die after they've already been running for a while instead of just automatically provisioning a new node that you know isn't busted in any way to join the swarm?
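
The reaper half of that isn't much code either. A rough boto3 sketch, assuming whatever marks a node dead also tags it (the tag names and the 3-hour budget are made up):

code:
from datetime import datetime, timedelta, timezone

import boto3

MAX_FAILED_AGE = timedelta(hours=3)

ec2 = boto3.client("ec2")
doomed = []

pages = ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "tag:state", "Values": ["failed"]},
             {"Name": "instance-state-name", "Values": ["running"]}])

for page in pages:
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if "failed-at" not in tags:
                continue
            # failed-at is assumed to be ISO 8601 with an offset, e.g. 2015-06-29T12:00:00+00:00
            failed_at = datetime.fromisoformat(tags["failed-at"])
            if datetime.now(timezone.utc) - failed_at > MAX_FAILED_AGE:
                doomed.append(instance["InstanceId"])

if doomed:
    ec2.terminate_instances(InstanceIds=doomed)
    print("reaped:", doomed)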

stubblyhead
Sep 13, 2007

That is treason, Johnny!

Fun Shoe

NippleFloss posted:

Missing performance data can be a consequence of the SQL performance rollup jobs being missing. It's a common problem when the SQL DB has been moved. Might want to check that.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004382

Are all hosts/guests missing data at the same times?

This appears to have resolved itself; I can't find any VMs that are doing this anymore. Before, though, it was some VMs but not all. I think there were at least a few on all our hosts, but we only have three in the cluster so that could just be due to small sample size. We're using the vCenter appliance with an embedded database, so I don't think the rollup jobs thing was a factor.

BangersInMyKnickers posted:

Is there a good way to figure out how much overhead is being consumed by the iSCSI or NFS software initiator on a 5.5 host?

We don't have these in our lab so I can't confirm, but esxtop might provide you with that info.

Cidrick
Jun 10, 2001

Praise the siamese

stubblyhead posted:

We don't have these in our lab so I can't confirm, but esxtop might provide you with that info.

Off the top of my head I don't know about iSCSI, but with FC, esxtop (in the storage device tab) will display a KAVG value that shows you how much time disk operations are spending in the hypervisor kernel layer.
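
If you'd rather not eyeball it live, esxtop's batch mode (something like esxtop -b -d 5 -n 60 > stats.csv on the host) dumps the same counters to CSV, and you can pull the kernel-latency columns out afterwards. Rough sketch; the substring it matches is my guess at the counter name, so check the header row first:

code:
import csv

with open("stats.csv", newline="") as fh:
    reader = csv.reader(fh)
    header = next(reader)
    # Physical disk kernel latency columns; adjust the substring to taste.
    cols = [i for i, name in enumerate(header) if "Kernel MilliSec/Command" in name]
    for i in cols:
        print("matched column:", header[i])
    for row in reader:
        # First column is the sample timestamp.
        print(row[0], [row[i] for i in cols])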

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

FISHMANPET posted:

So silly DevOps question: does the methodology have any value outside of delivering an in-house software product to "customers" (be those internal, external, whatever)? I think there's some value in automating the creation of new infrastructure and having more easily replaced cattle and fewer hand-curated pets, but at the end of the day I don't interface with any devs whatsoever; I'm only deploying black boxes that I can't really look inside.
Infrastructure-as-code has value beyond DevOps; just don't call it DevOps. It's going to give the Ops team tremendous agility for when Dev surprises them by throwing stuff over the wall.

But long term, it's better not to have the wall in the first place. Then Devs can benefit from Ops' infrastructure-as-code by using it to spin up their own environments, and everyone will benefit from there being few (if any) environmental differences between Development / Staging / Production.

adorai
Nov 2, 2002

10/27/04 Never forget
Grimey Drawer

minato posted:

Infrastructure-as-code has value beyond DevOps; just don't call it DevOps. It's going to give the Ops team tremendous agility for when Dev surprises them by throwing stuff over the wall.

But long term, it's better not to have the wall in the first place. Then Devs can benefit from Ops' infrastructure-as-code by using it to spin up their own environments, and everyone will benefit from there being few (if any) environmental differences between Development / Staging / Production.
His question is likely for shops that don't have internal development. My company, for instance, doesn't really develop anything in-house; we just use off-the-shelf products. It's the kind of place where IT is a "cost center".

goobernoodles
May 28, 2011

Wayne Leonard Kirby.

Orioles Magician.

BangersInMyKnickers posted:

Is there a good way to figure out how much overhead is being consumed by the iSCSI or NFS software initiator on a 5.5 host? I'm averaging around 4k iops on NFS with bursts upwards of 6-7k across 3 hosts in the cluster. I will be doing a hardware refresh in the next few months and am wondering if I am getting anywhere near the point where I should be messing around with hardware initiators or iSCSI offload with fancy emulex/qlogic cards instead of the more typical broadcom/intel stuff. I was expecting an NFS and iSCSI object under the System performance counters but nothing there jumps out at me.

stubblyhead posted:

We don't have these in our lab so I can't confirm, but esxtop might provide you with that info.

http://www.yellow-bricks.com/esxtop/

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams

adorai posted:

His question is likely for shops that don't have internal development. My company, for instance, doesn't really develop anything in-house; we just use off-the-shelf products. It's the kind of place where IT is a "cost center".

Exactly. My team manages Active Directory and Microsoft System Center and random requests for Windows servers, but there are no developers handing us anything. Are we the knuckle-draggers of IT, destined to be replaced by some fancy cloud service?

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
The value of all that DevOps stuff is to make changes easier. If you're not making changes that often because it's just off-the-shelf software wrapped up with some config, then I don't see much benefit beyond (say) IaaS portability or disaster recovery scenarios.

Docjowles
Apr 9, 2009

Long spergpost ahoy :spergin:

DevOps has come to be associated with fancy cloud apps and infrastructure-as-code, but it didn't necessarily start out that way. Early on, it was characterized by the terms "culture, automation, measurement, and sharing" (CAMS). From that angle, there are still ideas and practices associated with DevOps that pretty much any IT shop can benefit from. The way you deploy them will be very different, but the principle still applies.

Speed and agility through testing and automation. If you have a test environment where you can make a change, and run some sanity checks that it did what you want / didn't break poo poo, you'll have higher confidence that what you're doing is correct. Maybe that allows you to make your change control process less onerous because "email proposed change to Joe. Wait for weekly change control meeting. Get yes/no from Joe" becomes "run your tests. Do they pass? Auto-approved."
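
That "run your tests, auto-approved" gate can literally be a smoke-test script whose exit code the change process keys off of. Toy sketch, with invented endpoints:

code:
import sys

import requests

CHECKS = [
    ("web frontend", "https://staging.example.com/healthz"),
    ("api", "https://staging-api.example.com/healthz"),
]

failures = []
for name, url in CHECKS:
    try:
        resp = requests.get(url, timeout=5)
        if resp.status_code != 200:
            failures.append("{}: HTTP {}".format(name, resp.status_code))
    except requests.RequestException as exc:
        failures.append("{}: {}".format(name, exc))

if failures:
    print("change NOT auto-approved:")
    for failure in failures:
        print("  -", failure)
    sys.exit(1)

print("all checks passed; auto-approved")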

Self-service. Find ways to allow users to do safe, routine things themselves. A classic example is tying HR's systems into AD so they can provision new hires with accounts automatically without having to open a ticket. Let users reset their own passwords. Allow people to create and delete their own VMs, if that applies in your environment. Get the dumb requests off your plate so you can focus on the harder problems.
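
For flavor, the HR-to-AD piece can start as small as this ldap3 sketch that stubs out an account from an HR record (the OU, domain, and service account are invented, and enabling the account and setting a password are left out since those need an LDAPS bind):

code:
from ldap3 import Server, Connection, NTLM

def provision(first, last, sam):
    dn = "CN={} {},OU=Staff,DC=corp,DC=example,DC=com".format(first, last)
    attrs = {
        "givenName": first,
        "sn": last,
        "sAMAccountName": sam,
        "userPrincipalName": "{}@corp.example.com".format(sam),
        "displayName": "{} {}".format(first, last),
    }
    server = Server("dc01.corp.example.com")
    # The connection binds on entry and unbinds on exit.
    with Connection(server, user="CORP\\svc-hrsync", password="password",
                    authentication=NTLM) as conn:
        conn.add(dn, object_class=["top", "person", "organizationalPerson", "user"],
                 attributes=attrs)
        return conn.result

print(provision("Jane", "Doe", "jdoe"))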

Collaboration, trust, and breaking down silos (ties into the above as well). Spend time getting to know your coworkers in other departments. Make sure your team has a seat at the table when the business is planning a new project. That way you know what's coming, can start prepping for it, and raise concerns if you see a red flag that you can't handle without more resources or something. Likewise invite other department heads to your own planning meeting so they aren't shocked when you upgrade to 2012R2 and all their poo poo breaks, which they could have told you if you'd bothered to ask.

Work in a data-driven way. Have great monitoring and graphing tools in place, and use them to make decisions. Make it easy for other departments to monitor and measure things they care about, and display all of that publicly.

Maybe a lot of that just sounds like being good at your job and not being a dick. One of my favorite definitions of DevOps is "giving a poo poo about your job". So yes, that is a big part of it :) Those are just some random ideas, maybe they still don't apply to you all that much. But hopefully that's helpful in some small way?

Inspector_666
Oct 7, 2003

benny with the good hair
What does "infrastructure-as-code" mean? I find a lot of X-as-X monikers make almost no sense to me.

Docjowles
Apr 9, 2009

Inspector_666 posted:

What does "infrastructure-as-code" mean? I find a lot of X-as-X monikers make almost no sense to me.

More or less what it sounds like. Building and thinking about your infrastructure the way a developer does code. You want it to be testable, repeatable, and version-controlled. Maybe you define all of the hosts that you need as an Amazon CloudFormation template, which is just a fancy text file. You can spin this up in a test account to verify that it does what you want. You can check it into git so you know exactly when XYZ changed, and who did it. The configs of the individual hosts are managed by something like Puppet, and those manifests are also versioned. When you want to make a change, a CI tool like Jenkins pushes it to your test environment and lets you know if it worked before you touch production. Everything is orchestrated by scripts and APIs rather than someone logging into a console to type commands/click buttons.

Stuff like that. It's a continuum and very few companies fully operate that way. But it can have huge benefits in terms of speed and reliability of your operations as you move along it.
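
Concretely, spinning that same versioned template up in a test account is a few lines of boto3 (the stack name, template path, and "test" profile are placeholders):

code:
import boto3

session = boto3.Session(profile_name="test")  # separate test-account credentials
cfn = session.client("cloudformation")

with open("infra/web-tier.yaml") as fh:
    template_body = fh.read()

cfn.validate_template(TemplateBody=template_body)
cfn.create_stack(StackName="web-tier-test", TemplateBody=template_body,
                 Capabilities=["CAPABILITY_IAM"])
cfn.get_waiter("stack_create_complete").wait(StackName="web-tier-test")
print("stack is up; run your checks, then tear it down with delete_stack()")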

Gyshall
Feb 24, 2009

Had a couple of drinks.
Saw a couple of things.

Inspector_666 posted:

What does "infrastructure-as-code" mean? I find a lot of X-as-X monikers make almost no sense to me.

Basically, being able to run a script or program and have it deploy a virtual instance of whatever according to specifications, etc.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Docjowles posted:

Speed and agility through testing and automation. If you have a test environment where you can make a change, and run some sanity checks that it did what you want / didn't break poo poo, you'll have higher confidence that what you're doing is correct. Maybe that allows you to make your change control process less onerous because "email proposed change to Joe. Wait for weekly change control meeting. Get yes/no from Joe" becomes "run your tests. Do they pass? Auto-approved."
Another benefit that ties in here is that if you have repeatable, versioned environments, it becomes very easy to roll back even very complicated configurations. Change control becomes much less important when the impact of a critical change is that something might be down for 15 minutes at midnight on a weekend. Granted, certain kinds of software play much nicer with this than others.

stubblyhead
Sep 13, 2007

That is treason, Johnny!

Fun Shoe
I'm trying to automate some AWS administration with PowerShell and I've hit some roadblocks. I've been able to start remote sessions on instances by connecting directly to them, but I need to be able to do it through a proxy. PowerShell won't let you do this unless you use HTTPS, and starting an HTTPS listener requires a certificate. Before I go any further down this rabbit hole, is there any other option available for connecting to AWS instances with PowerShell? I saw that there's a plugin available, but the documentation was kind of overwhelming, so I'm not sure if it will let me do this or not. Any suggestions?

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

BangersInMyKnickers posted:

Is there a good way to figure out how much overhead is being consumed by the iSCSI or NFS software initiator on a 5.5 host? I'm averaging around 4k iops on NFS with bursts upwards of 6-7k across 3 hosts in the cluster. I will be doing a hardware refresh in the next few months and am wondering if I am getting anywhere near the point where I should be messing around with hardware initiators or iSCSI offload with fancy emulex/qlogic cards instead of the more typical broadcom/intel stuff. I was expecting an NFS and iSCSI object under the System performance counters but nothing there jumps out at me.

In my experience, Emulex/QLogic Ethernet 'storage adapter / offload' cards are hilariously poo poo. If your 24-core 2.5 GHz server can't spare the minuscule CPU overhead involved, those cards are going to fall over at that rate anyway.

Ahdinko
Oct 27, 2007

WHAT A LOVELY DAY
Yes, thank you VMware. I was curious about the throughput and performance of NSX, and it is good to know that Bridge gets 1 more receive than VXLAN; this chart has been very useful for me.
(I didn't crop any units/axis labels off, this is as it came from them)

chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

PCjr sidecar posted:

In my experience, Emulex/QLogic Ethernet 'storage adapter / offload' cards are hilariously poo poo. If your 24-core 2.5 GHz server can't spare the minuscule CPU overhead involved, those cards are going to fall over at that rate anyway.

I got stuck with trying to sort out the FCoE plans of somebody who had left my last company. He had armed all the servers with Intel X520-DA2s. VMware and bare-metal Linux both used software FCoE initiators with these cards and were completely unable to keep up with even modest traffic, so VMs would constantly kernel panic when FCoE on the hypervisor poo poo the bed and all their backing datastores dropped off. I ordered a couple of Emulex OneConnect cards that obviated the need for the software FCoE crap and they worked really well, right up until one of the cards crashed due to faulty firmware. At that point I'd had enough of trying to salvage the FCoE poo poo, sent everything back and bought a pair of MDSes and some Qlogic FC HBAs, and never thought about my SAN crashing again.

Moral of the story: it did make a difference for me, but don't do FCoE unless you have a Cisco engineer on-site and a company-funded expense account with the nearest boozeria.


1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!
Software FCoE I wouldn't bother with, but hardware FCoE works just fine. Bad firmware can bite you with any technology.
