Rhymenoserous
May 23, 2008

optikalus posted:

In EMC's defense, it was only the initial install guys who were garbage. We had some great techs come after the install to assist with new shelf installations, FLARE upgrades, etc.

However, they should be sending top guys to make sure the system gets up and running and is installed properly and professionally. First impressions are the most important. Wasting 60+ hours of the client's time for a simple installation should never happen.

The initial install team always belongs to the sales guy who sold you the crap, and they're rarely qualified CEs. I've found that when EMC sends a team out for big projects, they generally do it in fours. One guy to be the salesman. One guy to do the work. Two guys to discuss where they're going for lunch.

fiddledeedum
Jul 28, 2009

qutius posted:

I wanna work there.

It's a good gig.

We stood it all up in the space of 2-3 months because we followed the KISS principle. Do be warned that there is a well-known issue with the 62xx series that causes panics. Thankfully, we've had no service-impacting outages due to these panics, but it was a bit disconcerting to spend high seven figures and have filers panic. That said, NetApp support has been fantastic. I realize the $$$ probably influenced the level of support we have, but I wholeheartedly recommend them.

YOLOsubmarine
Oct 19, 2004

When asked which Pokemon he evolved into, Kamara pauses.

"Motherfucking, what's that big dragon shit? That orange motherfucker. Charizard."

madsushi posted:

More importantly, are you hiring??? I would love to just hold my body against that rack.

The Army Corps is a big NetApp shop and is currently hiring enterprise SAN admins. Gotta be willing to relocate to Portland, OR or Vicksburg, MS though.

Nomex
Jul 17, 2002

Flame retarded.

fiddledeedum posted:

It's a good gig.

We stood it all up in the space of 2-3 months because we followed the KISS principle. Do be warned that there is a well-known issue with the 62xx series that causes panics. Thankfully, we've had no service-impacting outages due to these panics, but it was a bit disconcerting to spend high seven figures and have filers panic. That said, NetApp support has been fantastic. I realize the $$$ probably influenced the level of support we have, but I wholeheartedly recommend them.

A well-known issue, you say? Please tell me. I just finished setting up 6 new 62xx HA pairs at work, so I'll want to avoid that bug.

The_Groove
Mar 15, 2003

Supersonic compressible convection in the sun
It really makes you appreciate storage hardware when 120 drives fail due to a bad SAS card/cable and you don't see a performance hit while rebuilding 32 at once. That and RAID6 totally saving your rear end from losing data. Close call this week...

Nomex
Jul 17, 2002

Flame retarded.
If you lost 120 drives because of a single bad card or cable, you're doing it wrong.

evil_bunnY
Apr 2, 2003

Nomex posted:

If you lost 120 drives because of a single bad card or cable, you're doing it wrong.
How were there not 2 paths to that enclosure?

The_Groove
Mar 15, 2003

Supersonic compressible convection in the sun
Not really. I guess I forgot to say that they were all manually failed in order to power down a pair of enclosures to swap I/O modules and cables to try and find out what actually went bad. But yeah, if one path from one controller going down causes drive failures, there are bigger problems.

bort
Mar 13, 2003

Don't sweat it, it was a RAID125.

spoon daddy
Aug 11, 2004
Who's your daddy?
College Slice

Nomex posted:

A well-known issue, you say? Please tell me. I just finished setting up 6 new 62xx HA pairs at work, so I'll want to avoid that bug.

Ask them about "PCI NMI" issues.

Nomex
Jul 17, 2002

Flame retarded.
Why not just tell me what the issue is?

The_Groove posted:

Not really. I guess I forgot to say that they were all manually failed in order to power down a pair of enclosures to swap I/O modules and cables to try and find out what actually went bad. But yeah, if one path from one controller going down causes drive failures, there are bigger problems.

Still, failing 120 drives for any reason shouldn't happen. If NetApp came back to me and told me I had to offline a bunch of drives to find a problem, I would laugh in their faces. There should be no scenario where you need to lose all connectivity to the disks at any time. That's why everything is redundant.

Nomex fucked around with this message at 16:33 on Feb 20, 2012

Megiddo
Apr 27, 2004

Unicorns bite, but their bites feel GOOD.
Are there any good 4-post 42u or 44u open frame racks that do not need to be bolted to a floor?

I figured that this or the virtualization thread would be the best place to ask.

Megiddo fucked around with this message at 20:52 on Feb 20, 2012

some kinda jackal
Feb 25, 2003

I'm a little surprised there's no Enterprise Specific Hardware misc megathread. Talking about enterprise class hardware in the general HW thread seems weird.

Muslim Wookie
Jul 6, 2005

Nomex posted:

Why not just tell me what the issue is?

Because a bug that serious is covered by NDA, distributed only to partners and not for customer consumption.

Megiddo posted:

Are there any good 4-post 42u or 44u open frame racks that do not need to be bolted to a floor?

APC racks are pretty decent and don't need bolting.

adorai
Nov 2, 2002

10/27/04 Never forget
Grimey Drawer

Nomex posted:

Why not just tell me what the issue is?


Still, failing 120 drives for any reason shouldn't happen. If NetApp came back to me and told me I had to offline a bunch of drives to find a problem, I would laugh in their faces. There should be no scenario where you need to lose all connectivity to the disks at any time. That's why everything is redundant.

I have a shelf that NetApp tells me I must power cycle in order for it to regain its shelf ID. gently caress that, it can be shelf P forever.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Megiddo posted:

Are there any good 4-post 42u or 44u open frame racks that do not need to be bolted to a floor?

I figured that this or the virtualization thread would be the best place to ask.
Most 4-post racks come with stabilization frames that you bolt to the front of the rack, stick out about 8 inches, and sit flush against the floor, but actually bolting a 4-post rack to the floor is kind of... 20th century.

Megiddo
Apr 27, 2004

Unicorns bite, but their bites feel GOOD.
I should have mentioned I'm looking for a rack with a pretty shallow depth - 29"-32" or so, which would definitely be more of a concern stability-wise than a 48" deep rack.

I did find a Tripp Lite 4-post open frame rack that comes with stabilization plates, so maybe I'll have to go with that.
http://www.tripplite.com/shared/other_downloads/submittal/SR4POST.pdf

I also found some racks that have caster wheel options, but I'm not sure if that would be stable on a rack with a shallow depth, especially with equipment on sliding rack rails.

Megiddo fucked around with this message at 07:09 on Feb 21, 2012

evil_bunnY
Apr 2, 2003

Megiddo posted:

I also found some racks that have caster wheel options, but I'm not sure if that would be stable on a rack with a shallow depth, especially with equipment on sliding rack rails.
That sounds like a painful way to kill yourself, TBH.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
It's perfectly fine as long as you're loading the rack with the heaviest stuff on the bottom. Unless you're stuffing towers onto a shelf, sliding rails are the only reason to even have four posts.

Nomex
Jul 17, 2002

Flame retarded.

marketingman posted:

Because a bug that serious is covered by NDA, distributed only to partners and not for customer consumption.


Fair enough. Thanks for the tip though.

Rhymenoserous
May 23, 2008

Misogynist posted:

It's perfectly fine as long as you're loading the rack with the heaviest stuff on the bottom. Unless you're stuffing towers onto a shelf, sliding rails are the only reason to even have four posts.

[ASK] Me about the guy who came before me putting UPS + Batteries in the middle of the rack (And stacked on top of the rack... and on the floor....).

I keep expecting to push up a ceiling tile to find a UPS hanging from a wire harness up there.

Hyrax
Jul 23, 2004

I'm the Goon in the OP. Dispatch your messenger forthwith.
Disclaimer: I'm not a storage guy (I do VMware stuff mostly), but I do have to run a couple of EqualLogic setups.

We added a new member to a four-member pool, and after it had initialized its RAID setup we started seeing tons of closed sessions from our iSCSI initiators on all of our ESX hosts. The message ends with "Volume location or membership changed". Within three seconds the initiator picks the connection back up and is fine for anywhere between fifteen minutes and an hour, then it does it again. Those few seconds are enough time to piss off the application these servers are running and essentially make them unusable, which goes against our SLAs on that app. EqualLogic support suggested pulling that new member from the group, which didn't fix it immediately; it took another six or so hours after that member had evacuated for the group to stabilize and stop throwing errors.

EqualLogic wanted to blame the NIC config on some of our ESX hosts (a pair of teamed 10G NICs with all traffic VLAN'd out), but we'd been running fine with that config for months without any problems. There are also a couple of ESX hosts that have separate 1Gb NICs for iSCSI traffic, and a lone Windows host running Veeam that had connection issues as well, so I highly doubt the network config is the issue.

So, does anyone have an idea what to make of the "Volume location or membership changed" message? That reads to me like the group was moving data around and pissed off the initiators, but I'm just pulling that interpretation out of my rear end. Any ideas, or things I should check on that new member before I try to put it back in? I need the capacity sooner rather than later, but I don't need another day that blows up our SLAs for the month.
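
Something like this quick Python sketch can put numbers on those reconnect gaps by scanning a host log for drop/restore pairs. The log path and the matched message fragments here are assumptions, so adjust them to whatever your hosts actually log.

# Hypothetical log-gap checker: reports how long each iSCSI session stayed down.
# The path and the matched substrings are assumptions -- adjust for your hosts.
import re
from datetime import datetime

LOG = "/var/log/vmkernel.log"                        # assumed log location
DOWN = r"iscsi.*[Cc]onnection.*(lost|dropped)"       # assumed drop fragment
UP = r"iscsi.*[Cc]onnection.*(restored|recovered)"   # assumed restore fragment
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")

def timestamp(line):
    m = TS.match(line)
    return datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S") if m else None

down_at = None
with open(LOG) as f:
    for line in f:
        ts = timestamp(line)
        if ts is None:
            continue
        if re.search(DOWN, line):
            down_at = ts
        elif re.search(UP, line) and down_at is not None:
            print(f"{down_at}  session down for {(ts - down_at).total_seconds():.0f}s")
            down_at = None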

three
Aug 9, 2007

i fantasize about ndamukong suh licking my doodoo hole

Hyrax posted:

Disclaimer: I'm not a storage guy (I do VMware stuff mostly), but I do have to run a couple of EqualLogic setups.

We added a new member to a four-member pool, and after it had initialized its RAID setup we started seeing tons of closed sessions from our iSCSI initiators on all of our ESX hosts. The message ends with "Volume location or membership changed". Within three seconds the initiator picks the connection back up and is fine for anywhere between fifteen minutes and an hour, then it does it again. Those few seconds are enough time to piss off the application these servers are running and essentially make them unusable, which goes against our SLAs on that app. EqualLogic support suggested pulling that new member from the group, which didn't fix it immediately; it took another six or so hours after that member had evacuated for the group to stabilize and stop throwing errors.

EqualLogic wanted to blame the NIC config on some of our ESX hosts (a pair of teamed 10G NICs with all traffic VLAN'd out), but we'd been running fine with that config for months without any problems. There are also a couple of ESX hosts that have separate 1Gb NICs for iSCSI traffic, and a lone Windows host running Veeam that had connection issues as well, so I highly doubt the network config is the issue.

So, does anyone have an idea what to make of the "Volume location or membership changed" message? That reads to me like the group was moving data around and pissed off the initiators, but I'm just pulling that interpretation out of my rear end. Any ideas, or things I should check on that new member before I try to put it back in? I need the capacity sooner rather than later, but I don't need another day that blows up our SLAs for the month.

What firmware are you running?

Internet Explorer
Jun 1, 2005

Well outside of my experience realm, but how are your NICs teamed? Can you test it on an ESX host without teamed NICs?

Hyrax
Jul 23, 2004

I'm the Goon in the OP. Dispatch your messenger forthwith.

three posted:

What firmware are you running?

All the controllers are on 5.1.1-H2.

Internet Explorer posted:

Well outside of my experience realm, but how are your NICs teamed? Can you test it on an ESX host without teamed NICs?

They're teamed active/active, load balanced based on originating virtual port ID, with failover on link status only. Switch notification and failback are both on.

Our Veeam box talks to this same storage pool and doesn't have teamed NICs for the iSCSI link, and it had problems as well, so it doesn't look like the teaming was the problem.

Also, there's a second storage pool that runs our test servers and backup stuff. It's a single member off by itself in its own pool and didn't have any problems at all, even though it runs on the same iSCSI network. The problems were isolated to the volumes that were being moved onto the new member.
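
For what it's worth, if software iSCSI port binding is in use, the usual vSphere guidance is one active uplink per bound vmkernel port with the other NICs set to unused. Here's a rough sanity-check sketch in Python; the dictionary layout and the example values are made up for illustration, not pulled from any vendor tool.

# Hypothetical sanity check for uplink settings on bound software-iSCSI vmkernel
# ports. The dict layout and example values below are made up for illustration.
def check_port(name, cfg):
    problems = []
    if len(cfg["active"]) != 1:
        problems.append(f"{name}: expected exactly 1 active uplink, found {len(cfg['active'])}")
    if cfg["standby"]:
        problems.append(f"{name}: standby uplinks present ({', '.join(cfg['standby'])}), should be unused")
    return problems

# Example: two bound vmkernel ports dumped from a host (made-up values).
ports = {
    "vmk2": {"active": ["vmnic2"], "standby": []},
    "vmk3": {"active": ["vmnic2", "vmnic3"], "standby": ["vmnic4"]},
}

for name, cfg in ports.items():
    for problem in check_port(name, cfg):
        print(problem)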

EoRaptor
Sep 13, 2003

by Fluffdaddy

Hyrax posted:

Disclaimer: I'm not a storage guy (I do VMware stuff mostly), but I do have to run a couple of EqualLogic setups.

We added a new member to a four-member pool...

What EQL devices are members of this group? Certain devices have limits on their group size.

Internet Explorer
Jun 1, 2005

I haven't played with their new load balancing. Is it possible to disable that to see if the problem persists? Other than that, I would just update to the newest firmware. I know the load balancing is fairly new and that firmware is already six months old.

Hyrax
Jul 23, 2004

I'm the Goon in the OP. Dispatch your messenger forthwith.
The group currently has two 5000XVs, a 6000XV, a 6010XV, and a 6510E. I was adding a 6510X.

It didn't seem to have a problem with the add itself, only with the initiators flipping their poo poo after that member was added.

Hyrax
Jul 23, 2004

I'm the Goon in the OP. Dispatch your messenger forthwith.

Internet Explorer posted:

I haven't played with their new load balancing. Is it possible to disable that to see if the problem persists? Other than that, I would just update to the newest firmware. I know the load balancing is fairly new and that firmware is already six months old.

The load balancing messages manifest with a different error from what I've seen; they'll typically have "Load balancing request was received on the array" in the log. I'll see about upgrading the firmware and playing with disabling pool load balancing in our next outage window.

Internet Explorer
Jun 1, 2005

OK, like I said, it's outside my experience. You have more members than I ever had, and we stopped using them before load balancing between members was added. I do remember a message similar to that with our XenServer hosts, and it was related to the NIC teaming between XenServer, our switches, and our EqualLogic SANs.

EoRaptor
Sep 13, 2003

by Fluffdaddy

Hyrax posted:

The group currently has two 5000XVs, a 6000XV, a 6010XV, and a 6510E. I was adding a 6510X.

It didn't seem to have a problem with the add itself, only with the initiators flipping their poo poo after that member was added.

Okay, all of these devices should be okay for that number of members in a group (only the older PS400 and PS4000 and the newer PS4100 arrays are impacted by the limit anyway).

I don't know if it's teaming, since as you say other non-teamed machines are affected, but I do know EQL recommends multipath round robin (or their own module if you have VMware Enterprise). It could also be the switches being unable to keep up with the amount of redirection from the group IP that's going on.

three
Aug 9, 2007

i fantasize about ndamukong suh licking my doodoo hole
EQL also has limits on the number of connections in a group and in a storage pool. I don't recall what they are, but I'd look into that. I'm assuming adding a member equals adding connections.

BnT
Mar 10, 2006

Hyrax posted:

So, does anyone have an idea what to make of the "Volume location or membership changed" message? That reads to me like the group was moving data around and pissed off the initiators, but I'm just pulling that interpretation out of my rear end. Any ideas, or things I should check on that new member before I try to put it back in? I need the capacity sooner rather than later, but I don't need another day that blows up our SLAs for the month.

Assuming you're using "Load Balancing" on your EqualLogic group, these sound like asynchronous logouts coming from the array. You'd expect to see these when adding or removing members from the group, but they shouldn't be disruptive. The array spreads the data across the members, and when some data migrates to a new member, the iSCSI sessions must follow in order to access the same blocks. To make this happen, the array notifies the initiator of a logout, and the initiator reconnects to the group IP and gets load balanced to the correct member. But the big question is why it's taking that long to re-establish connections.

What version of vSphere? Are you using the EqualLogic multipathing module? On the storage adapters, are you seeing multiple active paths? Are you nearing the connection limit (1024 iSCSI connections) of the storage pool?
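
A quick back-of-the-envelope way to see how close a pool sits to that 1024-connection ceiling is sketched below in Python; the volume, host, and session counts are placeholder inputs, not numbers from this setup.

# Rough iSCSI connection-count estimate against the pool limit cited above.
# Volume/host/session counts are placeholders -- substitute your own.
POOL_CONNECTION_LIMIT = 1024

volumes = [
    # (volume, hosts with it mounted, iSCSI sessions per host)
    ("vmfs_prod_01", 12, 2),
    ("vmfs_prod_02", 12, 2),
    ("veeam_repo",    1, 2),
]

total = sum(hosts * sessions for _, hosts, sessions in volumes)
print(f"estimated connections: {total} of {POOL_CONNECTION_LIMIT} "
      f"({total / POOL_CONNECTION_LIMIT:.0%} of the pool limit)")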

ozmunkeh
Feb 28, 2008

hey guys what is happening in this thread
Is there anything wrong with NL-SAS drives for a small VMware installation (max 3 hosts, typical AD, Exch, SQL environment)?

We're looking seriously at an EqualLogic PS4100X with 24 x 600GB 10K NL-SAS drives.

The capacity is fine, and the estimated IOPS look more than adequate, but I'm not entirely convinced about the bit where everything is stored on a bunch of 2.5" SATA drives. It doesn't feel quite right. Anyone got one of these in production?

Hyrax
Jul 23, 2004

I'm the Goon in the OP. Dispatch your messenger forthwith.

BnT posted:

Assuming you're using "Load Balancing" on your Equallogic group, these sound like asynchronous logouts coming from the array. You should be seeing these when adding or removing members from the group, but it shouldn't be disruptive. The array spreads the data across the members and when some data migrates to a new member the iSCSI sessions must follow in order to access the same blocks. In order to make this happen, the array notifies the initiator of a logout, and the initiator reconnects to the group IP and gets load balanced to the correct member. But the big question is why it's taking that long to re-establish connections.

What version vSphere? Are you using the Multipathing Equallogic Module? On the storage adapters are you seeing multiple active paths? Are you nearing the connection limits (1024 iSCSI connections) of the storage pool?

ESX is 4.1 Update 1, and we're getting ready to roll out Update 2. We're at 600-ish iSCSI connections, and that's something we keep a close eye on, knowing the limit. We're excited to get on ESX 5 so we can get away from the 2TB datastore limit. Going to 4+ TB datastores will be good for our iSCSI connection count.

I'm pretty sure that we've got the EQL MPIO stuff installed, but I didn't set up the cluster to begin with. I'll check that tomorrow when I'm in the office.

Intraveinous
Oct 2, 2001

Legion of Rainy-Day Buddhists

ozmunkeh posted:

Is there anything wrong with NL-SAS drives for a small VMware installation (max 3 hosts, typical AD, Exch, SQL environment)?

We're looking seriously at an Equallogic PS4100X with 24 x 600GB 10K NL-SAS drives.

The capacity in fine, and the estimated IOPS look more than adequate but I'm not entirely convinced about the bit where everything is stored on a bunch of 2.5" SATA drives. It doesn't feel quite right. Anyone got one of these in production?

You say "stored on a bunch of 2.5" SATA drives", but then you say you're looking at an array with 600GB 10K NL-SAS. Unless things have changed a lot, 10K is usually not nearline, and if it's SAS, it's SAS...

Most configs I've seen with nearline drives use 7.2K high-capacity drives, e.g. 1TB+.
They can be SAS or SATA, with SAS giving you better command queueing in some workloads and the possibility of being dual-pathed on the SAS backend. As far as I know, the fastest 2.5" 600GB drives you'll find are 10K; I've only seen 15K drives up to 146GB or 300GB in SFF (2.5") form factors.

I didn't really answer your question, but I wanted to make sure I knew what you were talking about before attempting an answer.

Internet Explorer
Jun 1, 2005

Intraveinous posted:

You say "stored on a bunch of 2.5" SATA drives", but then you say you're looking at an array with 600GB 10K NL-SAS. Unless things have changed a lot, 10K is usually not nearline, and if it's SAS, it's SAS...

Most configs I've seen with nearline drives use 7.2K high-capacity drives, e.g. 1TB+.
They can be SAS or SATA, with SAS giving you better command queueing in some workloads and the possibility of being dual-pathed on the SAS backend. As far as I know, the fastest 2.5" 600GB drives you'll find are 10K; I've only seen 15K drives up to 146GB or 300GB in SFF (2.5") form factors.

I didn't really answer your question, but I wanted to make sure I knew what you were talking about before attempting an answer.

To follow up on this post: if they are 7.2K drives, then your answer is, as always, "it depends." You need to look very closely at your IOPS and your read:write ratios. If you have a higher write ratio and you put it all on RAID 5/50, then you might run into trouble.

[Edit: Also, what version of Exchange? 2003 uses a ton of IOPS; 2007/2010 use comparatively little.]
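
The rough math behind that warning, as a sketch: front-end IOPS get multiplied on the back end by the RAID write penalty, so a write-heavy workload on RAID 5/50 chews up far more disk IOPS than the same workload on RAID 10. The penalties below are the usual textbook values and the workload numbers are placeholders.

# Back-of-the-envelope disk IOPS estimate using textbook RAID write penalties.
# The workload numbers are placeholders, not figures from this thread.
WRITE_PENALTY = {"RAID10": 2, "RAID5/50": 4, "RAID6": 6}

def backend_iops(front_end_iops, read_fraction, raid_level):
    reads = front_end_iops * read_fraction
    writes = front_end_iops * (1 - read_fraction)
    return reads + writes * WRITE_PENALTY[raid_level]

# Example: 2000 host IOPS at an 80/20 read:write split.
for raid_level in WRITE_PENALTY:
    print(raid_level, round(backend_iops(2000, 0.80, raid_level)), "disk IOPS")

Flip that toward a write-heavy mix and the RAID 5/50 figure balloons, which is exactly the trouble being described.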

evil_bunnY
Apr 2, 2003

Intraveinous posted:

Most configs I've seen with nearline drives use 7.2K high-capacity drives, e.g. 1TB+.
They can be SAS or SATA, with SAS giving you better command queueing in some workloads and the possibility of being dual-pathed on the SAS backend. As far as I know, the fastest 2.5" 600GB drives you'll find are 10K; I've only seen 15K drives up to 146GB or 300GB in SFF (2.5") form factors.
Also, NL-SAS drives (as in, a native SAS interface, not a bridge on top of SATA) have better error handling (plus SCSI Protection Information) and lower controller latency.

ozmunkeh posted:

Is there anything wrong with NL-SAS drives for a small VMware installation (max 3 hosts, typical AD, Exch, SQL environment)?

We're looking seriously at an EqualLogic PS4100X with 24 x 600GB 10K NL-SAS drives.
The PS4100E comes with NL-SAS; the PS4100X uses normal 2.5" SAS drives. SATA is a different story, but NL-SAS is SAS, just with ATA mechanics (platters, heads, RPM).
Tell us how many users are on the Exchange setup and what the IO load is on the SQL side.

Internet Explorer posted:

[Edit: Also, what version of Exchange? 2003 uses a ton of IOPS; 2007/2010 use comparatively little.]
This is pretty important. 2007 was less of an IO pig by an order of magnitude, and 2010 pretty much did it again.

Hyrax posted:

I'm pretty sure that we've got the EQL MPIO stuff installed, but I didn't set up the cluster to begin with. I'll check that tomorrow when I'm in the office.
If you're getting redirects, it's not installed: it changes the connection handling from group round-robin to targeting the node it now knows has the blocks (well, technically, the page) it wants.

evil_bunnY fucked around with this message at 12:54 on Feb 24, 2012

Dilbert As FUCK
Sep 8, 2007

by Cowcaster
Pillbug
I haven't toyed around much with things other than DRBD, but I've started looking at rsync. From my light googling it looks like it's just file-level instead of block-level. It also looks a good deal easier to set up. The only question I have is how it plays against an active/passive DRBD setup for an ESXi HA environment.
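
On the file-level vs block-level point: rsync just copies changed files on whatever schedule you give it, so you get point-in-time copies rather than a continuously mirrored device, which is what DRBD provides and what an active/passive ESXi datastore really wants. A minimal sketch of the rsync approach follows; the paths, host, and interval are made up.

# Minimal periodic rsync loop (file-level, point-in-time copies), to contrast
# with DRBD's continuous block-level mirroring. Paths/host/interval are made up.
import subprocess
import time

SRC = "/srv/exports/"                 # hypothetical source directory
DST = "standby-host:/srv/exports/"    # hypothetical standby target over SSH
INTERVAL = 300                        # seconds between passes

while True:
    # -a preserves permissions/timestamps; --delete mirrors deletions too
    subprocess.run(["rsync", "-a", "--delete", SRC, DST], check=False)
    time.sleep(INTERVAL)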

ozmunkeh
Feb 28, 2008

hey guys what is happening in this thread
I guess I have a fundamental misunderstanding of what NL-SAS is. I assumed it was marketing speak for what is essentially a SATA drive, hence my "bunch of SATA drives" comment.

Also, I should say this is all coming off a conference call and I don't have the exact quote in my hands yet, so my scribbled notes may not be 100% accurate. I know that the drives were 600GB 10K, so I guess they're SAS after all.

As far as load goes: a max of 80 Exchange 2007 users, with a dozen Dynamics GP users on SQL 2005 and a half dozen on a different app on SQL 2008. Aggregate R/W across the whole environment right now is 80/20.

  • Reply