StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
ok here's one that annoys me.

Can the "tps" field in an iostat -d be reasonably equated to the IOPS measurements you'd see in navianalyzer or 3par service reporter?

edit: same goes for sar -b
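
(for the curious: a rough way to line them up yourself, assuming the sysstat iostat -- tps in plain -d output is just completed reads + writes per second, so it should land in the same ballpark as whatever the array-side tool reports as IOPS for that LUN over the same interval)

code:
# extended per-device stats every 5 seconds; r/s + w/s = host-side IOPS,
# which is what the plain "tps" column sums up
iostat -dxk 5
# then compare that against navianalyzer / service reporter for the same LUN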

StabbinHobo fucked around with this message at 03:20 on Aug 31, 2008


StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
vendors almost always quote the iops the controller nodes could handle if they had an infinite number of spindles behind them. Read it as a ceiling, not a median or a floor.
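
(back-of-the-envelope sketch, assuming the usual ~180 iops rule of thumb per 15k spindle -- made-up round numbers, but it shows why the datasheet figure is a ceiling:)

code:
SPINDLES=40
PER_SPINDLE=180   # rough planning number for a 15k drive doing small random io
echo "spindle-limited: ~$((SPINDLES * PER_SPINDLE)) iops (vs. whatever the controller datasheet quotes)"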

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS

[screenshot of the 3par stats grapher, 1088x747]


magic yellow box

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
^^^ I wish. I harped on them a ton about exposing an snmp interface and mib, but no budge; they'd rather sell their "reporter" product

Catch 22 posted:

What? Is that a tool? Google turns up nothing for that. How did you make that chart?

it's the built-in stats grapher in the 3par gui client

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
kind of a corner case question...

pretend for a moment you've been stuck with a pretty decent SAN. We're talking raid10 across 40 15k spindles and 2GB of write cache (mirrored, 4gb raw), plenty of raw iops horsepower.

but you need nas

your application is specifically designed around a shared filesystem (gfs), and changing that would require lots of rewrite work. gfs, for various reasons, is not an option going forward. So it's nfs or something more exotic, and exotic makes me angry.

what product do you shim in between the servers and the san to transform nfs into scsi? Preferably under 30k with 4hr support; a failover pair would be nice too. Now, I know about the obvious "pair of rhel boxes active/passive'ing a gfs volume" (sketched below), but I also want to evaluate my alternatives. Extra special bonus points if it can do snapshots and replication.

Does netapp make a "gateway" model this cheap?

The ideal product would be two 1U boxes running some embedded nas software on an ssd disk, with ethernet and fibrechannel ports, all manageable through a web interface with *very* good performance analysis options.

Can you tell I wish sun would sell a 7001 gateway-only product real bad?
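
For reference, the "pair of rhel boxes" option boils down to something like this on the active node (a minimal sketch; paths and subnet are hypothetical, and real failover of the export/VIP needs cluster glue like heartbeat on top):

code:
# /etc/exports -- export the SAN-backed filesystem over nfs to the app servers
/export   10.0.0.0/24(rw,sync,no_root_squash)

# then:
exportfs -ra
service nfs start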

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
yea I stumbled across the v3020 the other day and it seemed perfect, until my san vendor said it'd be unsupported on both sides.

Right now I'm looking at Exanet, anyone got any opinions?

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
I just spent the last two days fighting with fdisk, kpartx, multipathd, multipath, {pv|vg|lv}scan, {pv|vg|lv}create, udev, device mapper, etc. Seriously, what a ridiculous rube goldberg contraption.
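
for posterity, the dance in question goes roughly like this on a fresh multipathed LUN (a sketch; map and vg names are hypothetical, and I'm skipping the fdisk/kpartx detour entirely):

code:
multipath -ll                          # confirm the LUN shows up as a map, e.g. mpath0
pvcreate /dev/mapper/mpath0            # use the whole multipath device as a PV
vgcreate vg-test /dev/mapper/mpath0
lvcreate -l 100%FREE -n lvol0 vg-test
mkfs.ext3 /dev/vg-test/lvol0
mount /dev/vg-test/lvol0 /mnt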

oh well, let's trade sysbench results

quote:

[root@blade1 mnt]# sysbench --test=fileio --file-num=16 --file-total-size=8G prepare
sysbench 0.4.10: multi-threaded system evaluation benchmark

16 files, 524288Kb each, 8192Mb total
Creating files for the test...

[root@blade1 mnt]# sysbench --test=fileio --max-time=300 --max-requests=1000000 --file-num=16 --file-extra-flags=direct --file-fsync-freq=0 --file-total-size=8G --num-threads=16 --file-test-mode=rndrw --init-rng=1 --file-block-size=4096 run
sysbench 0.4.10: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 16

Extra file open flags: 16384
16 files, 512Mb each
8Gb total file size
Block size 4Kb
Number of random requests for random IO: 1000000
Read/Write ratio for combined random IO test: 1.50
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
(last message repeated 15 times)
Done.

Operations performed: 482169 Read, 321436 Write, 0 Other = 803605 Total
Read 1.8393Gb Written 1.2262Gb Total transferred 3.0655Gb (10.463Mb/sec)
2678.55 Requests/sec executed

Test execution summary:
total time: 300.0154s
total number of events: 803605
total time taken by event execution: 4798.9296
per-request statistics:
min: 0.08ms
avg: 5.97ms
max: 1600.19ms
approx. 95 percentile: 23.35ms

Threads fairness:
events (avg/stddev): 50225.3125/358.34
execution time (avg/stddev): 299.9331/0.00


[root@blade1 mnt]# sysbench --test=fileio --max-time=300 --max-requests=1000000 --file-num=16 --file-extra-flags=direct --file-fsync-freq=0 --file-total-size=8G --num-threads=16 --file-test-mode=rndrw --init-rng=1 --file-block-size=8192 run
Block size 8Kb
2576.43 Requests/sec executed

[root@blade1 mnt]# sysbench --test=fileio --max-time=300 --max-requests=1000000 --file-num=16 --file-extra-flags=direct --file-fsync-freq=0 --file-total-size=8G --num-threads=16 --file-test-mode=rndrw --init-rng=1 run
Block size 16Kb
2427.95 Requests/sec executed

[root@blade1 mnt]# sysbench --test=fileio --max-time=300 --max-requests=1000000 --file-num=16 --file-extra-flags=direct --file-fsync-freq=0 --file-total-size=8G --num-threads=16 --file-test-mode=rndrw --init-rng=1 --file-block-size=32768 run
Block size 32Kb
2132.02 Requests/sec executed

[root@blade1 mnt]# sysbench --test=fileio --max-time=300 --max-requests=1000000 --file-num=16 --file-extra-flags=direct --file-fsync-freq=0 --file-total-size=8G --num-threads=16 --file-test-mode=rndrw --init-rng=1 --file-block-size=65536 run
Block size 64Kb
1716.62 Requests/sec executed




dell m610 running centos 5.3 (ext3 on lvm) connected to a xiotech emprise 5000 using a raid10 lun on a "balance" datapac

StabbinHobo fucked around with this message at 05:43 on Jun 10, 2009

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS

bmoyles posted:

What's the scoop on Coraid, btw? I tried looking into them a few years back, but they didn't do demo units for some reason so I passed.
It's kinda weird, right? hypothetically it sounds like an awesome product that should have taken off... but it didn't... but you don't see widespread reports of problems either. I mean, the industry's all wet in the pants about fcoe, while aoe is more or less the same thing and has been available for years.

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
this is probably a longshot... but is there any hope of a fishworks gateway? that is, the ability to run fishworks on some 1U nothing server and connect it to pre-existing san storage? I would *love* those reporting tools but already have a significant storage investment to work with.

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
I was hoping for sort-of netapp sales engineer feedback, but from one of you, cuz... yea, SEs.

I have a web content management system that produces a lot of flatfiles as pre-rendered components of web pages, most of them containing php code that is then executed by the web front ends (all rhel/centos env). It's about 100GB of data spread across about 5.5 million files and directories (uuuggghhh). There's a live copy of the data (high traffic, distributed across 8 servers via rsync shenanigans), a staging copy of the data (low traffic), and over a dozen development copies of the data (very low traffic).

It seems to me that moving these all to nfs mounts on a de-dup'd volume would be pure nirvana. Am I reading that right?

Live currently has the spindles to push about 1200 IOPS; I have no idea what that would translate to in nfs ops. What's the lowest-end netapp with the least amount of storage I could get away with? Ideally it would still support asynchronous-but-near-realtime (aka not a 5min cron job) replication of the live filesystem.

I've got a quote for a pair of FAS2020's with 15x300GB@10k, and I would really like to get the price down 30+%. It's from CDW, so I could probably start with VAR shopping, but can I whittle down the hardware?

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS

1000101 posted:

Honestly, the CPU in the 2020 is a little anemic; that said you can probably see some pretty good benefits. How frequently do those files get read? Are you actually performing 1200 IOPS or do you just have enough capacity to do it?

How far away is the replication target going to be? What connectivity? You might even get synchronous depending on how much data actually changes.

As far as VAR shopping, you can probably get a little bit more off the top since they're going to hope that you'll keep using them for other poo poo.
thanks

they tend to putter around at 40-80 iops each (so x8) most days. when you say the cpu is anemic, what would that mean in practice? de-dup would cause lags? someone running a find might hose everything else at once? Certainly the 15 10k spindles could keep up and then some (btw, no half-shelf options or anything?)

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
not sure; this is an old quote that we had decided against, but since the alternative has gone nowhere for four months I'm looking to bring it back up. Is SnapMirror their brand name for replication?

the two sites are about 70 miles apart and would have a 100mbps ethernet link between them.

would the 2050 with half a shelf cost less than a 2020 with a whole shelf?
edit: come to think of it, i want 12 spindles' worth of iops anyway, so I suppose it's moot.

StabbinHobo fucked around with this message at 23:11 on Jul 31, 2009

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
it's absolutely insane for anyone without two very niche aspects to their storage needs:
- can be slower than poo poo sliding up hill
- can afford a 67TB outage to replace one drive

I think I'd trust mogilefs's redundancy policies more than linux's software raid6.

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS

adorai posted:

since it's apparently commodity hardware, i think you could probably run opensolaris w/ raidz2 on it if you wanted to.

i wouldn't even bother. the power situation on those boxes isn't even remotely trustworthy; any "raid" needs to be between boxes.

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
I have a CentOS 5.3 host with a "phantom" scsi device, because the LUN it used to point to on the SAN got un-assigned from this host. Every time I run multipath it tries to create an mpath3 device-mapper name for it and complains that it's failing.

How do you get rid of /dev/sde if it's not really there?

edit: as usual I figure something out the moment I post about it. I ran echo "1" > /sys/block/sde/device/delete and it worked. Anyone care to tell me I just made a huge mistake?
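
(one extra step that seems worth doing first, so multipathd stops trying to rebuild the dead map -- a sketch, assuming the stale map is the mpath3 it keeps complaining about:)

code:
multipath -f mpath3                      # flush the stale multipath map
echo 1 > /sys/block/sde/device/delete    # then drop the phantom scsi device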

StabbinHobo fucked around with this message at 20:45 on Feb 5, 2010

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
I just got my new storage array set up, and I have a server available for benchmark testing. It's a dual dual-core w/32GB of ram and a dual-port hba. each port goes through a separate fc-switch one hop to the array. all links 4Gbps. Very basic setup.

I've created one 1.1TB RAID-10 LUN and one 1.7TB RAID-50 LUN; both have their own, but identical, underlying spindles.

Running centos 5.4, with the epel repo set up and sysbench, iozone, and bonnie++ installed.

I'm pretty familiar with sysbench and have a batch of commands to compare against a run on different equipment earlier in the thread, but not so much with bonnie and iozone. I'd be particularly interested in comparing notes with anyone who has an md3000.
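
In case anyone wants apples-to-apples numbers, this is roughly what I'm thinking of running for the other two (a sketch; file sizes picked at 2x ram so the 32GB of page cache doesn't swallow the test, paths hypothetical):

code:
# bonnie++: sequential + seek tests, skip the small-file creation pass
bonnie++ -d /mnt/bench -s 64g -n 0 -u root

# iozone: write/rewrite, read/reread, random read/write at 4k records
iozone -i 0 -i 1 -i 2 -r 4k -s 64g -f /mnt/bench/iozone.tmp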

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
Why are the numbers for sdb so different from the underlying dm-2, and where the hell does dm-5 come from?
code:
Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     1.00  0.00  1.80     0.00    22.40    12.44     0.00    0.78   0.78   0.14
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     1.00  0.00  1.80     0.00    22.40    12.44     0.00    0.78   0.78   0.14
sdb               0.00 73195.80  0.00 1169.20     0.00 595726.40   509.52   130.10  110.91   0.81  95.10
sdc               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdd               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sde               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00  0.00  2.80     0.00    22.40     8.00     0.00    0.50   0.50   0.14
dm-1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00  0.00 74363.40     0.00 594907.20     8.00  8295.81  111.20   0.01  95.10
dm-3              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-4              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-5              0.00     0.00  0.00 74363.40     0.00 594907.20     8.00  8295.92  111.20   0.01  95.10
code:
[root@localhost ~]# ls -l /dev/dm-*
brw-rw---- 1 root root 253, 2 Feb 17 07:22 /dev/dm-2
brw-rw---- 1 root root 253, 3 Feb 17 07:21 /dev/dm-3
code:
[root@localhost ~]# lvdisplay /dev/vg-r1/lvol0
  --- Logical volume ---
  LV Name                /dev/vg-r1/lvol0
  VG Name                vg-r1
  LV UUID                6f2nvg-c24t-lFez-sxWc-Hpmg-ZIlF-hIX4J5
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                1.06 TB
  Current LE             276991
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:5

[root@localhost ~]# vgdisplay vg-r1
  --- Volume group ---
  VG Name               vg-r1
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  2
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               1.06 TB
  PE Size               4.00 MB
  Total PE              276991
  Alloc PE / Size       276991 / 1.06 TB
  Free  PE / Size       0 / 0
  VG UUID               WuYvoz-G9M0-9J9I-p8lC-flxW-vHKb-FJAirR

[root@localhost ~]# pvdisplay /dev/dm-2
  --- Physical volume ---
  PV Name               /dev/dm-2
  VG Name               vg-r1
  PV Size               1.06 TB / not usable 4.00 MB
  Allocatable           yes (but full)
  PE Size (KByte)       4096
  Total PE              276991
  Free PE               0
  Allocated PE          276991
  PV UUID               9BZFZ8-2Pbk-zCKr-eya2-q2Rr-3z0U-7XR9mY

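
(partly answering my own question: the dm-N names map straight onto device-mapper minor numbers, so dm-5 is the lvol0 LV -- note the "Block device 253:5" line above -- and dm-2 is the PV underneath it, presumably the multipath map. quickest way I know to un-anonymize them, a sketch:)

code:
dmsetup ls            # every dm map with its (major, minor), e.g. vg--r1-lvol0  (253, 5)
dmsetup info -c       # same thing in columns, multipath maps and lvm lvs alike
ls -l /dev/mapper/    # the friendly names point at the same 253:N devices as /dev/dm-N

(and the sdb-vs-dm-2 gap looks like request merging: the dm devices count the raw 4k requests, while sdb reports them after the io scheduler has merged them, which is what the huge wrqm/s and the ~509-sector avgrq-sz suggest)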

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
another loving random filesystem-goes-read-only-because-you-touched-your-san waste of an afternoon.

I have four identically configured servers, each with a dual-port hba, each port going to a separate fc-switch, each fc-switch linking to a separate controller on the storage array. All four are configured identically with their own individual 100GB LUN using dm-multipath on centos 5.3.

I created two new LUNs and assigned them to two other hosts, completely unrelated to the four mentioned above.

*ONE* of the four blades immediately detects a path failure, then recovers, then detects a path failure on the other link, then recovers, then detects a failure on the first path again, and says it recovers, but somewhere in here ext3 flips its poo poo and remounts the filesystem readonly.

Now, if I try to remount it, it says it can't because the block device is write protected. However, multipath -ll says it's [rw].
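
(for anyone else who lands here from google: every block device in the stack -- the sd paths, the mpath map, the LV -- carries its own read-only flag, and mount only cares about the one it's sitting on, so it's worth checking where the kernel actually set it. a sketch, device names hypothetical, and ext3 will still want a fsck before going back to rw:)

code:
blockdev --getro /dev/sdb /dev/sdc /dev/mapper/mpath0   # 1 = marked read-only at that layer
blockdev --setrw /dev/mapper/mpath0                     # clear it on whichever layer reports 1
umount /mnt/data
fsck -y /dev/mapper/mpath0                              # let ext3 replay/repair after the ro flip
mount /mnt/data                                         # assumes an fstab entry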

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS

Zerotheos posted:

VAR. Tried to go directly through Netapp, but they kept referring us away. I couldn't get anything that even remotely resembled competitive pricing.

Oracle/Sun, on the other hand, told us not to bother with VARs and that they'd help us directly. Their direct pricing beat the hell out of the VAR.

Dell/Equalogic was incredibly eager about making a sale. Visited us 3 times, brought cookies, muffins, lunch, demo units. Unbelievably, they wouldn't give me any manual/user guide documentation on the PS6000 series. They were upset to the point of nearly being unprofessional when we decided against their product, even though we ended up buying all the server hardware for the project from them anyway.

Nexenta, I emailed with my first contact quote request at about 7:30 in the evening and they provided a quote within 3 minutes. Being a software solution with a very usable free version, it allowed me to build a full system in testing before I even bothered contacting their sales.
you're singing my tune and all, but i just spent 5 minutes on their website and it's such standard vendor-brochure crap I can't find any useful information

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
technically, he's right and you're dumb. hopefully, though, you're paraphrasing and he at least tried to explain it better than that.

compellent, 3par, xiotech, and presumably many others long ago stopped using an actual set of disks in a raid array as the backing store for a LUN. They maintain a giant datastructure in memory that maps virtual-LUN-blocks to actual-disk-blocks in a way that spreads any given LUN out over as many spindles as possible. So if you have a "4+1" RAID-5 setup on a 3par array, for instance, what that really means is that there are four 256mb chunklets of data per one 256mb chunklet of parity, but those actual chunklets are going to be spread out all over the place and mixed in with all your other LUNs, effectively at random. When one of the disks in the array dies, all of its lost blocks/chunklets are rebuilt in parallel on all the other disks in your array. It's not a 1-to-1 spindle-for-spindle rebuild like in a DAS RAID-5 situation, so your risk of the rebuild taking too long or hitting an error is drastically reduced.

The only reason these guys are gonna start selling a RAID-6 feature in their software is cuz they'll get sick of losing deals to customers who know just enough to be dangerous.
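
(back-of-the-envelope version of why that matters, with made-up but plausible numbers -- 450GB drives, ~70MB/s of sustained rebuild throughput per participating spindle, 40 disks in the array:)

code:
# classic raid-5: one hot spare soaks up the entire rebuild
echo "1-to-1 rebuild:       ~$(( 450 * 1024 / 70 / 60 )) minutes"
# chunklet-style: the lost chunklets get rebuilt across ~39 surviving disks in parallel
echo "distributed rebuild:  ~$(( 450 * 1024 / 70 / 39 / 60 )) minutes"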


edit: here's a picture that might help it click.

StabbinHobo fucked around with this message at 01:41 on Apr 2, 2010

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
Yea, that's almost exactly what I was saying. They're implementing it because it makes for good buzzphrase-packed marketing copy. Keep in mind, the design in question only reduces the risk by a massive degree; it's never fully gone. Therefore they can claim, straight-faced, to be making something more reliable with raid6, even if the improvement is so minute that it's almost deceptive to act like it matters.


edit: remember, what you think of as "raid 5" or "raid 6" is really just a wildly oversimplified, dumbed-down example of what actually gets implemented on raid controllers and in storage arrays. 3par, for instance, doesn't even write their own raid5 algorithm; they license it from a 3rd-party software developer and design it into custom ASIC chips. Think of it like "3G" for cellular stuff or "HDTV" for video: there are a lot of different vendors making a lot of different implementation decisions under the broad term.

StabbinHobo fucked around with this message at 02:19 on Apr 2, 2010

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS

adorai posted:

They use raidsets
no, they don't. you missed the point or have a lovely sales engineer, go talk to him again.

quote:

which means raid 5 is not reliable enough for anything other than archival in the enterprise.
wait, you lost me here. is that because in the enterprise it's important to spend money on buzzphrase technology you don't understand, or because it's ok to lose data from archival storage?

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS

Zerotheos posted:

The site is a bit confusing to navigate, but it's all there. Note that 3.0 just released, so some of the information still references 2.2.x. 3.0's main features are zfs dedupe, much improved hardware compatibility, and improvements to the web interface. I believe the hardware support is the same as the latest OpenSolaris release.

Pricing
Feature set
Documentation
3.0 enterprise trial
3.0 community edition (free, no support, 12tb raw limit, no plugins)

thank you

Spent a couple hours digging around on this. It seems to have a real community and install base out there, but it's small, and I'm not seeing many people using it as NAS for high-traffic webservers (small files, nfs). In fact, I saw one report of a guy getting much worse performance with nfs than cifs, and no one responded.

Plus, it's a shame their replication is block-device level, not filesystem level. That means you can't use the second-site copy read-only.

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS

EnergizerFellow posted:

The double-parity thing has more to do with spindles only having an unrecoverable read error rate of 1 in 10^14 or 10^15 bits, and the chance of corrupt data being quite high when you're rebuilding a multi-TB RAIDed dataset.

http://blogs.zdnet.com/storage/?p=162

by his own numbers, it takes 2TB drives to become a problem in a 7-disk raid5 array. so if you're in a 16-disk shelf of 1.5TB disks, you're more than fine. and that's assuming you're running a raid controller written by some undergraduate CS student for a homework assignment, which is what it would take to get the data loss he talks about in his example.

yes these are real hypothetical problems in our near term future

no they are not a real *actual* problem in our this-cycle storage purchasing decisions
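
the naive arithmetic behind the scare number, for anyone who wants to plug in their own shelf (a sketch; assumes errors are independent and the drive actually hits its spec'd rate, both of which are generous to the doom scenario):

code:
awk 'BEGIN {
    bits = 6 * 2e12 * 8    # 7-disk raid5 of 2TB drives: 6 survivors read end to end
    printf "P(URE during rebuild) at 1-per-1e14 bits: %.0f%%\n", (1 - exp(-bits / 1e14)) * 100
    printf "P(URE during rebuild) at 1-per-1e15 bits: %.0f%%\n", (1 - exp(-bits / 1e15)) * 100
}'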

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
I'm curious to hear other people's feedback here...

I can't think of any reason to actually use partition tables on most of my disks. Multipath devices are one example, but really, even if I just add a second vmdk to a VM... why bother with a partition table? Why mount /dev/sdb1 when you can just skip the whole fdisk step and mount /dev/sdb?

Why create a partition table with one giant partition of type lvm when you can just pvcreate the root block device and skip all that? What do the extra steps buy you besides extra steps and the potential to break a LUN up into parts (something I have no intention of ever doing)?
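
concretely, the two paths side by side (a sketch with a hypothetical /dev/sdb; lvm is perfectly happy taking the bare block device as a PV):

code:
# the long way: partition table, one giant partition of type 8e, then lvm on top
fdisk /dev/sdb        # n, p, 1, <defaults>, t, 8e, w
pvcreate /dev/sdb1

# the short way: no partition table at all
pvcreate /dev/sdb
vgcreate vgdata /dev/sdb
lvcreate -l 100%FREE -n data vgdata
mkfs.ext3 /dev/vgdata/data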

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS

Misogynist posted:

I got to see what happens when an IBM SAN gets unplugged in the middle of production hours today, thanks to a bad controller and a SAN head design that really doesn't work well with narrow racks.

(Nothing, if you plug it right back in. It's battery-backed and completely skips the re-initialization process. Because of this incidental behavior, I still have a job.)

it turns out there are some good reasons this poo poo costs so much

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
even if oracle isn't going to kill the stuff it bought for nefarious reasons, like they do with almost every other business they buy, they're going to do it simply by virtue of being a giant enterprise closed-source software company.

you can argue they didn't "kill" berkeleydb when they bought it, they just choked it off from the rest of the community. the same exact thing happened to innodb. the same thing will happen to mysql and opensolaris now.

fishworks may live on as a top-notch storage product, but from here on out it will be developed to compete with netapp and force you into the same dozens-of-sub-license-line-items way of doing business.


StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
hey everybody

an old coworker buddy of mine is at a new startup where they intend to produce "a lot" of video. They brought in a storage consultant who quoted 300k, and he's balking. I told him it's probably at least close to reasonable, but that I'd look around for a second-opinion consultant. Anybody know anyone good in the NYC area?
