in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Ugh, our NetApp sales rep and reseller aren't winning any points in my book. It's not 2007 anymore, guys; this poo poo doesn't fly.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Vulture Culture posted:

Has anyone pushed the COMSTAR iSCSI stack in Solaris/derivatives to its breaking point? How many concurrent iSCSI sessions does it tolerate on a single target before performance degradations or bad things start happening? Are there any bottlenecks I might expect to hit before tens of gigabits of network I/O becomes the relevant one?

Not with iSCSI and not with Actual Solaris, but the Illumos network drivers for 10 gigabit+ NICs are pretty hit or miss.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Vulture Culture posted:

Do you have a naughty/nice list, or is this mostly information relayed from others? I'd be willing to entertain Linux or BSD as the target server if it got real weird, but I still trust COMSTAR more than the LIO or the new BSD iSCSI native target.

Intel's is surprisingly bad (at least on the X520.) Mellanox never got beyond 10G on the ConnectX-2. Chelsio was recommended elsewhere; we've been using the T520-SO-CR and it seems to work reasonably well.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Vulture Culture posted:

Are you using optics, or direct-connect SFP+ twinax? I'm a little touchy around twinax after I spent 3 weeks chasing Emulex firmware bugs on my last attempt, but I need to keep costs way down on this project.

Optics, but the official Chelsio optics are cheap. We've used twinax elsewhere; it's OK, but it's occasionally a crapshoot going switch->host or between different vendors' switches. Also, gently caress Emulex.

If it's not something that's going to get you fired if it fails, here's an obligatory plug for http://www.fs.com . $16 for a 10GBASE-SR optic.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

No experience with their DACs. Our networking team stocks OEM DACs but fs.com optics; I'm not sure if that's just because DACs are a low-volume product for them or a result of a specific experience with a fs.com DAC. I've had fs.com cancel an order for a QSFP active optic cable in a weird configuration because they couldn't validate it in their test lab and hadn't shipped any in that configuration, so they're thinking about compatibility.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Anybody running OEM-rebranded E-Series enclosures? Any tips?

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

H110Hawk posted:

Yeah, I get paying for magic. Do you know what kind of prices we're looking at?

I realized I never expanded on the application: The multicast group is receiving metrics from several thousand servers, hundreds of metrics per second per server. These are stored both on a per-server basis and several aggregated dimensions, all in RRDs. (So 1000 servers in a role emit say 300 metrics/second which are stored both in 3,000 rrds and as 300 rrd aggregates (times some number of dimensions, depending.))

I hope you've turned rrdcached on. (The real answer is to switch to a modern TSDB.)
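Napkin math on why the caching matters, using the file counts quoted above; the update interval and flush window are assumptions for illustration, not your real numbers:

code:

# Rough, illustrative estimate of RRD write load with and without rrdcached
# batching. All inputs are assumptions, not measurements.

rrd_files = 3_000 + 300        # per-server RRDs plus aggregates, per the numbers above
update_interval_s = 1          # assume each RRD gets a new datapoint every second
flush_window_s = 300           # assume rrdcached flushes dirty values every 5 minutes

# Without rrdcached: every update is a small read-modify-write against its own file.
naive_write_ops = rrd_files / update_interval_s

# With rrdcached: updates are buffered in memory and each file gets roughly one
# coalesced write per flush window.
cached_write_ops = rrd_files / flush_window_s

print(f"~{naive_write_ops:,.0f} random write ops/s without rrdcached")
print(f"~{cached_write_ops:,.0f} coalesced write ops/s with rrdcached")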

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Thanks Ants posted:

30TB is a poo poo ton more data than people realise.

:allears:

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Whenever we've gone down the iX/Nexenta/Tintri/(other ZFS-derivative) route for quotes, they always seem to be priced at 10% less than everyday NetApp instead of the commodity-plus-20% that I'd prefer to see. That may be an unrealistic expectation; YMMV.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Vulture Culture posted:

They may also legitimately have no idea how much their storage is costing them in terms of time to complete a run. They might just think it's supposed to be that slow, and they would never say anything unless a known quantity starts taking significantly longer to finish. It's worth doing an analysis to see the access patterns over time (thankfully easy on Linux with minimal setup).

I'd be interested in a thumbnail sketch (which tools?) of how you're doing that analysis, if you wouldn't mind sharing.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Vulture Culture posted:

I haven't worked there in a good amount of time, but sure. I'm going to ignore general stuff on managing storage performance like keeping an eye on your device queue sizes, because there's plenty of good information out there already on that.

We ran a couple of different storage systems. BlueArc (now Hitachi HNAS), IBM SONAS (now hopefully dead in a loving fire), Isilon, and a bunch of one-off Linux/OpenIndiana/FreeBSD boxes, so we had a mixture of different vendor-specific tools, standard Linux utilities, and a bunch of client-side analysis we were doing. Understanding the user workloads was the first step. We dealt with a lot of the same applications between research groups (ABySS, Velvet, etc.), so we had a pretty good idea of how they were supposed to perform with different file formats. If someone's runs were taking abnormally long on the cluster, we'd start looking at storage latencies, access patterns, etc. Some file formats like HDF5 do well with random access, while others like FASTA generally work better when the whole thing is slurped into memory sequentially (though random access is possible if it's preprocessed with .fai or FASTQ indexes first).

Most storage vendors don't give you very good insight into what the workload is doing on disk. Where we could, we relied a lot on ftrace and perf_events to see what was happening. Brendan Gregg has an awesome utility called iolatency that can give you a latency histogram sort of like VMware's vscsiStats. This is mostly useful once you've already reached the saturation point where your latencies plateau, and you want to figure out what's going on in the workload underneath.

For some really insidious cases, we ended up trawling /proc/self/mountstats on each compute cluster node to get per-mount counters on each NFS operation. I actually wrote a Diamond mountstats collector to pull these numbers at 60-second intervals and forward them onto Graphite where they could be filtered, aggregated, and graphed -- this actually gave us a lot of really useful heuristic data on stuff like "how often is this job opening files?" and "does this application stream reads, or does it slurp everything in at start?" (We actually spotted a regression in SONAS's GPFS file locking behavior by diving deep into the performance characteristics associated with each specific NFS operation.)

Thanks; we're already pulling per-client NFS stats into Graphite but per-mount will be more useful; we've been looking server-side (perf, network, etc.) for usage information. Lustre's generic per-client stats aren't bad but I want to start using the jobstats feature to tag each in-flight IO with a job ID.
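If anyone else wants to go the per-mount route, here's a minimal sketch of scraping the per-op counters out of /proc/self/mountstats; the field positions assume the Linux NFS client's statvers=1.1 layout, and a real collector (like the Diamond one mentioned above) would ship deltas to Graphite instead of printing cumulative totals:

code:

# Minimal sketch: per-op NFS counters from /proc/self/mountstats.
# Assumes the statvers=1.1 per-op layout: ops, transmissions, major timeouts,
# bytes sent, bytes received, queue time, RTT, execute time (all cumulative).

from collections import defaultdict

def parse_mountstats(path="/proc/self/mountstats"):
    stats = defaultdict(dict)          # {mountpoint: {op: counters}}
    mountpoint = None
    in_per_op = False
    with open(path) as f:
        for line in f:
            stripped = line.strip()
            if line.startswith("device "):
                in_per_op = False
                mountpoint = None
                if " with fstype nfs" in line:
                    mountpoint = line.split(" mounted on ")[1].split(" with fstype")[0]
            elif mountpoint and "per-op statistics" in stripped:
                in_per_op = True
            elif in_per_op and ":" in stripped:
                op, rest = stripped.split(":", 1)
                fields = rest.split()
                if len(fields) >= 8 and all(v.isdigit() for v in fields):
                    stats[mountpoint][op] = {
                        "ops": int(fields[0]),
                        "rtt_ms": int(fields[6]),      # cumulative round-trip time
                        "execute_ms": int(fields[7]),  # cumulative execute time
                    }
    return stats

if __name__ == "__main__":
    for mount, ops in parse_mountstats().items():
        busiest = sorted(ops.items(), key=lambda kv: -kv[1]["ops"])[:5]
        for op, c in busiest:
            avg_rtt = c["rtt_ms"] / c["ops"] if c["ops"] else 0.0
            print(f"{mount} {op}: {c['ops']} ops, avg RTT {avg_rtt:.2f} ms")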

Brendan Gregg's book is good.

quote:

This is also true when you're looking at Gluster, Lustre, Ceph, etc. I don't get it. The offering isn't bad, but these companies never provide you the global logistics and support of the big vendors, so people buying must either be clueless or pressing them down to discount levels way lower than I was ever able to get.

In our experience, if you look at GB/s/$, Lustre is very difficult for NetApp/EMC to come close to, for sufficiently large values of GB/s. That's assuming your workload will actually run well on Lustre (or any other clustered FS); while it is POSIX-compliant, it really isn't a general-purpose file system. Highly recommend a partner that has experience with Lustre and has a contract with Intel for L2/L3 support.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Combat Pretzel posted:

So, RDMA on Intel, i.e. iWARP, are there actually products that support it? Google leads me to believe that there are 10GbE products that do RDMA, but per Intel, none of their adapters supports it.

Nope; I think they acquired a 10G NIC manufacturer whose cards did it, but it didn't make it into any of the mainline Intel NICs. IIRC Chelsio's the only one still shipping iWARP. In my limited (non-iWARP) experience, they were OK? RoCEv2 will do most of what iWARP did unless you really don't like Mellanox for some reason.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

For scale-out performance, Isilon is fine, if a bit pricey. If you have petabyte-scale needs it’d be on my shopping list. Their primary (?) engineers left and founded Qumulo. Similar arch. It’s a startup, so YMMV.

Panasas is fine, also kinda pricey. See them in enterprise HPC usually.

It might be worthwhile to talk to IBM about Spectrum Scale; they’ve been surprisingly competitive on some projects.

DDN has some options that aren’t bad.

People are doing interesting stuff with Ceph but that’s still pretty green. RH’s support licensing is kinda high.

Nearly all of these are built around large block IO. Most parallel file systems do poorly on metadata performance.
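If you want to sanity-check that last point on whatever you end up with, here's a quick-and-dirty sketch (nowhere near a proper mdtest run; the path and file count are placeholders):

code:

# Crude metadata microbenchmark: time create/stat/unlink of many small files
# on a mount point. Path and file count are placeholders.

import os
import time

target_dir = "/mnt/scratch/mdbench"    # placeholder: point at the FS under test
n_files = 10_000

os.makedirs(target_dir, exist_ok=True)

def timed(label, fn):
    start = time.monotonic()
    for i in range(n_files):
        fn(os.path.join(target_dir, f"f{i}"))
    rate = n_files / (time.monotonic() - start)
    print(f"{label}: {rate:,.0f} ops/s")

timed("create", lambda p: open(p, "w").close())
timed("stat", os.stat)
timed("unlink", os.unlink)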

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

NVMe-oF is pretty cool, and there are some fun things being built on top of it.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Happiness Commando posted:

I have a Dell MD 3820i full of SSDs on a 10 gig network and all of my benchmarks have random writes maxing out at 45 MB/s. Two different Dell teams have looked it over and both of them say everything is configured correctly. The escalated pro support guy told me that the performance I was seeing was expected. The pro deploy guy thought that maybe my SSDs were bad. All 20 of them, I guess.

I hate Dell so much right now.

What RAID config?
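Asking because the layout's write penalty matters a lot here; some hedged napkin math (the per-SSD IOPS and block size are assumptions, not your drives' real numbers):

code:

# Illustrative RAID write-penalty math -- all inputs are assumptions, not
# measurements from the array in question.

drives = 20
per_ssd_write_iops = 20_000     # assumed steady-state 4K random-write IOPS per SSD
block_kib = 4                   # assumed benchmark block size

# Classic write penalties: backend I/Os per host write.
write_penalty = {"RAID 10": 2, "RAID 5": 4, "RAID 6": 6}

for level, penalty in write_penalty.items():
    host_iops = drives * per_ssd_write_iops / penalty
    mib_s = host_iops * block_kib / 1024
    print(f"{level}: ~{host_iops:,.0f} host write IOPS (~{mib_s:,.0f} MiB/s at {block_kib}K)")

Even with pessimistic assumptions none of those land anywhere near 45 MB/s, which is why the layout (and whether write cache is actually enabled) is the first thing I'd ask about.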

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Vulture Culture posted:

It's 2019. Disks should load in like ammo magazines

Xyratex had a fun bulk loader tool that would drop ten at a time from the shipping box into an enclosure.

But why buy disks when you can do weird poo poo with SCM and QLC https://www.vastdata.com/

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Vulture Culture posted:

Are you looking for any kind of replication? Because that's going to be an absolute shitshow on any platform with these file counts.

Transitioning off of our ZFS-based snapshot delta replication onto an enterprise replication product has been an experience...

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

adorai posted:

It hurts me to do it, but I can recommend oracle for this sort of product. The ZFS appliances they acquired in the Sun acquisition are pretty great. And they aren't necessarily the "gently caress you" kind of pricing Oracle is known for.

They RIF’d most of the remaining ZFS/Solaris devs last year (?) though.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

There are some hdparm (?) or SCSI commands to flip the 512/520 sector size. I found a blog post about it like five years ago but did not save it. You’ll need at least one non-RAID controller with direct access to the SCSI devices, though.
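If memory serves, the heavy lifting was sg_format from sg3_utils; a minimal (and destructive, it wipes the drive) sketch, with the device path as a placeholder:

code:

# Reformat a 520-byte-sector SCSI drive to 512-byte sectors with sg3_utils'
# sg_format, driven from Python. Assumes the drive shows up as a pass-through
# /dev/sg* device (no RAID controller in the way) and that you want it erased.

import subprocess

device = "/dev/sg3"   # placeholder -- point at the actual pass-through device

# Low-level format to 512-byte logical blocks; destroys all data and can take
# a long time to complete.
subprocess.run(["sg_format", "--format", "--size=512", device], check=True)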

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Thanks Ants posted:

How do they manage to make them so deep?

Very carefully.

(I’m looking at a few platforms well north of 40”.)

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

evil_bunnY posted:

If I want a half a TB of HA NFS *delivered quickly*, who should I be talking to?

NetApp?

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

evil_bunnY posted:

We're already talking to them and Dell (Isilon). I'd like to know if I'm missing non-obvious players with EMEA presence.

Thread frequently mentions Pure Storage? I really like Vast but they’re probably out of scope for 500G.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

EFS :devil:

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

There’s no right number of disks for Ceph. You either don’t have enough disks or hosts, or you have way too many.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Qumulo is fine. It’s Isilon 2.0. There are some performance edge cases to be careful with.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Docjowles posted:

I'm a little jealous of people who get to play with cool/large storage poo poo. Feel like that's easily my biggest technical blind spot. Somehow all of my jobs have been one of

1) No major storage needs beyond like a Synology NAS
2) Boss pathologically opposed to the concept of shared storage because it's a single point of failure or other weird excuses (running critical workloads on a single host with a big rear end disk array hanging off it is better because ????????)
3) Petabytes of NetApp but there was already a dedicated and awesome storage engineer so I never really had to deal with it

These days I'm entirely working in the cloud and the only interesting aspect of storage is explaining to management how the hell they racked up a 6 figure monthly S3 bill

Very easily, lol.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

BlankSystemDaemon posted:

I've been out of the enterprise storage industry for a while, which is why I haven't been posting much ITT - but this 30TB 2.5" U.2 Kioxia drive did catch my attention, because 40PB/rack does sound pretty good. Even if the 4KB QD64 and 16KB QD32 random IOPS aren't very good, the NVMe interface still offers 2^16 queues with 2^16 entries each, and the sequential IOPS is pretty alright.
Only real downside is that it's got a DWPD of only 1.

I think the PB/1RU density is for the ruler drives; I think Supermicro U.2 servers top out at like 24 per U, which leaves you at a paltry 30PB per rack.
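Napkin math behind both figures (chassis densities and rack height are assumptions for illustration, not specific SKUs):

code:

# Rack-density napkin math. Drive counts per U and rack height are assumptions.

drive_tb = 30
rack_units = 42

chassis = {
    "U.2 (1U)": 24,      # assumed dense 1U U.2 chassis
    "ruler (1U)": 32,    # assumed 1U EDSFF "ruler" chassis
}

for name, drives_per_u in chassis.items():
    pb_per_rack = drives_per_u * drive_tb * rack_units / 1000
    print(f"{name}: ~{pb_per_rack:.0f} PB per {rack_units}U rack")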

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

evil_bunnY posted:

Do any of you run on-prem object stores (±1PB) you're happy with?

Ceph. It’s fine. Kinda FTE-intensive.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Yeah. Vendored Ceph is expensive and a real pain in the rear end if you’re not in sync with mainline Ceph development.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Vast is in scope for a PB and that performance, and would be in budget, but their big trick is dedupe, so it might not be that great here; although they do have a probe to run against your data set that’s worth checking out.

DDN has a couple of interesting QLC platforms that might be worth looking at.

There are a few Ceph-plus-glue systems out there; some people like them.

I’d check out Qumulo; it might be a good fit for a streaming workload.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Yeah, the Red Hat Ceph support pricing is ridiculous. Big streaming IO is a pretty well-understood pattern in HPC, but those file systems usually come with more consistency guarantees (and complexity) than you want.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Panasas is on the legacy side, but I know some folks like them. A well-designed GPFS setup isn’t bad, but it is very network-sensitive; IBM and Lenovo have products that can be very competitive. There are a few others that build GPFS platforms as well, but they need to pay for both their own platform and the GPFS licenses.

Lustre is open source, and a lot of the problems people tended to have with it have been mitigated in the newer versions (2.15+), but it’s still pretty admin-intensive. DDN and HPE have solutions. The development is primarily done by DDN; AWS building FSx for Lustre on top of community Lustre has kinda taken a lot of steam out of feature development, imho.

Weka comes up in conversations with sales people; they like to tier out to an object store.
