Computer viking
May 30, 2011
Now with less breakage.

Speaking of mirrors and raidz. I'm setting up a replacement file server at work, with 15x 16TB drives. Last time I did this, I had the impression that using mirrors was also preferable because it makes resilvering way quicker. Is that still a thing, or is resilvering a drive in raidz quick enough in practice?

I'm debating mirrors vs some level of raidz. The IO loads here are usually fairly nice; mostly reading and writing a single large file at a time (over NFS on 10gbit or SMB on 1gbit), though I'm sure there's the sporadic "thousand tiny files at once" case. I have enough disks to get enough space (for now) with mirrors, but more space is more space. Then again, more IOPS never hurts either. I also have 2x 500GB NVMe SSDs I'll need to consider the best use for, and a 300GB SAS SSD that Dell tossed into the server.

Tempted to just go with 7x mirrored pairs, one hot spare. As for the SSDs, uh. To be considered and perhaps benchmarked.
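For reference, that layout would look something like this - pool and device names are placeholders, not what the HBA will actually hand me:
code:
# 7x two-way mirrors striped together, plus one hot spare (15 disks)
zpool create tank \
    mirror da0 da1 \
    mirror da2 da3 \
    mirror da4 da5 \
    mirror da6 da7 \
    mirror da8 da9 \
    mirror da10 da11 \
    mirror da12 da13 \
    spare da14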

e: On a related note, I'm not overly impressed with the two PCIe slots in this server. There's fortunately also an OCP connector, but I'd happily trade that for more plain PCIe slots; there's plenty of space for them. Maybe even an NVME slot or three on the vast open expanses of motherboard. As is, NVME-on-PCIe-card + external SAS HBA + 10gbit NIC means I have zero PCIe connectivity left. Oh well, I assume the internal SAS controller eats a fair few lanes, and I don't know how many you get to play with on a modern Xeon.

e2: The 15 disks are not a hard limit or an ideal number; it's just what our vendor had in stock. There's space for 20.

Computer viking fucked around with this message at 10:56 on May 31, 2022


Computer viking
May 30, 2011
Now with less breakage.

BlankSystemDaemon posted:

Well, raidz resilver is limited by the speed at which the CPU can do the Reed-Solomon P (+Q, +R) parity maths - but if your CPU has AVX2 and you're using an implementation of OpenZFS that's new enough, the maths has been vectorized, and even without AVX2 it's still done using SIMD via SSE, so in practice you're usually limited by disk read speed rather than the CPU.
The other bottleneck can be the checksum algorithm; if you're using fletcher and your CPU supports SHA256 in hardware (which some newer Xeons do), you should absolutely be using that instead.

Striped mirrors can still be faster because it's almost guaranteed to be doing streaming I/O from one part of the mirror to the other, whereas there are scenarios where raidz resilver will look way more like random I/O - but all of that will depend on how your disks have been written to, which can be usecase-optimized to a certain extent.

If you have 18 disks in 2x raidz3, there's room for a couple hot-spares.
Alternatively, you can use draid as the entire point of that is to speed up raidz rebuilds by having spare blocks distributed across the vdev.

One option for the two NVMe SSDs is to use them with allocation classes as documented in zpoolconcepts(7), setting recordsize as well as special_small_blocks (documented in zfsprops(7)), so that the metadata as well as anything smaller than the threshold gets written to the SSDs - my recommendation would be that anything below the native sector size of the spinning rust should go there, but you can choose anything up to one power of 2 below your recordsize.

The big thing to remember is that the special vdevs used by allocation classes are not cache disks. The metadata and small blocks aren't stored anywhere else, irrespective of whether you're using it with striped mirrors or raidz.

EDIT: Special vdevs are also one of the only ways of making dedup bearable.
Although if you're doing dedup, you'll want to use zdb -S to simulate building a DDT (or do some napkin maths, with 80 bytes per on-disk LBA), because I'm pretty sure 500GB SSDs are too small for metadata+smallblocks+dedup.
You'll also want to have sha256 checksumming and a CPU or QuickAssist daughterboard that can offload it.

Right - that's quite useful; thanks.

It's a Xeon Silver 4309Y, which is listed on ARK as having AVX-512 and AES-NI, so I think I'm good there. Also, IO is realistically not the bottleneck for most things we do - if it can deliver a few hundred MB/sec in most uses, that's more than enough. Doing so efficiently never hurts, of course. And thanks for the sha256 tip, that's a smart use of CPU features I wouldn't have thought of. Also, it's good to hear the resilver should ... mostly be fast enough on raidz? I'm not entirely comfortable with the redundancy pattern of a large pack of two-disk mirrors, nor the 50% "waste" - though the performance is at least good. Still, I won't exactly mind using something else.

2x 9 disk raidz3 sounds very reasonable, though I'll have to wait for the last few disks; I'll test with 2x 7 disk raidz1 for now and prepare to nuke and reconfigure before putting it to use.
I like the idea of using the NVME disks as a special vdev - presumably an NVME mirror should be resilient enough. (And if it goes suddenly and completely bad, I guess that's why I set up the tape backup). I do remember hearing the BSD Now guys talk about the "redirect small writes to SSD" idea, but somehow didn't consider doing it here.
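Roughly what I have in mind, then, as a sketch - pool and device names made up, and the thresholds are just the suggestions from above:
code:
# eventual layout: 2x 9-disk raidz3 plus two hot spares (20 disks)
zpool create tank \
    raidz3 da0 da1 da2 da3 da4 da5 da6 da7 da8 \
    raidz3 da9 da10 da11 da12 da13 da14 da15 da16 da17 \
    spare da18 da19

# the NVMe mirror as a special (allocation class) vdev
zpool add tank special mirror nvd0 nvd1

# large-file workload, so a big recordsize; metadata plus anything
# at or below the spinning disks' native sector size goes to the SSDs
zfs set recordsize=1M tank
zfs set special_small_blocks=4K tank

# per the sha256 tip above
zfs set checksum=sha256 tank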

Maybe the 300GB SSD would work as an L2ARC, if it's decently fast? It's a role where it's fine that it's non-redundant, at least.
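At least adding and later dropping a cache device is non-destructive, so it's cheap to try - something like (placeholder device name again):
code:
# add the 300GB SAS SSD as L2ARC
zpool add tank cache da20
# and it can be removed again at any time without touching the data
zpool remove tank da20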

I think I'll just hold off the dedup. I've briefly tested it, and it doesn't really do a lot on our data; there's very little redundancy to work with.
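(The test was just the simulation mentioned above - for anyone following along, roughly:)
code:
# simulate building the dedup table and print a histogram,
# plus the projected dedup ratio on the last line
zdb -S tank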

Computer viking fucked around with this message at 12:37 on Jun 1, 2022

Computer viking
May 30, 2011
Now with less breakage.

BlankSystemDaemon posted:

As for 300GB L2ARC, that might not be the most efficient use of your memory. What's your working set (ie. what do you get from zfs-stats -A, if you're on one of the BSDs?), and how much system memory do you have?
L2ARC requires mapping LBAs from the SSD into memory, and every map of every LBA takes up about 80 bytes, so as an example, a 256GB SSD needs ~37GB of memory to map all of its storage into memory, and that's taking away memory resources not just from the ARC (which is orders of magnitude faster, compared to a SAS SSD), but also from user-space (ie. smbd, nfsd, and whatever else runs on the server).
(80*500118192)/(1024*1024*1024) ≈ 37.3GB, but it's not quite 80 bytes per LBA, I just don't remember the exact figure.

Ah, I didn't realize it was quite that expensive. I'll just use it as a boot drive, then.

As for the working set, the current server has just 32GB of RAM, and the new one 64GB. I'd like a lot more, but realistically I don't think it'll be a huge problem?
code:
------------------------------------------------------------------------
ZFS Subsystem Report                            Wed Jun  1 16:21:04 2022
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                208.96  m
        Recycle Misses:                         0
        Mutex Misses:                           204.19  k
        Evict Skips:                            72.51   k

ARC Size:                               55.94%  15.66   GiB
        Target Size: (Adaptive)         62.55%  17.51   GiB
        Min Size (Hard Limit):          12.50%  3.50    GiB
        Max Size (High Water):          8:1     28.00   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       18.06%  3.16    GiB
        Frequently Used Cache Size:     81.94%  14.35   GiB

ARC Hash Breakdown:
        Elements Max:                           7.08    m
        Elements Current:               54.04%  3.83    m
        Collisions:                             201.20  m
        Chain Max:                              11
        Chains:                                 972.67  k

Computer viking
May 30, 2011
Now with less breakage.

BlankSystemDaemon posted:

Yeah, L2ARC is expensive, memory-wise - but if you've got a working set that's several TB large, and don't have one of the high-end Xeon Platinum that can take multiple TB of memory, it's the only way to have any caching.

You cut off the statistics I was looking for, here's an example from my FreeBSD development laptop:

No, that's literally everything. Looking at it, I guess I should have used -AE or even -AEL? (The current server has an L2ARC set up.)
Of course, I did just reboot to cold-swap in a new drive, so it'll take a bit to get representative numbers again.

e: And sorry for the repeated editing of this post, I should draft before posting.

Computer viking fucked around with this message at 23:21 on Jun 1, 2022

Computer viking
May 30, 2011
Now with less breakage.

For historical reasons, it's not set up for boot environments - I think this install started as a 10.x in 2015, and I haven't really bothered to try and retrofit it in there.

But yes, it's really about time to jump to 13.1.
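If I take the binary-update route it's the usual freebsd-update dance - roughly this, from memory, so double-check against the handbook:
code:
freebsd-update -r 13.1-RELEASE upgrade
freebsd-update install
shutdown -r now
# after the reboot, run install again to finish replacing the old userland
freebsd-update install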

Computer viking
May 30, 2011
Now with less breakage.

Also, I'm now on 12.3, which was uneventful. I'll try 13 later.

Computer viking
May 30, 2011
Now with less breakage.

Also, if you're doing a five-disk raidz3, you'd get the same capacity and better performance from two mirrors and a hot spare, though of course traded against a bit higher risk.
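To make the capacity claim concrete (made-up device names):
code:
# five disks as raidz3: 5 - 3 parity = 2 disks of usable space
zpool create tank raidz3 da0 da1 da2 da3 da4
# five disks as two mirrors plus a spare: also 2 disks of usable space
zpool create tank mirror da0 da1 mirror da2 da3 spare da4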

Computer viking
May 30, 2011
Now with less breakage.

While obviously worse than a z3, those are not atrocious numbers. There's also the effect of having a hot spare around - ideally, that should shrink the window of vulnerability down to the time it takes to resilver a mirror?

Of course, real life problems like batch effects and identical aging take those numbers way down, so there's nothing wrong with being more careful. I would at least consider it, though.

Computer viking
May 30, 2011
Now with less breakage.

BlankSystemDaemon posted:

A hotspare isn't going to give you more availability, it just means you can automate device replacement.
That requires zfsd or zed though, and even then it probably also needs SES.

If we consider the chance of losing data from a mirror as "hours spent in single disk operation × chance of failure per hour", wouldn't anything that reliably shortens that window (by immediately starting a resilvering) reduce the overall chance of data loss?

I must admit I haven't ever touched the automated disk replacement tools, though. Maybe on the new server.
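Looking at the FreeBSD side of it, it doesn't seem like much - roughly this, judging from the man pages (untested by me):
code:
# zfsd watches for fault events and swaps in hot spares automatically
sysrc zfsd_enable=YES
service zfsd start
# optionally: automatically reuse a fresh disk inserted in the same slot
zpool set autoreplace=on tank
# and of course the pool needs a spare attached in the first place
zpool add tank spare da14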

Computer viking
May 30, 2011
Now with less breakage.

This feels like it's more of a question of definitions - in which case I assume you're right. :)

Still, consider a hypothetical. Two identical companies run identical storage hardware with identical raid levels. One has a well tested automatic hot spare system, and one has a weekly "walk around and replace broken disks" routine. Everything else being equal, I would expect the latter to have more cases of the second disk dying and taking out the mirror before the first failed disk was replaced, and thus more downtime (and restores from backups)?

Point taken on this not being a "just works" kind of thing, though.

Computer viking
May 30, 2011
Now with less breakage.

BlankSystemDaemon posted:

Sure, but ECC is good for much else besides this, so it's never a bad idea to have it if you can get it.

Also, a recent commit in FreeBSD added support for RDMA with RoCEv2 and iWARP for the E810.

Oh 100Gbit, that's fun.

I'm still thinking about setting up some 10Gbit at home, and I genuinely don't have the storage or consumers to make use of more. Even at work 10Gbit over ethernet is honestly more than good enough - but it would be neat to play with RDMA just to have some experience with it.

Computer viking
May 30, 2011
Now with less breakage.

BlankSystemDaemon posted:

If you're doing iSCSI over Ethernet (especially if backed with NVMe storage, but even without), you benefit quite a bit from the roughly 10 times lower latency of fiber with SFP+ modules compared with RJ45.
Other than that, it's not gonna mean much.

I have played with it, but no - it's all plain file storage. Makes sense, though.

wolrah posted:

For homelab-level fuckery 40G is the real sweet spot IMO. The hardware isn't much more expensive than 10G (in some cases it's cheaper) and it's generally compatible with 10G using inexpensive SFP+>QSFP+ adapters because it's literally just 4x10G acting as a single interface. Some 40G hardware, most commonly switches but also nicer NICs, can even break out those links into four independent 10G interfaces.

That is an interesting idea - though I'm very dependent on what shows up used; I'm not paying full price just for the bragging rights (er, learning experience). I'll keep it in mind when searching, though. :)

Computer viking fucked around with this message at 17:33 on Jun 14, 2022

Computer viking
May 30, 2011
Now with less breakage.

It's worth noting that I'm in Norway, and something about being in the EEA and Schengen and whatever else is relevant, but not the EU, makes the cost of international shipping here really unpredictable.

I'll keep the Chelsio cards in mind, though.

Computer viking
May 30, 2011
Now with less breakage.

Looks like I can get Mellanox 40gbit cards without transceivers for €67 from Germany, which isn't half bad. It's another €35 for shipping if I'm not happy with "early July to early August", but ... I may be.

Computer viking
May 30, 2011
Now with less breakage.

necrobobsledder: Can I guess that your use is something like "I have a wheeled rack of gear, and I'd like to ship it to some foreign country and back and have it work by just swapping the power lead into the UPS"? (Where the gear itself isn't overly picky about voltage and frequency)
Or is the gear itself picky, so you'd also like "always output 120V/60Hz"?

Not that I can help you at all; I'm just wondering if I'm understanding it right. :)

Computer viking
May 30, 2011
Now with less breakage.

Depends on the Realtek NIC - some of them work perfectly fine.

On the other hand I have an intel PCIe NIC in my FreeBSD NAS box for similar reasons, so maybe I shouldn't say anything.

Computer viking
May 30, 2011
Now with less breakage.

BlankSystemDaemon posted:

Speaking of notifications, it's important to remember that email is not a stable protocol for notification delivery; you need something that'll do push notifications to a process running on your laptop/desktop or mobile device.

It needs to be entirely independent from the set of MTAs, MDAs, and MUAs that, together with almost infinite amounts of best effort, wet duct tape, and fraying baling wire, hold internet mail together.

While keeping in mind that some parts of the mail infrastructure see enough use that people notice when they fail, while almost every other solution has a similar amount of jank but less user exposure.

Computer viking
May 30, 2011
Now with less breakage.

It looks like WD has moved to Red drives being SMR, while Red Plus and Pro are CMR.

If you're not familiar, the short summary is that CMR is "normal", and SMR is more space efficient (so you can get more TB per platter) but requires you to rewrite long stretches of data if you want to change them. SMR drives typically have a CMR region to buffer incoming writes and then quietly move things around in the background, so you may not notice if you only write moderate amounts of data at a time - but they get dog slow if that buffer region fills up before it has a chance to drain. (Reads should still be fine, though.) They are a cheaper way to get more storage, and given your use they're probably fine. Personally I'd still get a CMR drive just in case, though - they're not that much more expensive.

Red Pro drives mostly seem to be faster: 7200 rpm instead of the ~5600 of a Plus means lower latency and more throughput, but louder and warmer. Probably not important for you.

Computer viking fucked around with this message at 01:55 on Jul 13, 2022

Computer viking
May 30, 2011
Now with less breakage.

Yeah, all my experience with SATA port multipliers suggests that it's supremely cursed technology. SAS has the crucial benefit of actually working with no real fuss. You may have to mess with the controller settings to get true JBOD (that is, just present the disks as-is to the OS), but it should support SATA drives just fine.

Computer viking
May 30, 2011
Now with less breakage.

Something like this as the enclosure (I have no idea if that's a good choice, but just to illustrate the category), plus some sort of LSI SAS HBA with external ports (the LSI SAS 9300-8e seems to be recommended, but may be overkill - a 9200-8e may be more than enough). And whatever cable is appropriate, of course; SAS cables are their own fun thing. Most likely this will be an SFF-8088 cable or an SFF-8088 to SFF-8644.
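Once it's cabled up, checking that everything is visible is easy enough, at least on FreeBSD - something along the lines of:
code:
# list every disk the HBA can see, including those in the external shelf
camcontrol devlist
# list SES enclosures and which slot each disk sits in
sesutil map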

Computer viking fucked around with this message at 19:40 on Jul 20, 2022

Computer viking
May 30, 2011
Now with less breakage.

This is incidentally how the file servers at work ... work, except that both the server and the SAS enclosures are 2U units from HP or Dell. LSI adapters in IT mode, server and enclosure full of disks, ZFS.

I have my share of problems at work, but that part of the hardware has so far not been among them - it's strangely painless.

Computer viking
May 30, 2011
Now with less breakage.

fletcher posted:

I thought I wanted hot swap bays. Then I realized what a nightmare the thermals are in all of the cases that support them, and realized that I didn't actually need hot swap support outside of the novelty of it. My Node 804 is nice and quiet and my drives temps are great!

They're fun in the enterprise gear I have with them, but ... I've never needed to hotswap a drive.

Computer viking
May 30, 2011
Now with less breakage.

Ha, it feels like the overlap between Maxtor drives, Noctua fan, that case, the non-modular PSU and the red SATA cables is enough to date this build very precisely.

(I'll guess 2006?)

Computer viking
May 30, 2011
Now with less breakage.

Mr. Crow posted:

:barf:

Is that bolted to the floor too?

It's bolted to a piece of wood that's only slightly lighter than the floor.

Computer viking
May 30, 2011
Now with less breakage.

Sometimes I hate enterprise hardware.

Oh, you dare put a third party hard drive (that's identical to the certified ones we sell for a 5x markup) in our server? 11000 rpm fan speeds forever as a punishment. That IPMI command to knock the fan speeds back down that we've sometimes mentioned? Naah, we locked those out in a firmware update.

Next time I'm buying supermicro, preferred suppliers be damned.

(Dell R550.)


E: Ah yes, the same goes for adding a third party PCIe card - they even have a specific guide on how to disable their paranoid/punishing cooling algorithm for third-party cards, which doesn't work on the newest generation.
E2: Ah, the racadm command to disable the PCIe fan speed thing still works. Shame about the drives, but I think I can move enough non-Dell drives to the external DAS.
E3: Nevermind, with the PCIe cards calmed down, the minimum fan speed with third party drives jumps to 36% for a bit but then falls down to 13% again on its own. Huh.
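(The racadm bit, for anyone searching later - I believe it's this attribute, though the exact name may vary between iDRAC generations:)
code:
# tell iDRAC to stop ramping the fans for non-Dell PCIe cards
racadm set system.thermalsettings.ThirdPartyPCIFanResponse Disabled
# sanity check
racadm get system.thermalsettings.ThirdPartyPCIFanResponse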

Computer viking fucked around with this message at 15:33 on Aug 10, 2022

Computer viking
May 30, 2011
Now with less breakage.

Aware posted:

Somewhat ironically adding a second eBay CPU and filling out all the fan slots on an R740xd actually made the drat thing quieter overall. Either way don't buy Dell servers for home is my advice now.

It's actually at work, but yeah, same thing applies. If I had a dedicated server room and the budget to pay for an all-Dell setup, I'm sure it'd be a great piece of kit. That said, I don't really have a lot of great alternatives; HPE are worse, and I'm very limited in suppliers. They have a couple of Lenovo and Fujitsu servers that don't really work for my purpose, and one or two annoyingly outdated supermicro parts. I can order anything from Dell, though, so ... R550 + MD1400 it is.

And yeah I fully believe you; the fan speed algorithms on these machines seem to be mostly magic.

Computer viking
May 30, 2011
Now with less breakage.

I did think about that, yeah - though disconnecting one didn't seem to change anything. As of right now the MD1400 box is noisier, which I guess is good enough.

As for the MD1400, if you should ever need to deal with one: the SAS connector units at the back have debug USB mini ports (in 2022). If you plug it into a PC, they present as USB serial adapters - with two ports, though only one seems to do anything. Connect at 115200 baud, and you get a console asking for a password. Dell does not give that password out, but there is exactly one reddit post and zero other pages that have the right one. In the interest of doubling that count, it's "bluemoon".

You can then use the help command - it accepts the same commands as the earlier MD1xxx boxes did over serial, including "shutup 20 0" to set fan 0 to 20%. Any lower and it tends to go back up to 50% on its own shortly afterwards. There are two fans - one in each PSU - numbered 0 and 1.

As far as I can tell it does not matter which of the two SAS units you connect the USB cable to.
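Any serial terminal program works for the actual connection; the device node will obviously differ, but something like:
code:
# FreeBSD: the USB serial adapter typically shows up as cuaU<n>
cu -l /dev/cuaU0 -s 115200
# Linux equivalent:
# screen /dev/ttyUSB0 115200
# then "bluemoon" at the password prompt, and e.g. "shutup 20 0"
# to set fan 0 to 20%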

Beware, though - it looks like there's some foot-pointing guns scattered around in there.

Computer viking fucked around with this message at 01:25 on Aug 11, 2022

Computer viking
May 30, 2011
Now with less breakage.

I guess you could also use gstat - if the queue lengths start climbing above 1, you are probably bottlenecked by the disks?
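Something like this, keeping an eye on the L(q) and %busy columns:
code:
# only physical providers (skips partitions, labels, etc.)
gstat -p
# or limit it to the data disks:
gstat -f '^da[0-9]+$'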

Computer viking
May 30, 2011
Now with less breakage.

Do all consumer 2.5gbit NICs suck? We have two Intel i225 cards in different machines, and both have a coin toss chance of just not working on any given boot. The Realtek RTL8125 on my motherboard just doesn't work at all, which seems to be a common problem across multiple motherboard manufacturers who use it. I see the embedded Intel i225 on the Mikrotik router I'm waiting for has a dedicated forum thread about its issues, too.

I happen to have a 2.5gbit switch, so I optimistically thought that the NIC side should be a solved issue by now, but it seems kind of dire.

Computer viking
May 30, 2011
Now with less breakage.

BlankSystemDaemon posted:

2.5G is a bad stopgap solution for people who have RJ45 already embedded inside walls, and can't easily replace it with fiber.

Sure, but I would have expected the problems to be "it's hard to get full speed over most cabling" or "it uses too much power", not "the hardware, firmware and drivers all seem to have been made by the less competent interns".

Computer viking
May 30, 2011
Now with less breakage.

The realtek or the intel?

Either way, I guess that is promising - it proves it's possible to make it work well.

Computer viking
May 30, 2011
Now with less breakage.

Worst thing is, grub2 has support for reading a whole bunch of file systems and is (as far as I can tell without trying) designed to make it easy to plug in more. They just like their convoluted initrd designs over in Linux land, I guess. :shrug:

Computer viking
May 30, 2011
Now with less breakage.

Hughlander posted:

I don’t understand. I’ve been doing proxmox with zfs boot on Linux for the past 5 years. What’s lacking?

Nothing, if it works it works. It's just not a given in the larger Linux ecosystem - IIRC, Fedora routinely breaks ZFS if you install their recommended kernel upgrades.
Also, a more ZFS-first OS may have some neat extra tools. The boot environments mentioned are basically the ability to make clones of the boot drive before upgrades (or indeed at any point you want), and to boot from any of them or roll back to them at will. It's possible to make this work on Linux, and it's not the end of the world not to have it ... but it is neat.
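On FreeBSD with the standard ZFS-on-root layout, for example, the whole thing is just bectl (a sketch, assuming the usual zroot/ROOT dataset structure):
code:
# clone the current root before an upgrade
bectl create pre-13.1-upgrade
bectl list
# if the upgrade goes sideways, point the loader back at the old one
bectl activate pre-13.1-upgrade
# (or pick it one-off from the loader's boot environment menu)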

Computer viking
May 30, 2011
Now with less breakage.

My impression of Microsoft and consumer file systems is that they want to treat Windows PCs like fat clients, with the primary storage being Onedrive, or a NAS for business desktops. If local disk is only really for impersonal data (software which can be downloaded again) and checked out copies from cloud/NAS, then there's less incentive to develop a fancier new file system.

Computer viking
May 30, 2011
Now with less breakage.

Truenas is perfectly fine for this - both Core and Scale should let you configure containers for things like that fairly easily. I use Core (the FreeBSD one), and I think most of the things you listed are available as plugins (e.g. preconfigured jails), and the rest should be doable. I bet they are mostly available as containers on Scale (the Linux based one), too.

As for their Storj partnership, I bet it's way overblown for marketing and in practice means they're an option in a dropdown somewhere.

Computer viking
May 30, 2011
Now with less breakage.

My Truenas (core) machine is my old i5 6600K gaming machine with extra RAM, a pair of large spinning disks, and an M2 SSD left over from an upgrade. It's fine.

The one concession I've made is to throw an AliExpress Special™ Realtek 2.5Gbit network card in it - I got 190MB/s from the Steam cache we set up earlier today, so it seems to be working ok.
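If I want to test the NIC itself rather than the cache, iperf3 between the NAS and the desktop is the obvious check (hostname made up):
code:
# on the NAS
iperf3 -s
# on the desktop
iperf3 -c truenas.local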

Computer viking
May 30, 2011
Now with less breakage.

I fully expect there to be at least one "I turned a battery pack from a crashed Tesla into a UPS" video on youtube.

But yeah, you'd think putting a bunch of 4680 cells and a new charge controller into an existing UPS design, and selling it for a non-silly price, would be within the capabilities of most companies.

Computer viking
May 30, 2011
Now with less breakage.

Just go full industrial and install a motor-generator set in the basement; with a suitable flywheel it should be able to smooth out a lot of noise and transient events.

(Note: Do not do this)

Computer viking
May 30, 2011
Now with less breakage.

Klyith posted:

I've never understood why twice as much of a thing costs more money. Just an amazingly hosed up state of affairs when you think about it.

That kind of made me curious about how much of the cost of a consumer grade UPS is the batteries. Looking at CyberPower, and rounding a bit, their $200 UPS has $100 replacement batteries. Which suggests that twice the capacity would cost $300, plus whatever it costs to ship the more expensive unit, build the larger chassis, design and support another variant, and whatever modifications are needed to actually charge and use the extra cells. Plus the risk of running into the regulations you mention, of course.

Basically, it seems like it's not "twice as much", it's "twice as much of something that makes up about half the price of the unit, plus a complicated overhead". Which I guess in a way is the answer to "why does it cost twice as much to double the capacity".


Computer viking
May 30, 2011
Now with less breakage.

BlankSystemDaemon posted:

You'll want a disk shelf with a SAS connector and a SAS HBA with external ports, as SAS is compatible with SATA.
The used market should have plenty of 2.5" disk shelves in 2U rack size.

Though be careful, this is apparently not 100% - I just found out the Dell MD1400 somehow manages to not support SATA disks.
e: According to the internet. I've got one I'm not actively using yet, so I can throw a SATA disk in it and see what happens.

  • Reply