Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
What's been people's experience with SATA port multipliers? How reliable are they, and how much of a performance hit should I expect if I use them?

You can get a 5:1 multiplying backplane (no trays, just a circuit board) for $45-$60, and making slide-trays for that would be trivial in a shop. I know Backblaze uses them extensively, but their business model is massive amounts of low-performance storage space.

Alternately, I could use PCIe extension cables and locate the SATA cards directly in the external box, but that seems excessively hacky, and since anything beyond 4-port cards gets into multi-hundred-dollar territory pretty quickly, it would also chew through PCIe slots rapidly.

Edit: Two corrections: $45 is the bulk price, so my small setup would be at the $60ish retail price. Also expanding the question to SAS JBOD cards + expanders, if those are cheap as well.

Harik fucked around with this message at 03:27 on Apr 27, 2013


Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

Isn't hotswap overkill for home use? I can live with my DVR being offline for 10 minutes while I put in an RMA drive. I'd rather take the savings and put it into more storage.

I'm leaning towards the R4 simply because my current NAS is the loudest piece of hardware I have; all my other systems are 120mm fans + SSD so 7 (very old) spinning disks make a racket. A case lined with sound dampening foam is really attractive to me.

In storage-oriented "fun", I'm in the process of, not really rescuing, more like rehabilitating, a long-serving 6TB BTRFS setup. I let it fill up (zero allocatable chunks remaining, all writes going to partially-filled chunks) and stay there for about 16 months, with some minor churn as DVR'd shows expired and new ones were recorded. Basically the pathological worst-case scenario for btrfs.

It came to a head a few days ago, when the free extent table was finally fragmented enough to not all fit in RAM. At that point it basically exploded, locking all RAM with partial transactions and evicting every user process until the whole system crashed.

The current rehab process is picking a set of files to offload to free up a significant amount of space, then removing them one at a time with a sync in between. The sync limits it to one transaction in flight at a time, which is slow but at least completes.

That accomplished, the freed-up space needs to be converted back to unallocated chunks, so a full balance run. As soon as that's done, it's time to find the worst-offender fragmented files (sqlite databases, mostly), defragment them, and mark them nodatacow.
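Roughly what that looks like in commands (paths are made up, and note chattr +C only reliably sticks on empty files or directories, so the databases really want to be copied back into a +C directory):

code:
# one delete per transaction, with an explicit sync between each
while read -r f; do
    rm -- "$f"
    btrfs filesystem sync /mnt/array
done < /root/offload-list.txt

# full balance to turn the freed extents back into unallocated chunks
btrfs balance start /mnt/array

# then hit the worst offenders and stop CoW on them going forward
btrfs filesystem defragment -r /mnt/array/dvr/db
chattr +C /mnt/array/dvr/db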

I'm OK with the way it's worked out - I went very far down the heavily warned-against "do not ever do this" path and can still recover from it. I think if I'd been running with the now-available online defrag I'd never have seen this issue. I had to move my DVR writes to a different disk, but I can still watch what I'd previously recorded while rebuilding.

Realistically the 'fix' here is going to be doubling the disk space: use btrfs-send to copy everything (with snapshots intact) to a fresh array of the same size, then wipe the current storage and add it to that. For shits and giggles, throw in an SSD and try out bcache as well, see how well that works.

I could try ZFS, but this old machine has about half the recommended RAM per TB for it. Even when I upgrade, I'm likely to be putting 12-14TB of storage on an 8GB server, which is below what they want. Has anyone pushed ZFS in the same kind of worst-case way? How does it handle lots of churn with no free space?

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

spoon daddy posted:

If you are going to hit up ebay for a m1015, do note the m1115 uses the same chip and is cross flashable with the IT firmware. It also can be cheaper than the M1015 since most people just search for the m1015. I've got a M1015 and M1115 both working side by side in the same system using the same firmware.

Speaking of alternates, I'm assuming Dell/HP etc. have versions of the same LSI card; what are their part numbers to watch eBay for?

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

Toshimo posted:

So, troubleshooting some unrelated items last week, I reset my CMOS, which caused my onboard Intel RAID controller to take a giant steaming poo poo all over my array's metadata. I tried a few things, but was unable to resuscitate it. Now, I've given myself over to recreating, reformatting, and refilling my array. One thing I can't seem to find anything on is if there is a way to actually back up the metadata so I don't ever have to go through the seven stages of loss for my 9TB array with 15 years of data on it again. Any help?

I've had to deal with blown arrays over the absolute stupidest poo poo - a client shut down and moved their setup (rackmount server, 2 rackmount drive arrays, SCSI). They swapped the connections to the controller when they reconnected it and lost everything. I mean, I guess I could have understood it complaining until you swapped them back, but no, it just decided to resync the array with the disks in the completely wrong order, which of course meant it trashed everything recalculating "parity" with the wrong bits. Thanks, LSI.

I just trust (Linux, BSD, or ZFS) software RAID more. At least I know exactly what it's doing, and a controller failure just means I replace the hardware, not restore from backup.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

eddiewalker posted:

Tight? I thought the packing looked fine.

Static bags inside the stiff little bubble-suits with each drive in its own corrugated box.

Yeah, I got 4 WD REDs from amazon (double checked that it was Amazon and not some lovely storefront), and they were in a plastic spacer in their own box, packed into a larger box and surrounded by padding.

I've had the same bad luck with Newegg OEM drives as the rest of the horror stories here. I don't get why they don't invest in some kind of cheap standard packaging; it's not like 3.5" drives are going to change size on them.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
Out of curiosity - anyone moving data off their NAS at better than Gig-E? Borderline question between here and the networking thread, but I figure the people here are more relevant to what kind of performance you can expect from eBay'd gear. Specifically looking at 40Gb InfiniBand gear because it's so cheap, and since the use is Linux to Linux, NFS over RDMA.
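For reference, the NFS-over-RDMA part is pretty simple on current kernels - something like the below, from memory, with the hostname and export path obviously made up (module names can differ slightly by kernel/distro):

code:
# server side (assumes the export already works over TCP):
modprobe svcrdma
echo "rdma 20049" > /proc/fs/nfsd/portlist

# client side:
modprobe rpcrdma
mount -t nfs -o vers=4.2,proto=rdma,port=20049 nas:/export/projects /mnt/projects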

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

admiraldennis posted:

Yeah - I use a 40GbE link between my NAS and my gaming PC (pair of Mellanox ConnectX-3); and a 10GbE link between my NAS and my MacBook Pro (using Chelsio T520-CR - has great Mac drivers; there's no cheap used market for 40GbE cards for Mac). All eBay gear. I'm using Ethernet not Infiniband.

If you're just streaming media for consumption, Gigabit is totally sufficient, even for 4k. But if you want to access project files (e.g. video editing, video game assets, etc), run games, run VMs, etc directly from your NAS, 10GbE minimum is much nicer. I run all of my games (and large projects) directly off of the NAS, and it's pretty great overall.

Agreed there, most of my streaming is over wireless to fire sticks so obviously gig-e is sufficient for that. It's the project work (huge datasets/video editing/etc) that I want to speed up.

admiraldennis posted:

My pool consists of 2x raid-z2 of 6x 8TB WD Red Drives. FWIW, CrystalDiskMark numbers over 40GbE look like this:

Holy hell, those sequential speeds approach my NVMe. Gets killed on random, as expected, but that's some massive bandwidth. Thanks for confirming it's viable.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
I'd like to add that the space used is almost entirely about the retention length, not the density of snapshots. I haven't used ZFS, but most snapshot systems have very little overhead; the space is almost entirely taken up by files you've deleted since the snapshot was taken.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
:ohdear: I hope I didn't make a mistake.

Bought an X8SI6-F for $80, because it came with the X3450, heatsink, and 8GB of registered ECC, tested working together.

I think that's about the same price as those components separately? I'm just worried because it seems like a good deal but the listing was nearly up and nobody else bought it.

https://www.ebay.com/itm/123193214900 was the listing, dunno if anyone but me can see it.

Should be a healthy upgrade from the Athlon 64x2 that my NAS was running on.

What cable do I need to go from the SAS ports to SATA3 drives? NVM, it's a CBL-0097L-02 and they're cheap.

Harik fucked around with this message at 19:01 on Jul 15, 2018

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

IOwnCalculus posted:

I paid a similar price for a non-F version of that exact same setup. It's a solid deal, and the onboard controller flashes to IT firmware no problem.

What's the SI6 non-F? I only see non-F for SIE and SIA.

Definitely want to put more RAM in this puppy so I can offload services from my desktop back to the NAS where they belong. My main reason for buying it is that I'm absolutely maxed out on capacity on my current setup. I've got drives hanging off a PCI slot, so I'm actually bandwidth bound. I shuffled it to only have the 10TB backup disk on there, but still. It's a PCI SATA card because I need two gig-E ports, so a dual gig-E card is sitting in the only PCIe slot.

Upgrades from here:

24GB of RAM. It's got 1600 11-13; theoretically I could run that at 1333 9-11, since the latencies are absolute, not in terms of clocks (11 x 1333/1600 is about 9.2 clocks, so CL9 is right at the edge; quick math below this list). That's if I can convince the board to apply custom timings, of course.

High-endurance SSD for cache (PCIe, or a 512GB Samsung Pro SATA).

More drives now that I have the ports for them. What's the current thread favorite for 8 or 10TB NAS disks?

10GbE. I'm wired with Cat 6a; I want to make use of it.
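The quick math on that first item, for anyone checking my work (just bc, nothing exotic):

code:
# absolute latency (ns) = clocks / clock speed (MHz) * 1000, where clock = data rate / 2
echo "11 / (1600/2) * 1000" | bc -l    # 13.75 ns at DDR3-1600 CL11
echo "11 * 1333 / 1600" | bc -l        # ~9.16 clocks for the same 13.75 ns at 1333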

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
Anyone using BTRFS in RAID1 mode? I think it works something like unRAID, in that it picks two disks of the pool to write the data to and balances the unused space. This server is old enough that it's BTRFS on MD RAID5, but what the hell, layering violations are the name of the game for "modern" filesystems, right?
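For reference, the mode I mean (device names made up; it's chunk-level RAID1 across however many disks you give it):

code:
mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd

# or convert an existing filesystem in place:
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/pool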

E: Re-asking this from earlier:

Harik posted:

More drives now that I have the ports for them. What's the current thread favorite for 8 or 10TB NAS disks?

I'm running 4TB Reds right now, but I wouldn't mind moving to 7200 RPM now that my controller and network can handle the throughput. In the big-drive space there's the Deskstar NAS, Ultrastar He10, Toshiba X300, and Western Digital Gold. Well, and Seagate, but no.

I'd prefer not to shuck for 5400 RPM Reds, but I will if the price/TB difference is significant.

Harik fucked around with this message at 01:29 on Jul 16, 2018

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
OK, there's a bunch of people here with the X8SI6-F: the gently caress is up with the RAM situation? Apparently Intel completely shat the bed that generation, because what does and doesn't work is stupidly complicated. Theoretically mine is coming with 8GB of RAM tested, except the modules are 4GB 1R, and no LGA1156 chip works with more than 2GB per rank.

I'd like 24 or 32 in there, but not if I have to pay stupid amounts for special low-density multi-rank parts or run at half the memory speed.

So, help? What RAM can you actually put in this thing? I'd like to order now so it'll be here at the same time as the board next week.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

Devian666 posted:

I don't want to comment on ram as I hosed up an order at the end of last year for my super micro atom server. Whatever you order make sure it meets the requirements on the specification exactly. I tried using non-ECC memory when the specs said ECC and the bios spat an error.

:dogbutton:

IOwnCalculus posted:

I never upgraded mine. The manual claims registered RAM will work but I think it really only wants unbuffered. Either that or I just got the wrong ranking. The RAM I mistakenly bought for it works fine in both my X8DT6 and X9DRW, though :v:

:smith:

Unbuffered ECC "works" but limits you. It really wants registered. You probably got the wrong ranks, because Intel. Maybe. Or it wanted a newer BIOS.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

EVIL Gibson posted:

For my server I did go with a USB stick initially, but sticks have a serious problem with wear. Much more limited writes compared to an SSD.

Caching on the stick was enough of a concern that I went to an SSD, but I hope it was just my paranoia and not real.

It's a major concern. I burned the entire life expectancy of an 850 EVO doing caching.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
Quick followup - in trying to get the X8SI6-F running I hit a snag. The on-board SATA does not work in BIOS, at all. I had to turn on the BIOS option for the SAS and put my boot media on there in order to even boot.

It must have worked at some point, as it came with one of those 16GB high-endurance SATA-socket SSDs installed in it, but that wasn't recognized at boot either.

It boots, but it takes forever to run the SAS2008 option ROM at POST; I'd prefer to disable that and just let the controller come up as the OS starts.

(Also, I'm out 6 SATA ports if I can't figure this out)

At least updating the IPMI and cross-flashing the SAS to IT mode went smoothly.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

H2SO4 posted:

Have you tried booting with that SATADOM unplugged? I've had misbehaving drives make an HBA unhappy at boot before.

Yes, it's a weirdly shaped little bugger (flat, with a SATA connector on the face), so when it's plugged in it blocks most of the SATA bank. I had to pull it to use the SATA ports, although I could re-install it so it only takes up 2.

The SATA does work after boot; I was able to access it and found it was a pull from an old (retired Jan 2015) Solaris-based ReadyNAS. Reading a random ZFS partition was interesting, but ultimately it only had a single old 1TB disk in it to test that everything still worked. It did get me to try out ZFS on Linux at least.

H110Hawk posted:

Do a physical cmos reset (from the motherboard. Make sure to read the whole process as some you keep the battery installed and some you remove.) remove all devices and see if it shows up. From there start adding devices back in starting with the satadom or a regular hard drive from the shelf.

You might also have a bad battery. Those things cause the strangest errors when the voltage is low. It's a dollar for a new one and you likely should do it with any used server right off the bat. Hardware stores sell them.

Unlikely it's the battery, since for as long as I can remember CMOS throws a checksum error when the battery lets the data corrupt. It's in service now, but I'll try a full CMOS reset next time I work on it. I can't hang the SSD off the onboard SATA anyway because those ports are only 3Gb/s, so it'll just be for the 16GB flash it came with as boot/root. Pretty sure I have a spare battery too; I'll throw it in just because it's easier to do that now than after it dies.

Also, I'm 99% sure it's running RAM the CPU is supposed to be physically incapable of using. No idea how that works, but both Intel and Supermicro insist it can't take 4GB single-rank DIMMs, yet that's exactly what the module datasheet says these are.

Am I reading this wrong? Because 4GB/rank would be amazing; I could put a cubic fuckton of RAM in this so cheaply.

E: IPMI is such a godsend; the server is now in its own separate room at the other end of the house. It's in the corner of our oversized walk-in closet where the network/security drops all terminate. Balancing a monitor/keyboard in there sucked.

Harik fucked around with this message at 16:44 on Aug 7, 2018

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

IOwnCalculus posted:

Depending on the box I'm using I'll either use the Supermicro IPMITool local client, or I'll just connect to it using a web browser. I have one server in my work lab that refuses to load the KVM console in the local client but works fine via browser :iiam:

Try updating the IPMI firmware (if the lab server is something you can risk doing a firmware update on). The KVM console popped up and insta-closed on this board until I did that.

I'm surprised Java doesn't explode at the self-signed cert the JWS file points at. I had to add security exceptions, and got a big popup on the web client whining about it.

H110Hawk posted:

:allears:

While that is the failure mode that is written in the manual, and perhaps what you have experienced, I can tell you from way more than anecdotal experience that those batteries cause the weirdest errors. This is across several makes (SuperMicro, ASUS, Quanta, Dell off the top of my head), probably a hundred models, and conservatively 20k servers.

Interesting. I've never seen it fail in any other way than a checksum error. Either way I'll have downtime to add more RAM when it arrives so I'll be putting a new battery in then.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
Why are people transcoding in 2018? I haven't had a media box in years that can't play anything I throw at it.

Kodi on fire TV, built-in player on my roku, etc.

Plex shouldn't be a problem just streaming files as-is.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

IOwnCalculus posted:

Might be a 15% coupon on eBay tomorrow too.

I keep missing these. How do you guess that there might be a code upcoming? Do they do it sometime around quarterly or what?

RAM trip report: RIP the 2Rx4GB experiment. I now have two 8GB DDR3R-1333 ECC sticks to put back up on eBay. Not sure why the Hynix 1Rx4 works when it's not supposed to, but I'll fill my slots up with that and deal with "only" 24GB. Intel. :argh:

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
Couple of shucking questions - where can I find the price history for the 8/10TB Easystores from Best Buy? I missed all the sales and they're $200+ now; trying to guess when they'll drop again.

Secondly, the Elements 8TB - I understand that can be an HGST DC320 (without helium and at lowered RPM?) or an EFAZ... They're $150 now, which is fine if the drives are decent. I understand it won't be the 256MB-cache drive you might get in an Easystore.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

vodkat posted:

What's the safest way to connect to and manage my Synology box over the internet?

VPN in to manage it. There have been a number of remote exploits, so you don't want any way to access it without a VPN.

quote:

User controlled input is not sufficiently sanitized, and then passed to
execve function.

Successful exploitation of this vulnerability enables a remote
unauthenticated user to run commands as root on the machine.

That was one year ago, but who knows what else is lurking in the current build?

Harik fucked around with this message at 14:24 on Dec 21, 2018

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
The 8TB went down to $170, which makes the 10TB a no-brainer right now if you need the space.

What's the low-end of the 10s? I know 8s go as low as $140, usually $150 on sales.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
Three freshly shucked 10TB Easystores all yielded white-label EMAZ drives.

Wasn't bad, but I had to use a jeweler's flathead to convince them to start moving at the spine after I did the shimming; a cut-up loyalty card just deformed rather than motivating it. For possible RMA purposes, I'm just tossing the USB circuit, bumpers, and screws back in the clamshell, closing it loosely, then throwing the boxes in my garage. Is there a better way, or does that about cover it?

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

You're all so loving welcome, because of course I just got easystores yesterday at $200 each and you can't price-match between them.

What does the Easystore premium buy you, a chance at helium?

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
I once removed a dirty bcache when it was wedged instead of hitting the reset switch and letting it do recovery, so of course there went all the latest metadata of my btrfs.

Being that kind of stubborn idiot, I proceeded to write and contribute a fairly significant set of changes to the btrfs recovery tools to reconstruct a best-available tree and dump it.

What I'm saying is these posts are giving me PTSD.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
Anyone tried refurb enterprise drives? They're a lot cheaper (sometimes RAID-10-instead-of-RAID-5 cheaper), but I don't know how often I'd have to be replacing them.

My 30TB array filled up to 400GB free (a media duplication issue; either it got re-downloaded or copied instead of moved), but it's time to make the wallet cry again.

In other news, ZFS on a set of CPU-attached 4TB U.2 drives is ludicrously fast. I've only got 6 right now, but the server can take 12. Much smaller pool than my NAS, but the stuff running on it doesn't ever get IO-starved.

I'm not even tuned properly, but I'm getting north of 100k 4K random read IOPS, 50k write, and >5.2GB/s sustained reads on the zvol.
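For comparison purposes, that's the kind of number you'd get out of something like the fio runs below (parameters and the zvol path are illustrative, not exactly what I ran):

code:
# 4K random read IOPS against the zvol
fio --name=randread --filename=/dev/zvol/tank/bench --direct=1 --ioengine=libaio \
    --rw=randread --bs=4k --iodepth=32 --numjobs=8 --runtime=60 --time_based --group_reporting

# sustained sequential read bandwidth
fio --name=seqread --filename=/dev/zvol/tank/bench --direct=1 --ioengine=libaio \
    --rw=read --bs=1m --iodepth=16 --numjobs=1 --runtime=60 --time_based --group_reporting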

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

FCKGW posted:

Mostly just open-box returns and the like.

Not open-box drives for these; you get them at dedicated storefronts that sell 1U servers and whatnot. Enterprise gear: it has the 3.3V line or whatever that you have to mask off, sometimes 10K RPM instead of 5400/7200...

BlankSystemDaemon posted:

You might be interested in this:

Absolutely, thanks for that.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

Nitrousoxide posted:

A read/write ssd cache can help alleviate the abysmal performance on SMR drives. At least during uploads. They'll still be working constantly in a ZFS/BTRFS array shuffling bits around with their lovely write speeds. But at least when you are wanting to upload stuff smaller in size than your cache it'll be done quicker. Though it obviously increases cost, and as I understand it with Synology, the write cache can be a risky thing to enable. That said, if you're just getting like a pair of 250gb drives the price really is not bad.

Edit: I have a 500GB read cache on my Synology NAS, which has 4 SMR 8TB drives. It has noticeably helped speed up recall times from the drives, even for reads, since the heads are constantly busy catching up on needed writes. The NAS can pull from the read cache 80% of the time for requests, which frees up a lot of head time for writes.

A lot of caveats here:

* You will burn up any consumer SSD as a write cache for a NAS. Been there, done that. You need enterprise U.2/SAS/PCIe SSDs with endurance ratings in DWPD (Drive Writes Per Day) and a 5-year warranty. A consumer SSD can be written for 100-200x its capacity before failing. Enterprise SSD endurance ratings start at around 2000x the capacity and go up from there (1 DWPD over a 5-year warranty already works out to roughly 1,800 full drive writes).
* When your write-cache SSD dies, you lose all data on your NAS. It breaks ordering guarantees for the underlying disks because the SSD effectively is the underlying disk now. A write is 'complete' when the SSD accepts it, so when it goes the array is completely inconsistent, and good luck recovering it. Again: been there, done that, contributed the changes to the recovery tools that I had to make to dig myself out of that mess. Definitely doing a RAID1 on the write cache next time.
* You can avoid both of those by sticking to a read cache, but that does almost nothing to help write speeds on shingles since writing a single track requires one read and two writes to "layer" the changes. Agree with your edit though, it does help significantly with contention.

If you're acquiring Linux ISOs via Usenet or torrents to store on your NAS, I suggest using an SSD where the downloads 'land' and using the postprocessing scripting of your downloader to copy them to the main array later. One long write is a lot better than all the random writes it has to do while downloading, and if that SSD dies you only lose things you were in the middle of downloading anyway.
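A minimal version of that postprocessing hook, assuming your downloader hands the script the completed path as its first argument (paths are made up; the Usenet and torrent clients all have some way to call a script on completion):

code:
#!/bin/sh
# $1 = completed download directory on the SSD landing zone
SRC="$1"
DEST=/mnt/array/media/incoming

# one long sequential copy onto the spinning array, then drop it from the SSD
rsync -a --remove-source-files "$SRC/" "$DEST/$(basename "$SRC")/"
find "$SRC" -type d -empty -delete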

BlankSystemDaemon posted:

All of that can still result in reads, from the point of view of an individual drive, looking like random I/O.

I think we've got two things conflated here: spinning rust in general hates random I/O because it has to physically move objects with mass to seek, and SMR is somehow even worse because its zones work the same way as flash: append-only blocks that have to be erased before they can be rewritten, except now it's done at the speed of physical media instead of solid state.

It's a lot of words, but the tl;dr is SMR is complete dogshit, and if you get one that wasn't marked as such, demand a refund.

Why is ZFS resilvering random IO, though? It should be contiguous, with the only seeks being to hop over empty blocks. I'm missing the benefit of having a given block of data not be at the same position on all the drives involved.

E: oh NVM lmao it's so awful why on earth would anyone think it was a good idea?
https://blogs.oracle.com/solaris/post/sequential-resilvering

Premature optimization: not even once.

Harik fucked around with this message at 07:28 on Jul 15, 2023

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
My old NAS is getting fairly long in the tooth, being cobbled together from a recycled netgear readyNAS motherboard (built circa 2011, picked it up in 2018) and a pair of old xeon x3450s.

It worked out pretty well but the ancient xeons limit to 8GB RAM is preventing me from offloading much unto it aside from serving files, so my desktop ends up running all my in-house containers. Not a great setup.

Looking at something like an older epyc system (https://www.ebay.com/itm/175307460477 / https://www.supermicro.com/en/products/motherboard/H11SSL-i) But I'm curious of anyone else has run across other recycled gear that's a good fit for a NAS + VM host.

Also, has anyone used PCIe U.2 adapters, such as https://www.amazon.com/StarTech-com-U-2-PCIe-Adapter-PEX4SFF8639/dp/B072JK2XLC ? I've had good luck with PCIe-NVMe adapters so I'm hoping it's a similar thing where it just brings out the signal lines and lets the drive do whatever.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

BlankSystemDaemon posted:

Sure, spinning rust hates random I/O, but SMR is still many times worse at it than regular harddrives - because the non-volatile flash that it uses to speed up regular I/O when using a traditional filesystem without RAID is simply masking the terrible performance that SMR has for anything but sequential access.

Random I/O or random writes? There shouldn't be a spectacular difference on random reads given servo and platter technology; it's the writes that are so brutally inefficient.

quote:

I can't open that link, but you should be aware that OracleZFS and OpenZFS are not the same anymore; not only did they start diverging back in 2009, but at this point way more than 50% of the shared code has been rewritten, and a lot more code has been added.
All of which is to say that however it works on OracleZFS, it's irrelevant unless you're one of the unfortunate folks that're locked into funding Larry Ellisons research into sucking blood out of young people to inject into himself.



quote:

In the initial days of ZFS some pointed out that ZFS resilvering was metadata driven and was therefore super fast : after all we only had to resilver data that was in-use compared to traditional storage that has to resilver entire disk even if there is no actual data stored. And indeed on newly created pools ZFS was super fast for resilvering.

But of course storage pools rarely stay empty. So what happened when pools grew to store large quantities of data ? Well we basically had to resilver most blocks present on a failed disk. So the advantage of only resilvering what is actually present is not much of a advantage, in real life, for ZFS.

And while ZFS based storage grew in importance, so did disk sizes. The disk sizes that people put in production are growing very fast showing the appetite of customers to store vast quantities of data. This is happening despite the fact that those disks are not delivering significantly more IOPS than their ancestors. As time goes by, a trend that has lasted forever, we have fewer and fewer IOPS available to service a given unit of data. Here ZFSSA storage arrays with TB class caches are certainly helping the trend. Disk IOPS don't matter as much as before because all of the hot data is cached inside ZFS. So customers gladly tradeoff IOPS for capacity given that ZFSSA deliver tons of cached IOPS and ultra cheap GB of storage.
And then comes resilvering...
So when a disk goes bad, one has to resilver all of the data on it. It is assured at that point that we will be accessing all of the data from surviving disks in the raid group and that this is not a highly cached set. And here was the rub with old style ZFS resilvering : the metadata driven algorithm was actually generating small random IOPS. The old algorithm was actually going through all of the blocks file by file, snapshot by snapshot. When it found an element to resilver, it would issue the IOPS necessary for that operation. Because of the nature of ZFS, the populating of those blocks didn't lead to a sequential workload on the resilvering disks.

So in a worst case scenario, we would have to issue small random IOPS covering 100% of what was stored on the failed disk and issue small random writes to the new disk coming in as a replacement. With big disks and very low IOPS rating comes ugly resilvering times. That effect was also compounded by a voluntary design balance that was strongly biased to protect application load. The compounded effect was month long resilvering.
The Solution
To solve this, we designed a subtly modified version of resilvering. We split the algorithm in two phases. The populating phase and the iterating phase. The populating phase is mostly unchanged over the previous algorithm except that, when encountering a block to resilver, instead of issuing the small random IOPS, we generate a new on disk log of them. After having iterated through all of the metadata and discovered all of the elements that need to be resilvered we now can sort these blocks by physical disk offset and issue the I/O in ascending order. This in turn allows the ZIO subsystem to aggregate adjacent I/O more efficiently leading to fewer larger I/Os issued to the disk. And by virtue of issuing I/Os in physical order it allows the disk to serve these IOPS at the streaming limit of the disks (say 100MB/sec) rather than being IOPS limited (say 200 IOPS).

So we hold a strategy that allows us to resilver nearly as fast as physically possible by the given disk hardware. With that newly acquired capability of ZFS, comes the requirement to service application load with a limited impact from resilvering. We therefore have some mechanism to limit resilvering load in the presence of application load. Our stated goal is to be able to run through resilvering at 1TB/day (1TB of data reconstructed on the replacing drive) even in the face of an active workload.

As disks are getting bigger and bigger, all storage vendors will see increasing resilvering times. The good news is that, since Solaris 11.2 and ZFSSA since 2013.1.2, ZFS is now able to run resilvering with much of the same disk throughput limits as the rest of non-ZFS based storage.

The sequential resilvering performance on a RAIDZ pool is particularly noticeable to this happy Solaris 11.2 customer saying It is really good to see the new feature work so well in practice.

Given the description of the problem with OpenZFS, it sounds like they followed Oracle down the wrong path first, then copied their fix.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

Kivi posted:

tugm4770 is legit awesome seller, have bought twice from him and the communication and goods have been superb. I bought H12SSL-i and 7302p combo from him, and later upgraded that 7302 to 7443 and both transaction went without hitch. Surprisingly good on power too.

That's a great platform to build your NAS on. I'd pick H12 series MB because
1) upgrade path to Milan at some point (20-30% perf uplift, no more weird NUMA)
2) PCIe 4.0 probably won't make a difference but you know, more bandwidth never hurt?

Well, that's cool, I wasn't expecting someone to have done the exact thing I was thinking of. Cool. I've gotta decide on a few other bits of the build and figure out what my budget for more drives is. Just missed the sale on 18TB Red Pros, damnit: $13/TB.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

Rexxed posted:

Yeah, it seemed like an issue with fairly new units at the time so I figure my older ones are probably fine since I've had most of them 5-15 years at this point, but I've bought a couple of APCs in the last three years or so. The only thing I don't like about APC is that APC seems to be stuck on the idea of selling replacement batteries as a cartridge which is just two normal 12V SLA batteries taped together to a connector that connects to the unit as a plug. That way they can sell you $40 (each one costs about $20) of batteries for $80. You can just buy two batteries and replace them on the connector and put some packing tape on the sides as a replacement. It's one extra step that cyberpower didn't bother to upcharge for, although they'll sell you cyberpower approved batteries for double the normal price as well. I just get 8x mighty max 12V 9Ah for $160 on amazon or whatever every two or three years since I have a lot of UPSes.

That's the same thing? Or do you mean that CyberPower doesn't bother with the double-sided tape on the batteries?

I had a business account at Batteries Plus that brought their prices down close to Amazon level, and every time I replaced APC batteries they happily taped them together and transferred the connectors for me. It was worth it since A) same-day replacement and B) I'd have to drop off the old ones for recycling anyway.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
but that was some of the most entertaining posting in months, IE!




Still working on my NAS refresh, looking for some case options. 6-8 20TB drives, maybe 10-12 eventually. ATX, for one of the Supermicro Epyc boards discussed a couple pages ago. I was hoping for something not quite as pricey as the Define XL, but if that's the best option I'll just go with that. Trying to keep it under 3 grand, but lol, $2100 in storage alone for 100TB. Hopefully some good shuck deals come around.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

VostokProgram posted:

Btrfs can do arbitrary disk mixing, but it has the write hole problem :whitewater:

Do not use btrfs RAID for any reason; there is no btrfs RAID mode that won't gently caress up your data sooner or later. Btrfs on top of a RAID array is fine and works extremely well. That said, I'm using ZFS on my next pool and retiring at least the smaller btrfs array, and possibly the larger as well.

Speaking of bad ideas: refurb Seagate Exos for $10/TB? 1-year eBay warranty, and the reseller claims to give you 5 years (lol).

https://www.ebay.com/itm/155636746868

40% off buys a lot of spares, even if the 5-year warranty is bullshit.

BlankSystemDaemon posted:

I think Linux has more attempts at fixing the Zen 1 errata (of which there is quite a bit, including some very serious ones), but since there's now Zenbleed which requires mitigations that'll probably affect performance by 50% or more, I think most people would want to avoid Zen through Zen 2.

Zen2 only. Zen1, 1+ and 3 don't have the issue.

UpCloud found no significant performance impact:

quote:

At the time of publication, AMD released a microcode update for the affected processors. Their mitigation is implemented via the MSR register, which turns off a floating point optimization that otherwise would have allowed a move operation. In our testing, applying this mitigation has not had a detrimental impact on overall server performance.

Nobody seems to have benchmarked it, which is weird.
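If anyone wants to poke at it themselves, the software workaround published in the advisory (as I remember it; needs msr-tools, Zen 2 only) is just setting the DE_CFG chicken bit:

code:
# set bit 9 of MSR 0xC0011029 on every core (mitigation for CVE-2023-20593)
modprobe msr
wrmsr -a 0xc0011029 $(($(rdmsr -c 0xc0011029) | (1<<9)))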

Harik fucked around with this message at 12:48 on Sep 3, 2023

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop
Is there a disk-oriented case besides the Meshify 2? Because it doesn't actually come with hardware for anything beyond 6 drives, and it's at least another $100 to buy the rest of the drive kits for it.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

IOwnCalculus posted:

Goharddrive is legit. I bought some drives from them via Newegg, one died years later, they refunded that drive in full after I shipped it back. I would absolutely use the savings to buy a spare or two, though.

I bought a bunch of 10TB drives from them recently but none of those have died yet.

Cool, awesome. I'll use the savings to go N+2 and probably a hot spare as well. 6x16TB is probably a big enough array; I "only" have 30TB now. 12TB of it will retire when I upgrade: a handful of 4TB drives where the entire array fits on one 16TB with room to spare.

Charles Leclerc posted:

Node 804 can do 12 drives before you need to buy extra brackets - 2x 4 drive cages in the back, 2 drives on the chassis floor and 2x 2.5 drives in the front panel. With a bit of fuckery and tight packaging I managed to get 12 3.5" drives in there but the airflow was a real problem. I've since migrated to a Define 7XL but you're back at the problem of needing to buy a bunch of brackets to fully use up the available space.

Unfortunately I'm planning on putting a full ATX Epyc server board in, and the 804 is mATX. Otherwise it's an awesome case.

I found a source for bare Fractal Design drive trays for $4 each, and I probably have enough of the rubber dampeners and screws kicking around from the dozens of machines I've built with only SSD or NVMe storage, so I'll just get those. An extra $30 on the case price isn't all that bad.

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

IOwnCalculus posted:

I have had ZFS poo poo bricks due to this before and I suspect it's because I run multiple vdevs. It sees all the disks and brings the pool up but starts failing checksums.

Hey, it's story time, prompted by a particularly nasty flashback this post brought on!

In the long-ago days of parallel SCSI drive racks, I once built a pair of systems hooked up to 4 external drive expanders via 2 SCSI cables each, using an LSI/Adaptec controller IIRC.

Some genius moved things around and swapped one set of cables, and Adaptec had zero metadata on the drives, so it just happily discovered drives by SCSI ID based on what it had in its EEPROM, started serving entirely random data, and instantly destroyed both of the redundant arrays.

After explaining to the client that their stupidity was the cause and that we had told them in writing not to move the servers without paying someone (me) to make sure it didn't get hosed up, I reflashed the controllers to JBOD and used software RAID. I've never touched hardware RAID since. They spent the next two months pissing and moaning and rebuilding their dataset. Turns out "we're not wasting money on backing up because it's transient data!" only works if the data is, in fact, transient.

e: I have a super hard time believing ZFS can accidentally import a drive from another pool. It absolutely shits bricks if you try to use a pool on another system without going through a whole "I'm really done with this pool, please mark it clean and don't try to use it at startup anymore" routine before moving the drives.
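(That routine being, to be fair, a single command on each end; pool name made up:)

code:
# on the old host: flush everything and mark the pool cleanly exported
zpool export tank

# on the new host (or after shuffling cables): re-scan and bring it up
zpool import tank      # add -f only if the export step got skipped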

Harik fucked around with this message at 06:21 on Sep 18, 2023

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

IOwnCalculus posted:

Not another pool, another vdev in the same pool. Think:

code:
raidz1
    sda
    sdb
    sdc
    sdd
raidz1
    sde
    sdf
    sdg
    sdh
Then the next reboot half the drives swap order but because zfs blindly trusts /etc/zfs/zpool.cache and imports the pool in exactly the order above, even though now it has two 'wrong' drives in each vdev.

I suppose you could avoid this by always importing by scanning the disks and never reading zpool.cache... but why not identify the drives in a meaningful manner? I don't miss the old days of mdraid having to reverse-engineer which /dev/sdX is dead and what physical drive that was before it stopped responding to everything.

that's literally madness. why the fu... what. WHAT. Throw a loving uuid in the metadata what the actual christ.

I flat-out refuse to believe ZFS will blow up a pool if your drives get imported in the wrong order. That's utterly unacceptable for a filesystem.

e: I'm spinning up a VM just to test this, istg if it does break I'll mock ZFS proponents for the rest of eternity.
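(Roughly the throwaway setup I have in mind; image names and sizes are arbitrary, attached to the VM as virtio disks:)

code:
# on the host: a pile of small backing images
truncate -s 5G /var/lib/libvirt/images/zfstest_{b,c,d,e,f,g,h,i}.img

# inside the VM: one raidz1 pool and one raidz2 pool
zpool create tank    raidz1 vdf vdg vdh vdi
zpool create testing raidz2 vdb vdc vdd vde

# then power off, shuffle which image backs which vdX, boot, and see what zpool thinks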

Harik fucked around with this message at 07:02 on Sep 22, 2023

Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

IOwnCalculus posted:

Not another pool, another vdev in the same pool. Think:

code:
raidz1
    sda
    sdb
    sdc
    sdd
raidz1
    sde
    sdf
    sdg
    sdh
Then the next reboot half the drives swap order but because zfs blindly trusts /etc/zfs/zpool.cache and imports the pool in exactly the order above, even though now it has two 'wrong' drives in each vdev.

I suppose you could avoid this by always importing by scanning the disks and never reading zpool.cache... but why not identify the drives in a meaningful manner? I don't miss the old days of mdraid having to reverse-engineer which /dev/sdX is dead and what physical drive that was before it stopped responding to everything.

This is definitively wrong, or at least I can't find a way to reproduce it when testing in a virtual machine.

I tried swapping disks in the same pool, and between pools.

Swapping between pools marked them as UNAVAIL and they had to be re-synchronized when they were re-added:

code:
Sep 22 03:14:18 zfstest zed[1255]: eid=1 class=statechange pool='tank' vdev=vdh1 vdev_state=UNAVAIL
Sep 22 03:14:18 zfstest zed[1271]: eid=2 class=statechange pool='tank' vdev=vdh1 vdev_state=UNAVAIL
Sep 22 03:14:18 zfstest zed[1297]: eid=8 class=statechange pool='testing' vdev=vdc1 vdev_state=UNAVAIL

harik@zfstest:~$ zpool list
NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank     19.5G   269K  19.5G        -         -     0%     0%  1.00x  DEGRADED  -
testing  19.5G  1.00G  18.5G        -         -     0%     5%  1.00x  DEGRADED  -
But there was no data loss. At worst, if enough drives shuffled around, it would fail to bring up the zpool until you located them by their new device names.

Swapping within a pool caused nothing to happen.

Returning the drives to their original order caused it to self-heal immediately:
code:
harik@zfstest:~$ zpool status 
  pool: tank
 state: ONLINE
  scan: resilvered 41K in 00:00:00 with 0 errors on Fri Sep 22 03:20:13 2023
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    vdf     ONLINE       0     0     0
	    vdg     ONLINE       0     0     0
	    vdh     ONLINE       0     0     0
	    vdi     ONLINE       0     0     0

errors: No known data errors

  pool: testing
 state: ONLINE
  scan: resilvered 62.5K in 00:00:00 with 0 errors on Fri Sep 22 03:20:13 2023
config:

	NAME        STATE     READ WRITE CKSUM
	testing     ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    vdb     ONLINE       0     0     0
	    vdc     ONLINE       0     0     0
	    vdd     ONLINE       0     0     0
	    vde     ONLINE       0     0     0

errors: No known data errors
If I'd been writing a lot to it while degraded, the resilvering would have been more expensive, I assume.

Harik fucked around with this message at 08:21 on Sep 22, 2023


Harik
Sep 9, 2001

From the hard streets of Moscow
First dog to touch the stars


Plaster Town Cop

BlankSystemDaemon posted:

All of the disks in my SAS chassis are whole-disk, but even if I used Linux, it wouldn't really address the Linuxism that's at the bottom of the issue.

The only 'Linuxism' in play here is sequentially naming drives as they're found, which doesn't result in corrupted blocks, because ZFS stores enough metadata on the drives to know where they belong in the pool. I tested swapping disk images around inside a pool while the VM was off and ZFS gave zero shits. I'm not sure what problem you ran across, but it either doesn't exist anymore or is different from what you're describing.

The only problem I found was that if a disk listed in the cache file is missing, it doesn't do a GUID scan of the other drives in the system to reclaim it. Maybe there's good reason for that; I can think of a few (disk snapshots being presented as volumes, for example: you don't want to mount a read-only drive snapshot as part of your raidz!).

The defaults prioritize availability, so as long as the array can function it will import it and fire it up, even with disks missing. And with whatever journaling they're doing, once the missing disk is found under a different name for whatever reason, it doesn't require a full resilver to bring it back into the pool.
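(And if the sdX/vdX shuffling ever does bite, re-importing by stable IDs sidesteps the cache file entirely; pool name made up:)

code:
zpool export tank
zpool import -d /dev/disk/by-id tank    # scan that directory and match disks by their ZFS GUIDs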
