BlankSystemDaemon
Mar 13, 2009



Moey posted:

How would send/receive help me in this current situation though? Without bringing in some temporary storage as a dumping ground for a two-transfer migration.

I don't see any benefit? Unless I am missing some feature of ZFS send/receive (which I admittedly do not use).

Edit:

Oooo, are you suggesting zpool send/receive within the same pool.

Same concept as the script, so every file does a copy then delete (of original)?

I was not aware send/receive could operate within a single pool (if that is what you were hinting at).

Double edit:

Like this? https://forum.proxmox.com/threads/zfs-send-recv-inside-same-pool.119983/post-521105
Yep, got it in one; zfs send -R tank/olddataset@snapshot | mbuffer | zfs receive tank/newdataset, then once it’s done you delete the old one and rename the new one.

If you’re smart about it, you enable zpool checkpoint until you’re satisfied that everything made it over - that way, you can revert administrative changes like dataset removal.
Just don’t forget to turn it off again.
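Roughly like this - the pool and dataset names here are just placeholders, adjust to taste:
code:
# guard against administrative mistakes while shuffling data around
zpool checkpoint tank

# snapshot the old dataset and replicate it to a new name in the same pool
zfs snapshot -r tank/olddataset@migrate
zfs send -R tank/olddataset@migrate | mbuffer | zfs receive tank/newdataset

# once you've verified everything made it over
zfs destroy -r tank/olddataset
zfs rename tank/newdataset tank/olddataset

# and don't forget to discard the checkpoint again
zpool checkpoint -d tank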

BlankSystemDaemon fucked around with this message at 11:37 on Mar 25, 2024


BlankSystemDaemon
Mar 13, 2009



Moey posted:

Neato. Gracias.

I'll do some testing before migrating data and letting it rip on the actual "final" disk layout.
One trick I was taught early was to truncate a small handful of files, give them GEOM gate devices so they're exposed via devfs the same way memory devices are, and create a testing pool to try commands on.

I still periodically do it if it's been a while since I've done some administrative task and want to make sure I'm doing it right.
This, of course, goes hand in hand with using the -n flag at least once before running it without, on any administrative command.
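As a rough sketch of what that looks like on FreeBSD - file paths, sizes, and the pool layout here are all made up:
code:
# create four sparse 1GB backing files
truncate -s 1G /tmp/testdisk0 /tmp/testdisk1 /tmp/testdisk2 /tmp/testdisk3

# expose them as GEOM gate devices, /dev/ggate0 through /dev/ggate3
for i in 0 1 2 3; do ggatel create -u $i /tmp/testdisk$i; done

# build a throwaway pool and dry-run the command you're unsure about
zpool create testpool mirror ggate0 ggate1
zpool add -n testpool mirror ggate2 ggate3

# tear everything down again
zpool destroy testpool
for i in 0 1 2 3; do ggatel destroy -u $i; done
rm /tmp/testdisk?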

BlankSystemDaemon
Mar 13, 2009



Computer viking posted:

Something about playing with a large stack of real disks (before putting them to their final use) feels good, though. I can't explain why.
But truncate can do arbitrary-sized files???

BlankSystemDaemon
Mar 13, 2009



Computer viking posted:

Sure, but they don't make fun disk access noises. :colbert:
If you can hear the disk access noises, you should be wearing hearing protection against the fan noise from all the fans in the rack :c00lbert:

BlankSystemDaemon fucked around with this message at 15:20 on Mar 26, 2024

BlankSystemDaemon
Mar 13, 2009



Wibla posted:

I thought that was BSD's job :smith:

Multiple vdevs in one pool will lock you into a certain drive/pool layout though, be aware of that.
Look buddy, I just manual-page-at-people here.

Talorat posted:

Tell me more about that second option. What’s a vdev? Will this allow the single pool to have a single mount point?
A ZFS pool consists of vdevs, each of which is its own RAID configuration, and data is spanned across multiple vdevs.
If you add a vdev to an existing pool, you expand the pool, and data will be distributed across the span such that the vdevs should end up approximately equally full.

See zfsconcepts(7).
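As a sketch with made-up device names, expanding a pool is just:
code:
# pool starts out as a single raidz2 vdev
zpool create tank raidz2 da0 da1 da2 da3 da4 da5

# adding a second raidz2 vdev expands the pool; new writes get spread
# across both vdevs
zpool add tank raidz2 da6 da7 da8 da9 da10 da11

# shows per-vdev capacity and how full each one is
zpool list -v tank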
EDIT: Looking at it, I think this article from Klara explains it best.

BlankSystemDaemon fucked around with this message at 08:25 on Mar 28, 2024

BlankSystemDaemon
Mar 13, 2009



Just an observation, but if you aren't willing to switch to ZFS, it probably means you either don't have backups, or don't trust them - so you might also wanna address that.

If you do trust your backups, you can remove one of the disks from the existing mirror, put a ZFS pool on it, move the data over to it, and then wipe the second disk and use zpool-attach(8) to add it, turning the single-disk vdev into a mirror.

If you're using something based on FreeBSD, you can also use gnop(8) and zpool-replace(8).
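A rough outline of that dance, with hypothetical device names (ada0/ada1) and assuming you've verified the backups first:
code:
# ada0 and ada1 are the two disks currently in the old (non-ZFS) mirror

# 1. degrade the old mirror by pulling ada1 out of it, then build a
#    single-disk pool on that disk
zpool create tank ada1

# 2. copy the data across however you like
rsync -a /mnt/oldmirror/ /tank/

# 3. wipe ada0 and attach it, turning the single-disk vdev into a mirror
zpool attach tank ada1 ada0

# 4. wait for the resilver to finish
zpool status tank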

BlankSystemDaemon
Mar 13, 2009



Moey posted:

Or they are using a different mature solution. Not everyone is in the same ZFS cult as you.

The backhand poo poo that comes out of you is amazing.
I'm sorry I phrased myself so poorly, it wasn't my intention to come off backhanded, but I can see how I did.
What I should've written is "if you aren't able to switch to ZFS by degrading your existing mirror" - because that's a fairly trivial operation if you have backups.

BlankSystemDaemon
Mar 13, 2009



ZFS works best with whole, unpartitioned disks - provided you don’t need to plug the disks into a system that’s likely to try and “initialize” a disk that appears empty, and don’t need things like boot records or swap partitions.

BlankSystemDaemon
Mar 13, 2009



mekyabetsu posted:

Yup, this is what I did when I created the pool. I ran “zpool create” with 2 drives that were unpartitioned, and that was the result. If that works for ZFS, it’s fine with me. I just wasn’t sure why it chose those particular partition types. I know ZFS was originally a Sun Solaris thing, so it’s probably related to that.

The 8M partitions were created automatically for what I assume is a very good reason.

This makes sense to me. Thank you! :)
I was phone-posting from bed when responding, so I didn't notice it then - but there's something you do want to take care of: switch to using /dev/disk/by-id/ for your devices, instead of plain /dev/ devices.
You need to do this because Linux is the one Unix-like that will happily reassign drive names between reboots (the reasons for this go back to its floppy disk handling) - so there's a small risk that a reshuffled name will trigger a resilver; typically that isn't a problem, but it does degrade the array while it runs, meaning a URE could cause data loss.
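Switching over is non-destructive; something like this, assuming your pool is called tank:
code:
# export the pool, then re-import it using the stable by-id paths
zpool export tank
zpool import -d /dev/disk/by-id tank

# verify the vdevs now show up under their persistent names
zpool status tank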

On my fileserver, the 24/7 online pool is a raidz2 of 3x6TB+1x8TB internal disks totalling ~20TB, and the offline onsite backup pool is just shy of 200TB in total, made up of 15x2TB raidz3 vdevs, each in its own SAS2 enclosure.
The internal drives are what the system boots from and where all the 24/7 storage lives, so they're partitioned: an EFI System Partition, a swap partition on the 8TB, and the rest used for root-on-ZFS.
The external drives are all completely unpartitioned, because that lets me simply run sesutil locate to turn on an LED and make it easy to identify the disk that needs replacing; then I just go pull the disk and insert a new one. This is the advantage of unpartitioned disks: ZFS automatically starts replacing the disk on its own, and if all devices in a vdev have been replaced with something bigger, the vdev grows automatically too (this is accomplished using the autoreplace and autoexpand properties documented in zpoolprops(7)).
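For reference, that's roughly this - the pool and disk names are just examples:
code:
# let ZFS start replacing a pulled disk automatically, and grow the vdev
# once every member has been swapped for something bigger
zpool set autoreplace=on backup
zpool set autoexpand=on backup

# on FreeBSD, light up the locate LED on the enclosure slot holding da42
sesutil locate da42 on
# ...swap the disk, then turn the LED off again
sesutil locate da42 off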

EDIT: I still need to figure out if it's possible to automatically turn on the fault LED in FreeBSD.
Trouble is, every failure of spinning rust I've had has been the kind of error that's hard to know about without ZFS (and about half have been impossible to figure out by using S.M.A.R.T alone), so I'm not sure I'd even have benefited.

BlankSystemDaemon fucked around with this message at 13:20 on Apr 20, 2024

BlankSystemDaemon
Mar 13, 2009



Computer viking posted:

Depends on if you use it as your everything-server, I guess - with enough VMs or containers it could make sense.
Even if you do use it as an everything-server, it’s not exactly easy to completely exhaust the amount of CPU cycles a server has.
A FreeBSD machine can sit with a load average of 2x the number of cores it has, and still manage just fine so long as the priorities of synchronous, interactive and asynchronous tasks aren’t hosed up.

BlankSystemDaemon fucked around with this message at 13:46 on Apr 22, 2024

BlankSystemDaemon
Mar 13, 2009



movax posted:

But, question then -- for my metadata / special vdev, currently I have 2x 7.68TB SATA SSDs in a mirror. If I have 4 ports available, should I do a RAID-Z1 on that? Or, if I want to maintain the double-drive failure tolerance (less worried about this on a SSD to be fair), is my only choice a 4-drive RAID-Z2? Striped mirrors doesn't make sense because the wrong 2 drives there nuke the vdev whereas RAID-Z2 is any two. It's not NVMe but it's 6Gbps SATA on the PCH controller for a server mostly doing Linux ISOs, so I think it's fine.

I'll use the 1TB NVMe in the PCIe x1 slot as a L2ARC. No ZIL since its all async / spinning rust storage. boot-pool as 2x Innodisk SATA-DOMs (SLC) for system data-set. PLEX/VMs/etc all live on a different box.
For special and log vdevs, you're better off sticking to mirroring, since raidz1 adds padding and is much more affected by fragmentation if you ever get to the point of having few contiguous LBAs of free space.
You can do N-way mirroring, though - so instead of mirroring across 2 disks, you can mirror across three or more; that way, you can lose two disks and still not lose the pool.
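Something like this, with made-up NVMe device names:
code:
# adding a special vdev as a three-way mirror: any two of the three SSDs
# can fail without taking the whole pool with them
zpool add tank special mirror nvd0 nvd1 nvd2

# or, if you already have a two-way special mirror on nvd0/nvd1,
# widen it by attaching a third device
zpool attach tank nvd0 nvd2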

Also, it's worth mentioning that SAS can be daisy-chained up to 5 SAS enclosures per external SAS port - but unfortunately this is only available for real SAS enclosures.
For DIY, the best you can do is find a way to supply power to a SAS expander in the UNAS case, then buy SAS expanders to reduce the number of cables going from/to each device.

Minor nit: The ZFS Intent Log is part of the on-disk specification - when you add a slog device, you're really adding a separate log device to be used instead of the ZIL.

BlankSystemDaemon fucked around with this message at 09:18 on Apr 25, 2024

BlankSystemDaemon
Mar 13, 2009



I just use plain FreeBSD with zfs, bhyve, and jails - it works great, because it isn't limited to what the appliance makers have decided the appliance should be capable of, and equally importantly what they've decided they won't support.

BlankSystemDaemon
Mar 13, 2009



movax posted:

Thanks -- I may look at adding just one drive then so the redundancy is matched, cheaper that way as well vs buying two more SATA SSDs.

On SAS, I just got more SATA Drives so I'm stuck w/ SATA... should be fine, I imagine, and with not populating a mobo / CPU / etc in the case I expect a lot more room to work + airflow. Will have to see how good / quiet the stock fans are.

How do I go about sizing the special vdev, or I guess, 'reverse' sizing it -- I'm not worried about that storage for just metadata, but if I set that recordsize/whatever parameter to say 10M, then every file smaller than 10M will actually live on the SSD mirror, right? Until it fills up... and then do I gracefully degrade to having metadata move to HDDs, and I have to basically do a copy off/back on to fix the issue?
SAS HBAs support either the native SATA protocol or Serial ATA Tunneling Protocol if you're using SAS expanders.
So it really shouldn't be an issue, so long as you avoid some of the EMC SANs of the past, where they'd actually lock out those features until the customer paid for them.

EDIT: The biggest real difference with SAS is that you get access to the SCSI READ DEFECT DATA command, which is what we all wish S.M.A.R.T was.

BlankSystemDaemon fucked around with this message at 23:19 on Apr 26, 2024

BlankSystemDaemon
Mar 13, 2009



Hardware for a NAS is easy to get right. Buy a low-power Intel i3 or AMD APU with proper ECC support, throw in some memory, and make sure you have enough SATA and PCIe ports/slots to fill the bays.

Software-wise, there are innumerable ways to gently caress up.

BlankSystemDaemon
Mar 13, 2009



Korean Boomhauer posted:

Dumb question but is there anything that truenas scale does zfs-wise that i couldn't replicate with a cronjob if i were to just use zfs on proxmox and have a nfs/smb container somewhere? Like what all would I have to setup to make sure my data doesn't get exploded aside from a weekly scrub? I'm pretty sure I can view drive health right in proxmox as well.

I was originally going to do a truenas scale vm and passthrough one of the sata controllers and use that for NAS, but I wanna have some of my docker containers use some of the space on the NAS, like tubearchivist can store videos there, or romm can access roms on it (for 24 hours, after which they'll be deleted). I was reading that docker and nfs can be dicey, but nfsv4 was fine? I don't know!
Anything that an appliance OS does can typically be done by a general OS, and the general OS can do much more - it's just that you have to spend time getting it to work, whereas the appliance is supposed to be simpler.

Gonna Send It posted:

I like Sanoid/Syncoid for my simple home ZFS backup uses. I have it setup to pull from the primary server to the backup server via ssh juuuuust in case the primary is somehow compromised.

https://github.com/jimsalterjrs/sanoid
Yea, Jim Salter's sanoid is the go-to thing at this point, because it does everything right and doesn't try to gently caress with anything that isn't its own snapshots.

BlankSystemDaemon
Mar 13, 2009



IOwnCalculus posted:

These are two very different failure modes. Total and unrecoverable loss of three drives in the original vdev or four in the new vdev will cause you to lose the whole array, yes.

As long as the HBA/shelf/cable don't fail in a way that results in writing a huge amount of garbage to the disks, the most you'll experience is downtime until you resolve the issue, plus possibly a small amount of data corruption. Remember that ZFS was originally built for enterprise systems with multiple disk shelves attached to a controller; they had to expect that at some point an entire shelf of disks would disappear for any reason.
To go into detail for anyone interested who's never had to do it: it's sometimes possible to use zpool import -F and follow it up with zpool clear as well as a full scrub.
It should, of course, never be run without using the -n flag first, because it's one of those destructive administrative tasks that not even zpool checkpoint can save you from (despite checkpoints being made to guard against destructive administrative tasks), and it may also require the use of the -X flag.

Basically, it's the thing you try right before you start restoring from the backups that you've hopefully automated.
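In rough outline, with a placeholder pool name:
code:
# dry-run first: -n reports whether rewinding would make the pool
# importable again, without actually performing the recovery
zpool import -F -n tank

# if that looks sane, do the rewind import for real
zpool import -F tank

# absolute last resort, discards even more recent transactions:
# zpool import -FX tank

# then clear the errors and verify whatever is left
zpool clear tank
zpool scrub tank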

BlankSystemDaemon
Mar 13, 2009



Pablo Bluth posted:

Best case for 1Gbe is 125MB/s, which would be close to 12 hours (2.5GbE would get that down to under 5). Use iperf3 to get a measure of just network transfer speed without potential disk limits at either end.
Best case for 1Gbps is more like 117MB/s of actual TCP payload without jumbo frames, once you account for Ethernet framing, IP headers, and TCP overhead.
Jumbo frames can help a bit, but they need some amount of hardware support.
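Back-of-the-envelope, assuming a 1500-byte MTU and TCP timestamps enabled:
code:
on the wire per frame : 8 (preamble/SFD) + 1518 (frame) + 12 (interframe gap) = 1538 bytes
TCP payload per frame : 1500 - 20 (IP) - 20 (TCP) - 12 (timestamps)           = 1448 bytes

1448 / 1538 * 125 MB/s ≈ 117.7 MB/s of actual payload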

EDIT: This also assumes a high goodput ratio.

BlankSystemDaemon fucked around with this message at 00:18 on May 5, 2024

BlankSystemDaemon
Mar 13, 2009



PitViper posted:

After dealing with a clusterfuck of a drive swap (one disk failing smart and doing a replacement, then two more disks in the second z1 pool giving read errors and being faulted during the resilver) i'm left with a list of about 1400-1500 file errors from status -v.

Spot checking through, nothing seems to be amiss. They're pretty much all replaceable media files, but they all play back, are viewable without noticeable glitches, and whatnot. Is this a case of "replace the file later if errors are found" and otherwise carry on as usual? Nothing in this pool is any sort of irreplaceable, anything on it that's valuable exists in at least 2 other places between the cloud and physical media backups.

I'm currently mid-replacement on one of the two read-error-faulted disks, which are currently working without issue after a reboot and reseating all the drive cables, but it's made me finally decide to replace my 3 remaining disks with about 85k hours each on them. Worst part is the disk that initially kicked off the whole issue only has about 34k hours and was one of the newest in the pool, and never gave any issues day to day other than immediately failing an extended smart test that ran last week.

If anything, this is reinforcing my appreciation of ZFS and how fault-tolerant it is, even in the face of my own abject idiocy.
If they're video files, the saving grace is that video is a bitstream where only one frame out of the ~24/30/60 per second will be (partially) corrupt - or, if you're really unlucky, everything up until the next keyframe (at most a few seconds of video).
Similarly, text files aren't really susceptible to a bitflip, because you can open the file and probably figure out which value got switched (especially if it's a config file that has a linter/checker).
The worst files to have bitflips happen to are anything in a binary format without any kind of built-in error checking - things like the Windows Registry, logs from systemd, save files for most video games, programs and libraries, and stuff like that.

OpenZFS 2.2 has also implemented corrective receive, whereby, if you have a snapshot lying around that contains a good copy of the data that's been flipped, you can use zfs recv to fix it.
There's also an issue open to try and fix if you've only got very very big snapshots, but there's no implementation for that yet.
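As a sketch (the dataset and snapshot names are made up), healing looks something like this - the same snapshot has to exist on both the healthy source and the damaged dataset:
code:
# heal corrupted blocks in tank/media using a healthy copy of the
# snapshot sitting on a backup pool
zfs send backup/media@good | zfs receive -c tank/media@good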

IOwnCalculus posted:

Right? A regular RAID would've gone completely unrecoverable very early on in the process, here you're dealing with bitrot that's probably nigh-undetectable because it's such a small amount of data in a video file.
The ability of ZFS to know not just that something's broken, but what, is kinda magic the first time you experience it in action instead of losing your entire array.
It's still not backup, though. :science:

BlankSystemDaemon fucked around with this message at 00:04 on May 8, 2024

BlankSystemDaemon
Mar 13, 2009



I can't remember if I've posted this before, but I'm just gonna do it again if so:
The READ column is the number of failed READ commands issued by ZFS to the kernel's disk driver subsystem.
The WRITE column is the same, but for WRITE commands.
The CKSUM column is the number of times the drive returned something other than what ZFS knows should be there (based on the checksum) in response to a READ command.

BlankSystemDaemon fucked around with this message at 00:42 on May 8, 2024

BlankSystemDaemon
Mar 13, 2009



Harik posted:

took nearly a year because my dog got sick and wiped out my toy fund (he's fine now, good pupper)



I don't remember if anyone answered the question about PCIe -> U.2 adapters? I want to throw in a used enterprise U.2 for my torrent landing directory (nocow, not mirrored because i'm literally in the process of downloading linux isos and can just restart if the drive dies) and dunno, a mirrored pair of small optanes for service databases and other fast, write-heavy stuff. Maybe zfs metadata?

I've got 2 weeks before the last of the main hardware arrives and I can finally do this upgrade.
For PCIe, M.2 (with the right keying), U.2, and U.3 are all compatible - and unless you need bifurcation, can all be electrically coupled to work with the right adapter.

Also, cute pupper. Please pet him from me :kimchi:

Harik posted:

I wasn't even aware there was u.3 and lol it exists because the sas lines weren't shared with pcie on u.2. why do we have to keep dragging sata/sas into every interface? :iiam:

On another note entirely, I'm having trouble understanding the arc_summary

I guess some of these units are counts and others are sizes but that's super unclear. Are those 1.9 billion reads, or only 1.9 billion bytes that were cache-eligible? I'm not sure if that's doing great or just doing great on an extremely limited subset of my IO.
U.3 is, at least, the SATA+SAS+PCIe interface we've been wanting for a long time.
So of course now's the time for the hyperscalers to move to E1.L or E3.S, or even using NVMe-over-PCIe for spinning rust, because it simplifies the design of the rack servers.
:negative:

For ARC, the only thing that really matters is the hit/miss ratio.
If you're below 90% and haven't maxed your memory, download more RAM. If you've maxed your memory, you can look into L2ARC.
Just remember that L2ARC isn't MFU+MRU like ARC (it's a simple LRU-evict cache), and that every LBA on your L2ARC device will take up 70 bytes of memory that could otherwise be used by the ARC (meaning you can OOM your system if you add one that's too big).
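On FreeBSD you can eyeball the ratio straight from the kstat sysctls (arc_summary will do the division for you if it's installed):
code:
# raw hit/miss counters for the ARC
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses

# hit ratio = hits / (hits + misses)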

Combat Pretzel posted:

ARC total accesses? I presume so.

Regarding as to how much or little it is, you have to consider that these VMs run their own disk caches.
ARC is better than the virtualized guest OS' caching, though.

BlankSystemDaemon
Mar 13, 2009



Combat Pretzel posted:

FFS, you keep claiming this. It's 70 bytes per ZFS data block.

code:
L2ARC size (adaptive):                                         411.1 GiB
        Compressed:                                    93.2 %  383.4 GiB
        Header size:                                    0.1 %  607.6 MiB
I'd say that's a decent trade-off.
Huh, so it is.
The problem is, records in ZFS are variable size - and there's no real way to get the distribution across an entire pool.

You should still max out your memory before using L2ARC, though.

BlankSystemDaemon
Mar 13, 2009



Combat Pretzel posted:

A PR was just opened on the OpenZFS Github that'll enable a failsafe mode on metadata special devices, meaning all writes get passed through to the pool, too, so that in case that your special vdev craps out (for reasons, or simply lacking redundancy), your pool will be safe anyway. Guess I'll be using a spare NVMe slot for metadata, as soon this is deemed stable.
Huh, that's pretty neat!

BlankSystemDaemon
Mar 13, 2009



Generic Monk posted:

Lmao I added a SATA ssd to my pool for this only to realise i hadn’t read the small print that you can’t remove it and it’s now a single point of failure for the whole pool. Who knows this might hit truenas before either the drive dies or I nuke and rebuild the pool
With ZFS, there's nothing preventing you from replacing it with an NVMe SSD using the zpool replace command.
ZFS doesn't give a gently caress about what driver you're using, nor what the disk is.
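With made-up device names, that's just:
code:
# swap the SATA SSD out for an NVMe device in place
zpool replace tank ada4 nvd0

# watch the resilver onto the new device
zpool status tank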

Kibner posted:

It was tunable but could result in system stability issues, iirc. I found this after a brief search: https://github.com/openzfs/zfs/commit/518b4876022eee58b14903da09b99c01b8caa754

I believe that issue has been fixed (either as part of kernel or part of zfs, I don't know) but Scale now uses like 90% of my RAM.
That reminds me, I wonder how the work to integrate ARC into FreeBSD's unified buffer cache is going.

That'll be a pretty big advantage, if it turns out to be possible - something that I'm not quite sure of.
I think it's possible on FreeBSD and Illumos, but it doesn't seem likely to be possible on Linux.

Harik posted:

we were talking about this:

but maybe Generic Monk was talking about the existing implementation, which is lmao levels of bad.
The PR adds the ability to also write allocation classes (i.e. both metadata and dedup) onto the pool's regular vdevs, instead of only on the vdev used by allocation classes.

I'm interested to learn how Generic Monk managed to learn about the 'special' vdev, without learning that it should always have its own redundancy via mirroring (or even n-way mirroring), though.
All the documentation I know of makes a big deal out of making it very explicit.

BlankSystemDaemon fucked around with this message at 12:16 on May 15, 2024

BlankSystemDaemon
Mar 13, 2009



Whoever made that video needs to delete it, because it's almost deliberate misinformation to claim it's a "cache" and that you can use a single disk.


BlankSystemDaemon
Mar 13, 2009



Combat Pretzel posted:

4:18 "it should have some level of redundancy".
Even when demonstrating this with truncate'd files, I've used redundancy.
Wendel is a good enough educator that he should've known better.
