phosdex
Dec 16, 2005

Hed posted:

I'm looking to give my FreeNAS-11.2-U8 install some love. I have 5 6TB drives in a RAID-Z2. Just making sure that if I want to upgrade I need to buy 5 <newsize> drives and would have to power down and replace the drives one by one 5 times to get the new size?

The case I have is not a hot-swap thing but I'm open to switching that all at once if it would be easier.

edit: Supermicro Xeon D-1541 mobo with a Fractal 804 case.

I just finished building a new server, re-used my 804. It seems like it's still one of the better cases.

Not sure how many sata ports you have, but you could add the new drives up to your max ports. Then you don't have to power down for each swap.


withoutclass
Nov 6, 2007

Resist the siren call of rhinocerosness

College Slice
I'm not at all sure about RAID-Z, but when I replaced all the disks in my striped mirror vdev I swapped them one by one (power off, swap, power on), triggered a resilver each time, and when I was done swapping all four disks my pool size was automatically upgraded.

BlankSystemDaemon
Mar 13, 2009



phosdex posted:

I just finished building a new server, re-used my 804. It seems like it's still one of the better cases.

Not sure how many sata ports you have, but you could add the new drives up to your max ports. Then you don't have to power down for each swap.
Won't work until raidz expansion is implemented, which should hopefully be in OpenZFS 3.0.

withoutclass posted:

I'm not at all sure about RAID-Z, but when I replaced all the disks in my striped mirror vdev I swapped them one by one (power off, swap, power on), triggered a resilver each time, and when I was done swapping all four disks my pool size was automatically upgraded.
Replacing one disk at a time has always worked, but the autoexpand zpool property that I mentioned, which is listed in zpoolprops(7), is what makes the pool expand itself automatically. As you can see from the manual page, its default is off so you (or whoever set up the pool) probably flipped it to on.
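For reference, checking it and flipping it on is quick - this assumes a pool named tank, so substitute your own pool name:
code:
zpool get autoexpand tank
zpool set autoexpand=on tank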

withoutclass
Nov 6, 2007

Resist the siren call of rhinocerosness

College Slice

BlankSystemDaemon posted:

Won't work until raidz expansion is implemented, which should hopefully be in OpenZFS 3.0.

Replacing one disk at a time has always worked, but the autoexpand zpool property that I mentioned, which is listed in zpoolprops(7), is what makes the pool expand itself automatically. As you can see from the manual page, its default is off so you (or whoever set up the pool) probably flipped it to on.

I never did this intentionally so it's possible an upgrade of TrueNAS has it enabled. I don't even know where to check to see if it's enabled.

Klyith
Aug 3, 2007

GBS Pledge Week
So would it work to shut down a FreeNAS server, boot to a different OS, and just directly clone each drive to a new one? And then go back to FreeNAS and have it do a full check / scrub?

It seems like resilvering a full drive in parity-based RAID systems is always a kinda slow process, because it's reconstructing every data block from parity. And you have to do them one at a time for every drive. If you did direct copies it would run at max write speed, plus you could do multiple drives at once.

The copy operation would not have 100% data validation in the same way that a resilver does... But you'd still have all the parity data, so if any copy was flawed it would be caught (and likely repaired) right away during the scrub. Plus you'd still have the original drives.
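Roughly what I'm picturing per drive - purely a sketch, the device names are made up, tank stands in for the pool name, and if the new disk is larger you'd probably also want to move the backup GPT to the end before handing it back to FreeNAS:
code:
# clone the old member disk onto the new one (from a live Linux environment)
dd if=/dev/sdb of=/dev/sdc bs=1M status=progress
# relocate the backup GPT structures to the end of the (larger) new disk
sgdisk -e /dev/sdc
# then, back in FreeNAS, verify everything:
zpool scrub tank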

IOwnCalculus
Apr 2, 2003





BlankSystemDaemon posted:

Won't work until raidz expansion is implemented, which should hopefully be in OpenZFS 3.0.

They don't mean raidz expansion, they just mean being able to have more than one new drive physically installed per power cycle. They'd still need to resilver for each disk individually, just with fewer power cycles needed to swap drives between resilvers.

Shumagorath
Jun 6, 2001
At what point in the number of drives and ZFS redundancy level does an SSD write cache stop mattering, assuming the NAS is on a gigabit LAN with 1-2 concurrent clients? I ask because space is at a premium, and a proper hot-swappable vendor-made TrueNAS device fits into my remaining space / power budget better than a homemade Define 7 or even a Node/Silverstone. Since they're quite expensive I'm wondering what corners I can cut for my use case.

CopperHound
Feb 14, 2012

Oh boy, that's a can of worms, but IOPS is probably the limiting factor that doesn't scale up without additional vdevs.

Also, write cache only matters with synchronous writes, which you might not need or want for your use case.

CopperHound fucked around with this message at 19:26 on Apr 18, 2022

BlankSystemDaemon
Mar 13, 2009



Shumagorath posted:

At what point in the number of drives and ZFS redundancy level does an SSD write cache stop mattering, assuming the NAS is on a gigabit LAN with 1-2 concurrent clients? I ask because space is at a premium, and a proper hot-swappable vendor-made TrueNAS device fits into my remaining space / power budget better than a homemade Define 7 or even a Node/Silverstone. Since they're quite expensive I'm wondering what corners I can cut for my use case.
You don't have an SSD write cache on ZFS.
You have the Separate Intent Log, also known as the SLOG, and internally known as a log device (preferably two or more devices as a mirror). It temporarily but persistently (in case of power loss, which is why you want capacitor-backed SSDs) stores synchronous writes and rearranges them such that they can be written sequentially, so they can be more easily read off the disk later.
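Attaching one is a single command; as a sketch, with made-up device names and a pool called tank:
code:
# add a mirrored log device (SLOG) to the pool
zpool add tank log mirror ada4 ada5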

Asynchronous writes cannot be cached that way, but they can be stored on a separate device; you need to use a special vdev (which also stores the metadata, so preferably two or more devices as a mirror) and the per-dataset special_small_blocks property set to whichever size blocks you want to store on the special vdev - typically, the kind of asynchronous writes you want to store on the special vdev are small writes, i.e. when something doesn't fill the whole recordsize.
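As a sketch of that, again with hypothetical device and dataset names:
code:
# add a mirrored special vdev (metadata plus, optionally, small blocks)
zpool add tank special mirror nvd0 nvd1
# send blocks of 32K and smaller from this dataset to the special vdev
zfs set special_small_blocks=32K tank/somedataset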

Operations per second has always been the limiting factor for any RAID (or, indeed, single-disk devices), not just ZFS.
If you need database/SAP/enterprise levels of IOPS, you want to use NVMe instead as not only does it have more IOPS, it also has more queues - AHCI only has one queue, and you need NCQ or TCQ for reordering (the latter is for SCSI and SAS, but there's still only one queue).

BlankSystemDaemon fucked around with this message at 19:36 on Apr 18, 2022

Shumagorath
Jun 6, 2001
If the device role is just bulk file storage and Plex server it sounds like I can get away without SSDs entirely...?

BlankSystemDaemon
Mar 13, 2009



Shumagorath posted:

If the device role is just bulk file storage and Plex server it sounds like I can get away without SSDs entirely...?
Absolutely.

Shumagorath
Jun 6, 2001
iXSystems absolutely fleece you on drives so I'd be buying the appliance empty anyhow. What about RAM? I don't know the prices for ECC but 64GB sounds excessive for five bays...?

BlankSystemDaemon
Mar 13, 2009



Shumagorath posted:

iXSystems absolutely fleece you on drives so I'd be buying the appliance empty anyhow. What about RAM? I don't know the prices for ECC but 64GB sounds excessive for five bays...?
Assuming the motherboard has four DIMM slots and the CPU does dual-channel memory access, it's probably better to get all of that memory as two DIMMs, so that you can expand by simply buying two more DIMMs at a later point.

phosdex
Dec 16, 2005

Shumagorath posted:

iXSystems absolutely fleece you on drives so I'd be buying the appliance empty anyhow. What about RAM? I don't know the prices for ECC but 64GB sounds excessive for five bays...?

You would be perfectly fine with 16GB, which I think you can find for about $100 on ebay.

IOwnCalculus
Apr 2, 2003






The only possible caveat here is if you are populating said server with the wonders of torrents and/or usenet, having a small cheap SSD that's not actually part of the pool as a scratch disk can come in handy. Your horribly fragmented downloads get put together / unpacked on the SSD, and then *arr / your client can move the file over to the ZFS pool as a big continuous write.

Also, unpacking nzbs is miserable on spindles.

But used in this way it can be literally the cheapest nastiest "at least it's from a company I've heard of before" SSD you can get your hands on, because when it shits the bed it doesn't impact your pool.

BlankSystemDaemon
Mar 13, 2009



IOwnCalculus posted:

The only possible caveat here is if you are populating said server with the wonders of torrents and/or usenet, having a small cheap SSD that's not actually part of the pool as a scratch disk can come in handy. Your horribly fragmented downloads get put together / unpacked on the SSD, and then *arr / your client can move the file over to the ZFS pool as a big continuous write.

Also, unpacking nzbs is miserable on spindles.

But used in this way it can be literally the cheapest nastiest "at least it's from a company I've heard of before" SSD you can get your hands on, because when it shits the bed it doesn't impact your pool.
This works, but as an alternative you can just use a scratch dataset where you disable compression, primarycache, sync, et cetera - then download onto that, wait until things are complete, then move them to their permanent location.
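Something like this, assuming a pool called tank (the actual property values are off, none and disabled respectively):
code:
# scratch dataset for in-progress downloads
zfs create -o compression=off -o primarycache=none -o sync=disabled tank/scratch
# download into tank/scratch, then move completed files to their final dataset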

Hed
Mar 31, 2004

Fun Shoe
Thanks everyone for the replies. Sounds like I'll find the elbow in the HDD price curve and upgrade these 6TB drives one by one with a resilver in between. Thanks for telling me about the zpool autoexpand flag!

CopperHound
Feb 14, 2012

I forget if you are able to plug in extra drives, but wouldn't running
code:

zpool replace pool old_device new_device
With the old drives still in there run faster from not needing to do the parity calculations?

E: further reading suggests that I may be wrong.

IOwnCalculus
Apr 2, 2003





I've seen no difference in performance with a resilver whether or not the drive being replaced is still present and healthy.

Which is sort of annoying I suppose, it seems like in that scenario you really could just do a sequential read off of the old and write onto the new, but there must be other work that needs to go on as well that makes this impossible.

CopperHound
Feb 14, 2012

Can you at least run the resilver on multiple drives at once instead of a one at a time replace?

Nulldevice
Jun 17, 2006
Toilet Rascal
I've wondered if it's possible to do two disks at a time in a raidz2 when upgrading, but I've never tried. Maybe I'll fool around with it in a demo setup to see if it's possible. Figure if you're not breaking the array parity it could work in theory. But man I'd be pucker city while that rebuild was going on.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
If I want to go down the “san” route of having a fileserver that is only a (FreeBSD?) high speed data store and then linking over Infiniband QDR or FDR or 10 or 40 gbe to an application server (Linux?) that does the actual hosting, what is going to be the keywords I need to search for there?

Is this where you do iSCSI and serve block devices? If so, what filesystem do you put on the block devices, and doesn’t that have additional overhead/performance impact vs a ZFS managed dataset both at the SAN and app server levels?

The other way is NFS, right? And that has performance impact too of course.

For most stuff it absolutely won’t matter for a homelab level project, but Postgres has always been a pain point there. The snapshot stuff is super useful for migrations on that kind of stuff (although it can be done other ways ofc) and I want to toy around. I know there is going to be some performance impact (and I’m totally onboard with an Optane SLOG to help mitigate that) and I’ll be playing around with some tuning options and pgbench to see what works best/try to quantify the performance impact there. But it’d be helpful to start at least in the neighborhood of the right answer, using the right technologies at least :haw:

Kinda wondering if database services might be the exception and maybe that’s a service that gets provided by the SAN server even if the applications run on something else.

BlankSystemDaemon
Mar 13, 2009



CopperHound posted:

I forget if you are able to plug in extra drives, but wouldn't running
code:
zpool replace pool old_device new_device
With the old drives still in there run faster from not needing to do the parity calculations?

E: further reading suggests that I may be wrong.
I thought the case had all of its bays in use, but yeah if there's one free then this is definitely the way to go.

IOwnCalculus posted:

I've seen no difference in performance with a resilver whether or not the drive being replaced is still present and healthy.

Which is sort of annoying I suppose, it seems like in that scenario you really could just do a sequential read off of the old and write onto the new, but there must be other work that needs to go on as well that makes this impossible.
It won't affect raidz resilver because that's limited by the write speed of the disk you're resilvering onto (or, if the machine is slow enough and doesn't have SSE/SIMD/AVX2, by the Reed-Solomon reconstruction rate) - but what it does is it lets you resilver without compromising the P+Q distributed parity you have so that two individual disks have to fail before you're at risk of URE.

Nulldevice posted:

I've wondered if it's possible to do two disks at a time in a raidz2 when upgrading, but I've never tried. Maybe I'll fool around with it in a demo setup to see if it's possible. Figure if you're not breaking the array parity it could work in theory. But man I'd be pucker city while that rebuild was going on.

CopperHound posted:

Can you at least run the resilver on multiple drives at once instead of a one at a time replace?
I think you're both asking the same thing, so I'll answer them as one:
It's definitely possible, but if you encounter a URE during the rebuild on a record that doesn't have a ditto record (i.e. usually just data, unless you set copies=[23] on a dataset, as metadata has two ditto blocks), that record will be broken and you'll need to restore the file from backup.
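Mechanically it's just issuing the replaces back to back and letting them share a resilver pass - a sketch with made-up pool/device names, and with the old disks left in place so the parity isn't compromised while it runs:
code:
zpool replace tank da2 da6
zpool replace tank da3 da7
# both pairs should show up as "replacing" vdevs under a single resilver
zpool status tank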

Motronic
Nov 6, 2009

Paul MaudDib posted:

If I want to go down the “san” route of having a fileserver that is only a (FreeBSD?) high speed data store and then linking over Infiniband QDR or FDR or 10 or 40 gbe to an application server (Linux?) that does the actual hosting, what is going to be the keywords I need to search for there?

Is this where you do iSCSI and serve block devices? If so, what filesystem do you put on the block devices, and doesn’t that have additional overhead/performance impact vs a ZFS managed dataset both at the SAN and app server levels?

My setup is a TrueNAS with 10GbE to my switch, ESXi with 10GbE to my switch, and the second port of each 10GbE card connected directly together with a DAC on its own IP subnet.

In TrueNAS the iSCSI device is a file on the filesystem. At least the way I did it - not sure if there is a different way.

It's been working quite well. But I'm just a thread lurker/asking questions and certainly no expert on any of this. I have a sample size of exactly 1.

Chilled Milk
Jun 22, 2003

No one here is alone,
satellites in every home

IOwnCalculus posted:

The only possible caveat here is if you are populating said server with the wonders of torrents and/or usenet, having a small cheap SSD that's not actually part of the pool as a scratch disk can come in handy. Your horribly fragmented downloads get put together / unpacked on the SSD, and then *arr / your client can move the file over to the ZFS pool as a big continuous write.

Also, unpacking nzbs is miserable on spindles.

But used in this way it can be literally the cheapest nastiest "at least it's from a company I've heard of before" SSD you can get your hands on, because when it shits the bed it doesn't impact your pool.

Hmmm. My setup grew into:
Boot pool
My main spinny disk pool
Two 2TB SSDs in a mirrored pool that I run my VMs off of. Microcenter is a great source of these on the cheap.

Downloads do just live on that spinny pool though. Haven't noticed any issues but it might save a significant amount of wear if I rejiggered it the way you described. 🤔 I do have my *arr apps use hardlinks though so I can continue seeding my ISOs after they've been imported without storing twice.

Methylethylaldehyde
Oct 23, 2004

BAKA BAKA

Paul MaudDib posted:

If I want to go down the “san” route of having a fileserver that is only a (FreeBSD?) high speed data store and then linking over Infiniband QDR or FDR or 10 or 40 gbe to an application server (Linux?) that does the actual hosting, what is going to be the keywords I need to search for there?

Is this where you do iSCSI and serve block devices? If so, what filesystem do you put on the block devices, and doesn’t that have additional overhead/performance impact vs a ZFS managed dataset both at the SAN and app server levels?

The other way is NFS, right? And that has performance impact too of course.

For most stuff it absolutely won’t matter for a homelab level project, but Postgres has always been a pain point there. The snapshot stuff is super useful for migrations on that kind of stuff (although it can be done other ways ofc) and I want to toy around. I know there is going to be some performance impact (and I’m totally onboard with an Optane SLOG to help mitigate that) and I’ll be playing around with some tuning options and pgbench to see what works best/try to quantify the performance impact there. But it’d be helpful to start at least in the neighborhood of the right answer, using the right technologies at least :haw:

Kinda wondering if database services might be the exception and maybe that’s a service that gets provided by the SAN server even if the applications run on something else.

It really all depends on your workload, IOPS needs, management needs, and backing storage. ZFS backed iSCSI LUNs are a 100% valid way to do things, but you trade the SAN host's awareness of storage for the VM's awareness of storage. Caching behaves a little differently, compression can act a little oddly, but it behaves in the VM like a raw local disk so you can do anything to it that you could on a demo laptop or test system.

My hobo-SAN whitebox monstrosity is an old DDR3-era Supermicro case and mobo with an old Xeon in it, 32GB of RAM, running Solaris. Choosing between iSCSI, SMB and NFS: I wanted at-rest disk encryption and a Windows-based backup utility that needed VSS snapshots to work, so I ended up using iSCSI. The backing store is a 10-disk RAIDZ2 array, with no flash caching or anything fancy. The ZFS volume was published via COMSTAR iSCSI LUNs, mounted to my file server VM, and formatted NTFS. It hosts media, stores files, and generally does what I need it to with hardly any issues. Current system uptime is almost 400 days.

I'm really tempted to replace the mobo/cpu/ram to modernize it, and throw this stupid thing on it. Slot 8 M2 SSDs in it, make it a 2 vdev 4 drive RAIDZ1 array for double iops vs an 8 drive RAIDZ2, enjoy having a SMB share that can casually saturate 2 SFP+ links.
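That layout would be something like this, with hypothetical device names:
code:
# two 4-drive raidz1 vdevs in one pool, for roughly double the IOPS of a single 8-drive raidz2
zpool create fast raidz1 nvd0 nvd1 nvd2 nvd3 raidz1 nvd4 nvd5 nvd6 nvd7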

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost

Chilled Milk posted:

do have my *arr apps use hardlinks though so I can continue seeding my ISOs after they've been imported without storing twice.
Wait, you can't use hard links across file systems. How are you able to do this without some crazy ARC type setup with performance policies? Not being 100% sure whether ZFS datasets within a pool share inodes is the reason I still do symbolic links.

BlankSystemDaemon
Mar 13, 2009



Paul MaudDib posted:

If I want to go down the “san” route of having a fileserver that is only a (FreeBSD?) high speed data store and then linking over Infiniband QDR or FDR or 10 or 40 gbe to an application server (Linux?) that does the actual hosting, what is going to be the keywords I need to search for there?

Is this where you do iSCSI and serve block devices? If so, what filesystem do you put on the block devices, and doesn’t that have additional overhead/performance impact vs a ZFS managed dataset both at the SAN and app server levels?

The other way is NFS, right? And that has performance impact too of course.

For most stuff it absolutely won’t matter for a homelab level project, but Postgres has always been a pain point there. The snapshot stuff is super useful for migrations on that kind of stuff (although it can be done other ways ofc) and I want to toy around. I know there is going to be some performance impact (and I’m totally onboard with an Optane SLOG to help mitigate that) and I’ll be playing around with some tuning options and pgbench to see what works best/try to quantify the performance impact there. But it’d be helpful to start at least in the neighborhood of the right answer, using the right technologies at least :haw:

Kinda wondering if database services might be the exception and maybe that’s a service that gets provided by the SAN server even if the applications run on something else.
I completely missed this post the first time around, but I wanna start out by saying that Klara are doing a series of articles on using plain FreeBSD for a NAS.

As for iSCSI and block devices, ctld(8) or iser(4) (the latter is use-case specific, as it requires a Mellanox NIC to offload on) both function as iSCSI daemons which can use ZFS volumes (basically, ZFS datasets that act as character devices/disks on FreeBSD) as targets/extents.
They get formatted with whatever file system gets used by the initiator (iSCSI terminology is confusing; the initiator and target mean the opposite of what most people initially think they mean).
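As a rough sketch of the ctld route on FreeBSD - the zvol name, IQN and addresses here are all made up, and ctl.conf(5) has the real details - you create a zvol, point a LUN at it in /etc/ctl.conf, and start the daemon:
code:
# create a 100G ZFS volume to use as the backing store
zfs create -V 100G tank/lun0

# /etc/ctl.conf (roughly the shape of the handbook example)
portal-group pg0 {
        discovery-auth-group no-authentication
        listen 10.0.0.1:3260
}

target iqn.2022-04.lan.example:target0 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/zvol/tank/lun0
        }
}

# enable and start ctld
sysrc ctld_enable=YES
service ctld start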

One important thing for iSCSI, though, is that irrespective of whether you're using a store-and-forward or cut-through switch (the latter tends to be what Infiniband switches do, because it reduces latency - but it's also more expensive), you absolutely need to ensure that you're doing at least 4096-byte jumbo frames, because a lot of modern filesystems will do native 4k sectors, and splitting a 4096-byte write across standard Ethernet frames of at most 1514 bytes is a recipe for a terrible experience.
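On FreeBSD that's just an MTU bump on whatever interface faces the storage network (the interface name here is only an example):
code:
ifconfig mce0 mtu 9000
# persist it in /etc/rc.conf, e.g.:
# ifconfig_mce0="inet 10.0.0.2/24 mtu 9000"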

Also, when you're dealing with iSCSI, you need to handle synchronous writes properly, which means a pair of small (on the order of 10-20GB, possibly under-provisioned to get as much write endurance as possible) capacitor-backed SLC or MLC flash SSDs - because while ZFS itself won't mind losing a log device nowadays (it used to not like it very much, at all), anything using the iSCSI targets will lose up to 5 seconds of data which it assumed was flushed to disk, and which is therefore irrevocably gone forever.

Databases and NFS also rely on synchronous writes though, so any server that's good for iSCSI will also be good for the other two.

Having said all that, I think it's important that you realize you have to be able to maintain it, because I don't wanna be responsible for giving support if things break.

Motronic posted:

My setup is a TrueNAS with 10GbE to my switch, ESXi with 10GbE to my switch, and the second port of each 10GbE card connected directly together with a DAC on its own IP subnet.

In TrueNAS the iSCSI device is a file on the filesystem. At least the way I did it - not sure if there is a different way.

It's been working quite well. But I'm just a thread lurker/asking questions and certainly no expert on any of this. I have a sample size of exactly 1.
I would highly encourage you to use device extents instead of file extents for your targets, since there's a whole bunch of filesystem/ZPL overhead that you can simply get rid of that way.
It should be as easy as dding your file onto a ZFS volume and switching over the extent from using the file to using the device.
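Something along these lines - the paths and size are made up, and the zvol has to be at least as large as the file extent:
code:
# create a zvol to replace the file-backed extent
zfs create -V 500G tank/extent0
# copy the existing file extent onto it
dd if=/mnt/tank/iscsi/extent0 of=/dev/zvol/tank/extent0 bs=1M status=progress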

BlankSystemDaemon fucked around with this message at 21:07 on Apr 19, 2022

Motronic
Nov 6, 2009

BlankSystemDaemon posted:

I would highly encourage you to use device extents instead of file extents for your targets, since there's a whole bunch of filesystem/ZPL overhead that you can simply get rid of that way.
It should be as easy as dding your file onto a ZFS volume and switching over the extent from using the file to using the device.

That makes sense and is what I'm used to on commercial NAS/SAN offerings. I didn't have an entire pool I wanted to dedicate to it at the time and I'm still not sure if that's the way I really want to go or if I'm going back to local storage for the ESX hosts.

It's given better performance than the SSD they were on locally before and it doesn't seem to make a dent in TrueNAS resources. This is post-10GbE install (thanks again for the pointer on those NICs). Just for redundancy I'm leaning towards keeping them on the NAS except for maybe backup VMs that are going to be backing up to a local disk in the ESX host and maybe a backup DNS resolver.

Chilled Milk
Jun 22, 2003

No one here is alone,
satellites in every home

necrobobsledder posted:

Wait, you can't use hard links across file systems. How are you able to do this without some crazy ARC type setup with performance policies? Not being 100% sure whether ZFS datasets within a pool share inodes is the reason I still do symbolic links.

What I meant was "I wonder if there's some combination of file system setup and/or application config that would let me have my cake and eat it too so I can do this thing I'm describing"

Which, just thinking about it again: just set up a dataset on the SSD pool, download to that, and upon completion, move to a downloads spot on my main pool. Duh

Klyith
Aug 3, 2007

GBS Pledge Week
It appears that the standard way to avoid :pcgaming1: causing excessive fragmentation on ZFS is to make a subvolume for downloads that disables copy on write (and probably snapshots while you're at it).

That way it behaves like a standard filesystem for the purposes of downloads -- download program pre-allocates the file, FS assigns space rationally, tiny 16k writes just get added to the end instead of causing full block copies.


aww poo poo I was looking at a thing where people were talking about both btrfs and zfs, btrfs is the one where you can disable CoW

Klyith fucked around with this message at 16:30 on Apr 20, 2022

CopperHound
Feb 14, 2012

Is disabling CoW on a dataset even possible?

withoutclass
Nov 6, 2007

Resist the siren call of rhinocerosness

College Slice
This has got me thinking of actually having the temp download directory on my SSD jaildisk and then letting the programs move their files when completed.

BlankSystemDaemon
Mar 13, 2009



Klyith posted:

It appears that the standard way to avoid :pcgaming1: causing excessive fragmentation on ZFS is to make a subvolume for downloads that disables copy on write (and probably snapshots while you're at it).

That way it behaves like a standard filesystem for the purposes of downloads -- download program pre-allocates the file, FS assigns space rationally, tiny 16k writes just get added to the end instead of causing full block copies.
What is fragmentation on a copy-on-write filesystem? If you move a record, the filesystem is no longer copy-on-write.
Also, how are you measuring that fragmentation? ZFS doesn't let you do it; it can only give you the free space fragmentation, which is an indicator of how much of the free space on disk can't be made into contiguous blocks of whatever your recordsize is set to (by default 128kB).

The problem isn't ZFS, it's that there isn't a standard for how torrent clients should write the data.
Some will insist on synchronously writing individual UDP-packet-sized segments of torrent pieces (pieces can vary in size from 32kB to 16MB) to disk, then rewrite that data synchronously once they have the full pieces (and if it fails the checksum, rewrite it again), then rewrite the full file synchronously, and possibly even rewrite the full torrent synchronously one final time.
Other clients will store individual blocks in a memory-mapped buffer then write the entire torrent piece to disk synchronously before rewriting the file asynchronously once completed.
Still other clients will download everything in the torrent into a memory-mapped segment and only write everything asynchronously once a complete file has been completely written to the memory-mapped segment and checksummed.

Which one is the objectively best one, irrespective of filesystem?
The last one is best for ZFS because it has a dirty data buffer for asynchronous data, and can thus completely avoid fragmentation if you do it that way - but it's also the way that's least commonly used, because most developers think filesystems aren't their problem, and users apparently hate their memory being used and would rather have memory that's free, thereby wasting electricity.

As a result, the correct recommendation for someone wanting to just download torrents onto ZFS using any of at least a dozen different clients, is to use a dataset with the sync, checksum and primarycache properties set to disabled as the download directory, and hope that the client can at least figure out how to move files contiguously once it's completed.
This last should be immensely improved by the Block Reference Table, which should hopefully also be included in OpenZFS 3.0.

Here's a video on it by the FreeBSD developer who's working on it, from the OpenZFS Developer Summit 2020:
https://www.youtube.com/watch?v=hYBgoaQC-vo

CopperHound posted:

Is disabling CoW on a dataset even possible?
No, it's an inherent property of ZFS, BTRFS, APFS, and basically everything that isn't a clone of a filesystem designed back in 1980 (FFS/UFS, which is still also in FreeBSD).

BlankSystemDaemon fucked around with this message at 16:59 on Apr 20, 2022

Keito
Jul 21, 2005

WHAT DO I CHOOSE ?

BlankSystemDaemon posted:

No, it's an inherent property of ZFS, BTRFS, APFS, and basically everything that isn't a clone of a filesystem designed back in 1980 (FFS/UFS, which is still also in FreeBSD).

Btrfs actually allows you to disable COW with the mount option nodatacow, see btrfs(5). This comes with the major caveat that you disable checksumming and compression, which is very important to consider alongside this note from the top of the section:

btrfs(5) posted:

Most mount options apply to the whole filesystem and only options in the first mounted subvolume will take effect. This is due to lack of implementation and may change in the future. This means that (for example) you can’t set per-subvolume nodatacow, nodatasum, or compress using mount options.

It's as if they're trying their best to mess up their users' data.
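For completeness, that usage looks like this (device and mountpoint are hypothetical), and per the quoted note it applies to the whole filesystem rather than a single subvolume:
code:
mount -o nodatacow /dev/sdb1 /mnt/downloads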

BlankSystemDaemon
Mar 13, 2009



Keito posted:

Btrfs actually allows you to disable COW with the mount option nodatacow, see btrfs(5). This comes with the major caveat that you disable checksumming and compression, which is very important to consider alongside this note from the top of the section:

It's as if they're trying their best to mess up their users' data.
I'm not sure how that's better than simply disabling synchronous writes completely, thereby forcing the dirty data buffer to be used.

I do wonder how many people are involved with BTRFS. One of the things I like about OpenZFS is that the developer summits are places where ideas get discussed among the developers - because even if someone has what appears to them to be a great idea, it's usually a bad idea if other developers can poke holes in it.

Klyith
Aug 3, 2007

GBS Pledge Week

BlankSystemDaemon posted:

What is fragmentation on a copy-on-write filesystem? If you move a record, the filesystem is no longer copy-on-write.
Also, how are you measuring that fragmentation? ZFS doesn't let you do it, it can only give you the free space fragmentation, which is an indicator of how fragmented all the non-contiguous records that remain are.

Yeah, at first I was going to say "is this even a problem anyone needs to worry about?" before I got distracted by the copy-on-write thing, which turned out to be btrfs. My bad. I guess the one actual ZFS thing is that if you have turned on automatic snapshots, you could exclude a downloads folder from that.


Free space fragmentation, it seems like the main problem is that people on the internet don't have good guidelines for what % fragmentation to worry about. Like "50% fragmentation" would be bad on a normal FS but is totally normal measuring free space on ZFS. It's a bad thing if the FS has to do a lot of shuffling to write files, but when does that happen in practice?

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

BlankSystemDaemon posted:

As a result, the correct recommendation for someone wanting to just download torrents onto ZFS using any of at least a dozen different clients, is to use a dataset with the sync, checksum and primarycache properties set to disabled as the download directory, and hope that the client can at least figure out how to move files contiguously once it's completed.

I mean, the better recommendation is just to buy an Optane drive (or better, a pair) and do SLOG. You can literally buy an unused surplus 16GB M.2 Optane stick for $12 on eBay and that's plenty both in terms of size and performance. Optane doesn't really suffer from the write cache problem with flash (you're fine even without battery backup because writes are de-facto instant; there is never any latency for block erase or other busywork that is necessary with flash), in consumer terms they have effectively infinite life even with heavy writes (they are designed as cache drives from the outset), and the latency is an order of magnitude better than flash. They are ideal for this workload, and the lovely 16GB Optane sticks (Optane M10/H10 are the models) are dirt cheap because they were mass produced for a while for Optane HDD caching but then flopped and nobody knows what to do with a 16GB SSD.

Do check what size you’re getting, there are 22110 sticks that are longer than the popular 2280 size, but either way silverstone sells an adapter card that will put a M.2 card into a PCIe slot in less than a 1U height (so you can basically put it into any slot that isn’t physically obstructed). They actually sell a couple models, one only does 2280 but they also sell one that does 22110, so check your clearance in your PC and the size of your drive, but they’re only about $20 for the PCIe adapter if you need it. Or if you don’t have height problems the regular cards are like $5.

I think the fragmentation they’re talking about isn’t of the dataset per se, but of the extents in the pool - small non-async writes will fragment the contiguous free space over time, so even a big sequential transfer later can be slowed by having to write a bunch of smaller extents instead of allocating one big one. It’s a noted effect with databases on ZFS for example, and the fix is either you turn async to always on those datasets, or you run slog (which is effectively the same thing but safer/a different set of risks).

Paul MaudDib fucked around with this message at 18:13 on Apr 20, 2022

BlankSystemDaemon
Mar 13, 2009



Klyith posted:

Yeah, at first I was going to say "is this even a problem anyone needs to worry about?" before I got distracted by the copy-on-write thing, which turned out to be btrfs. My bad. I guess the one actual ZFS thing is that if you have turned on automatic snapshots, you could exclude a downloads folder from that.


Free space fragmentation, it seems like the main problem is that people on the internet don't have good guidelines for what % fragmentation to worry about. Like "50% fragmentation" would be bad on a normal FS but is totally normal measuring free space on ZFS. It's a bad thing if the FS has to do a lot of shuffling to write files, but when does that happen in practice?
Free space fragmentation doesn't matter until you get above 80% capacity, and its main effect is to make ZFS do more computation in order to write records in a way that makes sense in space that won't fit an entire record contiguously.
Even at 90% capacity on a relatively modern pool that was on my old server, I noticed no write penalties that caused me to go under ~116MBps using NFS over TCP.
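If you want to see the number being talked about, it's the FRAG column in zpool list, aka the fragmentation pool property (the pool name here is an example):
code:
zpool list -o name,size,capacity,fragmentation tank
zpool get fragmentation tank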

I believe Windows will do background de-fragmentation nowadays if you're on spinning rust but who uses spinning rust for Windows? I have no idea about Linux using extN.


CopperHound
Feb 14, 2012

Paul MaudDib posted:

I mean, the better recommendation is just to buy an Optane drive (or better, a pair) and do SLOG.
Idk, but I would guess it's probably fine to use asynchronous writes for torrents, which would completely negate any SLOG benefit.

Now L2ARC for seeding on the other hand? There is probably some merit.
