Keito
Jul 21, 2005

WHAT DO I CHOOSE ?

Klyith posted:

Two, ZFS and/or TrueNAS or whichever distro set ZFS up. The ZFS advocates ITT give btrfs a lot of poo poo for having unstable, integrity-not-guaranteed features that can be turned on. If ZFS is critically dependent on ECC memory, that feature to do memory checksumming should be way more exposed so that anyone who doesn't have ECC will turn it on.
ZFS isn't more dependent on ECC RAM than any other file system, it's just that ZFS nerds generally care more about data integrity.


Yaoi Gagarin
Feb 20, 2014

Yeah, the difference is that if you were using ext4 you still would have ended up with hundreds of GB of bad data on disk, but you wouldn't even have the console logs - just the random crash.

Also, can't zfs be made to send an email or something when it detects a bad block? I'm pretty sure truenas has a feature like that

Combat Pretzel
Jun 23, 2004

No, seriously... what kurds?!
Probably some service/script that regularly checks for changes in the zpool status output.
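
For the curious, a minimal sketch of such a checker in Python - the pool name, recipient, and the assumption of a local MTA listening on localhost are all made up for the example; OpenZFS's own zed(8) daemon is the more polished way to get mail on pool events:

code:
#!/usr/bin/env python3
# Toy zpool health watcher: polls `zpool status -x` and mails on changes.
# A sketch, not how TrueNAS actually does it.
import smtplib
import subprocess
import time
from email.message import EmailMessage

POOL, RECIPIENT, INTERVAL = "tank", "you@example.com", 300  # assumptions

def pool_status() -> str:
    # `zpool status -x` prints a short "is healthy" line when all is well,
    # and the full status of the pool when it isn't.
    return subprocess.run(["zpool", "status", "-x", POOL],
                          capture_output=True, text=True).stdout

def send_alert(body: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"zpool {POOL}: problem detected"
    msg["From"], msg["To"] = "nas@localhost", RECIPIENT
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA is running
        smtp.send_message(msg)

last = ""
while True:
    out = pool_status()
    if "healthy" not in out and out != last:  # only mail when status changes
        send_alert(out)
    last = out
    time.sleep(INTERVAL)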

BlankSystemDaemon
Mar 13, 2009




necrobobsledder posted:

ECC is a bit of a double-edged sword. While certain error patterns are automatically corrected, others that are uncorrectable will produce a hard fault and cause the OS to straight up panic and reboot. Not going to forget the $55k+ system that was purple-screening randomly in ESXi because, in the end, it turned out to be a faulty RDIMM. It is a requirement for systems that need strong guarantees that they will not write faulty data - where it's better to crash and stop than to write even a single incorrect bit. As such, at scale, simply dropping the whole server when RAM starts to fail is the right call, while some of the home machines I've had would crash with uncorrelatable errors and leave no record. With ECC I can at least count the RAM errors right in the BIOS and quickly diagnose that there's a faulty module.

These things all matter to me at home, too. I'm a cheapskate, though, and am probably going to bow out of ECC with my next NAS build. For BCM purposes I'm putting the savings toward a PiKVM or TinyPilot box (because I'm so not paying for that Spider IP KVM for home use). When Matt Ahrens and other ZFS devs basically say it's not that big of a deal to run ZFS without ECC, I'll move on.
Ideally what an uncorrectable error should do is generate a Non-Maskable Interrupt.

The advantage of sending an NMI is that the OS gets a Machine Check Exception (one class of NMI - there are others, like the ones defined by the Machine Check Architecture and the ones used for chassis intrusion and such); the MCE contains the physical address of the memory (though split into a lower and an upper 32 bits, so it's not easily human-parsable), and your kernel can use that, plus its knowledge of what's in which part of the VM subsystem, to decide what happens.
It can pick between the following: panicking the system, killing an application (without writing its contents to disk, i.e. SIGKILL as described in signal(3)), or ignoring the error and simply invalidating what was at that address (for example, if it's a read cache, if it's inactive memory, or if it was about to be invalidated anyway, like laundered memory in FreeBSD).
Even better, you can use tools to parse MCE entries and know exactly which memory DIMM is being annoying - though with systems that have multiple banks, it can take a bit of figuring to get the right one.
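
To illustrate the "lower and upper 32 bits" part: stitching the halves back into a usable physical address is trivial once you know that's what you're looking at (the register values below are invented for the example):

code:
# Reassembling a 64-bit physical address from the two 32-bit halves
# reported alongside an MCE (values here are made up).
addr_lo = 0x9D4F2000        # lower 32 bits, as reported
addr_hi = 0x00000001        # upper 32 bits, as reported

phys_addr = (addr_hi << 32) | addr_lo
print(f"faulting physical address: {phys_addr:#018x}")
# -> faulting physical address: 0x000000019d4f2000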

If ESXi is PSOD'ing, you should be examining the crashdumps to root-cause it.
It's usually very easy to tell when an uncorrectable error caused a BSOD, a PSOD, or a crashdump from Linux/the BSDs.

Computer viking posted:

It's worth noting that I'm in Norway, and something about being in the EEA and Schengen and whatever else is relevant, but not the EU, makes the cost of international shipping here really unpredictable.

I'll keep the Chelsio cards in mind, though.
Everything is more expensive in Norway, though.
A cucumber costs twice what it costs in Denmark, for example.

Klyith posted:

Two, ZFS and/or TrueNAS or whichever distro set ZFS up. The ZFS advocates ITT give btrfs a lot of poo poo for having unstable, integrity-not-guaranteed features that can be turned on. If ZFS is critically dependent on ECC memory, that feature to do memory checksumming should be way more exposed so that anyone who doesn't have ECC will turn it on.
To add to what Keito said: that option is hidden in a debug menu because the places where ZFS was initially developed, and the places where it's developed now, are all places that take system stability, data integrity, and everything of that nature very seriously - so the systems they're using all have ECC memory, and that's the assumption it's developed under.

Other filesystems handle the lack of ECC memory even worse than ZFS does - it does its best, to the point that you can enable checksumming in memory if you want (which none of the others can do), but no system is perfect.

VostokProgram posted:

Also, can't zfs be made to send an email or something when it detects a bad block? I'm pretty sure truenas has a feature like that
Well, syslog.conf(5) has an example of how to use syslog to send messages to users when a message of alert (or higher) severity is received - but you'll need to set up a forward(5) file and maybe configure an MTA if you want to receive it as mail.
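
The stanza in question looks something like this (a sketch in the style of the syslog.conf(5) example; the selector and target user are up to you):

code:
# /etc/syslog.conf - notify the user root of alert-and-higher messages
*.alert                                         root
The action field naming a user writes the message to that user's terminal; the forward(5)-plus-MTA setup mentioned above is what's needed to get it as actual mail instead.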

That being said, mail is not appropriate for notifications.
Mail daemons can go down or lose mail, and even if you've got extra mail exchangers and everything set up, it's still a lot more work than setting up proper monitoring and using that to generate push events to an app on your phone or something.

BlankSystemDaemon fucked around with this message at 22:21 on Jun 14, 2022

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost

Klyith posted:

Yeah, I don't really agree with necrobobsledder saying it's a double-edged sword, because "your PC might crash" is about as sharp as a butter knife.
Said like someone that hasn't had to deal with the wrath of a significant other during a local media service outage. I'm trying to figure out a storage caching mechanism for my Plex setup such that remote files are copied locally while playing, for this exact reason, but I haven't had the time to get such a caching mechanism set up yet (it seems like people have only come up with essentially using unionfs, and mergefs looks scary untested).

BlankSystemDaemon
Mar 13, 2009




necrobobsledder posted:

Said like someone that hasn't had to deal with the wrath of a significant other during a local media service outage. I'm trying to figure out a storage caching mechanism for my Plex setup such that remote files are copied locally while playing, for this exact reason, but I haven't had the time to get such a caching mechanism set up yet (it seems like people have only come up with essentially using unionfs, and mergefs looks scary untested).
Not having ECC doesn't make your system less likely to crash. ECC improves system stability, provided you're not so insistent on being cheap that you get screwed over by your own choices.
I.e. you need to ensure you're getting a motherboard that won't cause the OS to crashdump, but will generate an NMI as outlined above.

Also, the answer is NFSv4.2 - it has client-side caching and can be configured to not invalidate it if the server disappears and comes back within a limited amount of time.

BlankSystemDaemon fucked around with this message at 22:24 on Jun 14, 2022

Klyith
Aug 3, 2007

GBS Pledge Week

Keito posted:

ZFS isn't more dependent on ECC RAM than any other file system, it's just that ZFS nerds generally care more about data integrity.

VostokProgram posted:

Yeah, the difference is that if you were using ext4 you still would have ended up with hundreds of GB of bad data on disk, but you wouldn't even have the console logs - just the random crash.

Wouldn't the way that ZFS uses lots more memory than other file systems make it much more severely affected?

Most memory errors from bad RAM come from relatively small portions of the address space. The bad memory will only produce bad writes when the data is overlapping the bad ram zone. So like if ext4 is using 20mb of memory and ZFS is using 2GB, the ext4 data will be contaminated 1% as often.
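
The arithmetic behind that last figure, under the stated (and admittedly hand-wavy) assumptions of a fixed bad region and uniformly placed buffers:

code:
# Back-of-envelope check of the "1% as often" claim: with a fixed-size bad
# region somewhere in physical RAM, the odds of data landing on it scale
# roughly with how much memory the filesystem keeps resident.
ext4_resident = 20 * 2**20   # ~20 MB (assumed figure from the post)
zfs_resident  = 2 * 2**30    # ~2 GB (assumed figure from the post)

print(f"{ext4_resident / zfs_resident:.1%}")   # -> 1.0%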

priznat
Jul 7, 2009

Let's get drunk and kiss each other all night.
It'll be really neat if and when CXL memory pooling makes it to the enthusiast crowd: hot-pluggable E3.S drives filled with DRAM, making it really easy to increase your memory if you have the PCIe/CXL lanes for it.

That's probably quite a ways off though.

Zorak of Michigan
Jun 10, 2006


Klyith posted:

Wouldn't the way that ZFS uses lots more memory than other file systems make it much more severely affected?

Most memory errors from bad RAM come from relatively small portions of the address space. The bad memory will only produce bad writes when the data is overlapping the bad ram zone. So like if ext4 is using 20mb of memory and ZFS is using 2GB, the ext4 data will be contaminated 1% as often.

Most modern OSes will use most of the otherwise-free memory to cache I/O; ZFS just manages that cache more intensively. In a world where the average node has gigabytes of RAM, I don't think the difference is really significant. Now, ZFS with dedup enabled is a different story, but there the memory utilization goes up so much that everyone agrees you shouldn't use it at all.

EVIL Gibson
Mar 23, 2001

Internet of Things is just someone else's computer that people can't help attaching cameras and door locks to!
:vapes:
Switchblade Switcharoo
Your work/play system's memory is corrupting all the drat time. It's not even a "could be corrupting" - it definitely is.

EVIL Gibson fucked around with this message at 01:13 on Jun 15, 2022

BlankSystemDaemon
Mar 13, 2009




Klyith posted:

Wouldn't the way that ZFS uses lots more memory than other file systems make it much more severely affected?

Most memory errors from bad RAM come from relatively small portions of the address space. The bad memory will only produce bad writes when the data is overlapping the bad ram zone. So like if ext4 is using 20mb of memory and ZFS is using 2GB, the ext4 data will be contaminated 1% as often.
All modern OSes cache stuff in memory, so they're unlikely to have much memory free unless they're freshly booted or poorly configured/overprovisioned. Free memory is also wasted electricity, because the row strobes cost exactly the same irrespective of whether the memory is free or in use.
ZFS' adaptive replacement cache is what takes up the majority of the space, and like I mentioned previously (on the last page), if a read is corrupted in memory it doesn't matter much, because it's never going to be written back over the on-disk copy - ZFS is copy-on-write.

I don't know why you're inventing hypotheticals; it doesn't seem appropriate when you yourself linked research that strongly suggests ECC is effective to a much greater degree than you seem to be imagining.

BlankSystemDaemon fucked around with this message at 02:19 on Jun 15, 2022

Klyith
Aug 3, 2007

GBS Pledge Week

BlankSystemDaemon posted:

I don't know why you're inventing hypotheticals

I was asking a question. A simple "no, it doesn't work that way" would suffice. I knew that ZFS used a lot of memory, but thought it was doing something more important than just a ZFS-managed replacement for OS caching. Particularly with some people saying "I didn't use ECC and my ZFS died".

Reading more on the subject, there's a widely shared quote from a ZFS dev:

quote:

There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. If you use UFS, EXT, NTFS, btrfs, etc without ECC RAM, you are just as much at risk as if you used ZFS without ECC RAM.
And apparently a whole lot of the "ZFS needs ECC" stuff comes from one mod on the TrueNAS forums who is wrong but persistently spreads the opinion.


BlankSystemDaemon posted:

you yourself linked research that strongly suggests ECC is effective to a much greater degree than you seem to be imagining.

The research I linked shows that ECC is super-effective at preventing memory errors, but also that most of the problem it prevents is on *particular machines & memory sticks* - not cosmic rays or memory being inherently terrible at holding 1s and 0s. The mean number of events is not very meaningful in a study where about half of the systems had no errors at all.

There are two valid ways to read this:
1. ECC is great, it saves your bacon from bad ram. You never need to test your memory, because the only states are good or total failure.
2. ECC is mostly a way to avoid periodic testing & maintenance by spending $$$. A machine with basic RAM that memtests 100% clean is likely just as solid as one with ECC.

Nowhere did I say that ECC wasn't effective. Only that a DIYer who didn't spend the $300-500 for a good ECC platform doesn't need to be paranoid. Checking logs and running memtest a few times a year is probably adequate to keep data safe. Not having ECC and not doing maintenance is bad.

BlankSystemDaemon
Mar 13, 2009




Klyith posted:

I was asking a question. A simple "no, it doesn't work that way" would suffice. I knew that ZFS used a lot of memory, but thought it was doing something more important than just a ZFS-managed replacement for OS caching. Particularly with some people saying "I didn't use ECC and my ZFS died".

Reading more on the subject, there's a widely shared quote from a ZFS dev:

And apparently a whole lot of the "ZFS needs ECC" stuff comes from one mod on the TrueNAS forums who is wrong but persistently spreads the opinion.

The research I linked shows that ECC is super-effective at preventing memory errors, but also that most of the problem it prevents is on *particular machines & memory sticks* - not cosmic rays or memory being inherently terrible at holding 1s and 0s. The mean number of events is not very meaningful in a study where about half of the systems had no errors at all.

There are two valid ways to read this:
1. ECC is great, it saves your bacon from bad ram. You never need to test your memory, because the only states are good or total failure.
2. ECC is mostly a way to avoid periodic testing & maintenance by spending $$$. A machine with basic RAM that memtests 100% clean is likely just as solid as one with ECC.

Nowhere did I say that ECC wasn't effective. Only that a DIYer who didn't spend the $300-500 for a good ECC platform doesn't need to be paranoid. Checking logs and running memtest a few times a year is probably adequate to keep data safe. Not having ECC and not doing maintenance is bad.
The thing is, most OSes, including the Unix-likes, already have many kinds of caching - so adding one more isn't bad or good unless the particular implementation is bad or good - and the combination of a most-frequently-used list and a most-recently-used list (along with shadow lists that track what's recently been evicted from each) is one of the most efficient forms of caching.
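
As a rough illustration of that MRU/MFU-plus-ghost-lists idea, here's a toy sketch in Python - deliberately simplified, and very much not OpenZFS's actual ARC:

code:
from collections import OrderedDict

class ToyARC:
    """Toy ARC-flavored cache: a recency (MRU) list, a frequency (MFU) list,
    and ghost lists that remember recently evicted keys, so the cache can
    learn which side it should have kept more of. Assumes put() is only
    called after a miss."""

    def __init__(self, capacity):
        self.c = capacity
        self.p = capacity // 2           # target size of the recency side
        self.mru, self.mfu = OrderedDict(), OrderedDict()
        self.mru_ghost, self.mfu_ghost = OrderedDict(), OrderedDict()

    def get(self, key):
        if key in self.mru:              # second access: promote to MFU
            self.mfu[key] = self.mru.pop(key)
            return self.mfu[key]
        if key in self.mfu:
            self.mfu.move_to_end(key)    # refresh recency within MFU
            return self.mfu[key]
        return None                      # miss

    def put(self, key, value):
        # A ghost hit means we evicted this key too eagerly: nudge the
        # target size toward the side the key came from, then re-admit it.
        if key in self.mru_ghost:
            self.p = min(self.c, self.p + 1)
            del self.mru_ghost[key]
            self.mfu[key] = value
        elif key in self.mfu_ghost:
            self.p = max(0, self.p - 1)
            del self.mfu_ghost[key]
            self.mfu[key] = value
        else:
            self.mru[key] = value        # first sighting goes to MRU
        while len(self.mru) + len(self.mfu) > self.c:
            if len(self.mru) > self.p:
                k, _ = self.mru.popitem(last=False)  # evict oldest MRU entry
                self.mru_ghost[k] = None
            else:
                k, _ = self.mfu.popitem(last=False)  # evict oldest MFU entry
                self.mfu_ghost[k] = None
        for ghost in (self.mru_ghost, self.mfu_ghost):
            while len(ghost) > self.c:   # ghosts hold keys only; stay bounded
                ghost.popitem(last=False)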

FreeBSD is kind of an exception here, because it's got a unified buffer cache that effectively does what ZFS already does. This has in the past led to a few issues if you're using a system with many sources of memory pressure - but so far as I'm aware, the few remaining issues were fixed in 11.0-RELEASE and 12.0-RELEASE.
Incidentally, there's also rumours of someone working on a way of integrating ZFS into the unified buffer cache - so that's something I'm hoping to hear more about at the FreeBSD devsummit tomorrow.

People who say "I didn't use ECC and my ZFS died" don't ever seem to produce anything that can be rootcaused, so at the very least they can't epistemologically know that it had anything to do with ECC or the lack of it.
Nor is it possible for anyone to actually try and fix it if there are bugs in the code that's the cause of it.

If you're going to quote people, at least don't quote out of context. In particular there is a second paragraph: "I would simply say: if you love your data, use ECC RAM. Additionally, use a filesystem that checksums your data, such as ZFS." --Matt Ahrens.
The context is the 5-second dirty-data buffer for asynchronous writes, meaning you can mitigate it by either 1) using ECC, or 2) forcing all writes to be synchronous (sync=always on all datasets, which may slow down your writes a lot unless you also use a pair of mirrored NVMe log devices with high write endurance).
Or you can live with the fact that asynchronous I/O is asynchronous I/O and shouldn't be assumed to be safe until it's been written to disk.
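
For reference, the sync=always route is a pair of one-liners (the pool, dataset, and device names below are placeholders):

code:
# force synchronous semantics for every write to a dataset
zfs set sync=always tank/important

# optional: soak up the resulting sync-write latency with a mirrored SLOG
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1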

I'm not sure why you feel the need to include that tidbit about someone being wrong on the internet.

As for the study and what it says about particular memory sticks, why are you assuming that the memory sticks you have are better than the ones Google can source?

As for your valid ways, did you read the post of mine I linked to? Because it leaves out the option that you bought motherboard+memory with ECC support, without realizing that you're not going to get notified about DIMM errors from the firmware.
I'm not sure what you're going to get from checking logs. You need to set up monitoring at the very least, but even with that you're not gonna get told about issues with your memory without ECC.
Memtests every quarter are going to add weeks of downtime on a storage platform with plenty of memory, so I'm not sure how that's a good option - and in addition, it's a manual process, meaning people will put it off because they've got better things to do.

BlankSystemDaemon fucked around with this message at 22:47 on Jun 15, 2022

neurotech
Apr 22, 2004

Deep in my dreams and I still hear her callin'
If you're alone, I'll come home.

I just got a Samsung 970 EVO Plus SSD - is there any value in installing the "Samsung Magician" software?

BlankSystemDaemon
Mar 13, 2009




neurotech posted:

I just got a Samsung 970 EVO Plus SSD - is there any value in installing the "Samsung Magician" software?
If the disk has some special S.M.A.R.T or NVMe attributes exported that Samsung Magician knows about because they're vendor-defined, then yes. If not, then no.
The only way you can find out is probably to install it and check against something like gsmartcontrol or HDD guardian, which are open variants of tools to probe the same data, then monitor both/all for differences.
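
A sketch of one way to do that comparison from the open-tools side, using smartmontools' JSON output (needs smartctl 7.0 or newer; the device path is an assumption):

code:
#!/usr/bin/env python3
# Dump the standard NVMe health page as smartctl sees it; anything a vendor
# tool shows beyond these fields is likely vendor-defined.
import json
import subprocess

DEV = "/dev/nvme0"  # adjust to your drive

# smartctl encodes warnings in its exit status; don't treat nonzero as fatal
out = subprocess.run(["smartctl", "-a", "--json", DEV],
                     capture_output=True, text=True).stdout
health = json.loads(out).get("nvme_smart_health_information_log", {})
for field, value in sorted(health.items()):
    print(f"{field}: {value}")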

Tornhelm
Jul 26, 2008

neurotech posted:

I just got a Samsung 970 EVO Plus SSD - is there any value in installing the "Samsung Magician" software?

It's how you update the firmware on Samsung SSDs. Unless you're using something like Samsung Rapid (not even sure if that's still a thing anymore), you should probably install it at least long enough to do that.

Klyith
Aug 3, 2007

GBS Pledge Week

BlankSystemDaemon posted:

If you're going to quote people, at least don't quote out of context. In particular there is a second paragraph: "I would simply say: if you love your data, use ECC RAM. Additionally, use a filesystem that checksums your data, such as ZFS." --Matt Ahrens.
Ok, but this puts a price on loving your data. Should people who cannot afford a good ECC platform just not have data at all?

I left that part off not to weaken the case, but because it's a useless statement when talking about a cost-benefit question. Love that calls for ignoring the costs is for people and maybe pets. If you love your data enough to write it a blank check, well, that's kinda sad.

BlankSystemDaemon posted:

I'm not sure why you feel the need to include that tidbit about someone being wrong on the internet.
It is apparently the source of many of the ZFS-ECC untruths that I was seeing & deluded by while looking into the subject. Then you accused me of making up hypotheticals. Thanks, btw, hugs to you too.


BlankSystemDaemon posted:

As for the study and what it says about particular memory sticks, why are you assuming that the memory sticks you have are better than the ones Google can source?
Memory is pretty much memory. The silicon is made by 3 companies, and the chips are no different between ECC and normal modules - an ECC stick just carries an extra one for the check bits. My desktop memory is no better than Google's, but it's also not worse, other than not having ECC.

(Google, the company who made "buy cheap generic commodity hardware and use it for servers" into a thing, is probably not buying super-special memory. Especially at the date that study happened -- heck their higher rate of observed errors on DDR1 platforms was probably because they were buying worse stuff than the average enthusiast.)

BlankSystemDaemon posted:

As for your valid ways, did you read the post of mine I linked to? Because it leaves out the option that you bought motherboard+memory with ECC support, without realizing that you're not going to get notified about DIMM errors from the firmware.
Yeah, in fact that was the main post that made me take interest. Because for the DIYer building home NAS, that's a price increase much larger than just buying ECC modules to put on a cheap Ryzen platform. The cost of moving up to an asrock rack or other server board is non-trivial (or involves an awful lot of ebay time).

That information made me look into more about exactly how much value ECC has, both for NAS and in general. I got some bad info from the internet about ZFS, which you've corrected me on. But in general, reading that google paper and thinking about why their results were the way they were, makes me think that regular memory is not a terrible risk if tested regularly.

BlankSystemDaemon posted:

Memtests every quarter are going to add weeks of downtime on a storage platform with plenty of memory, so I'm not sure how that's a good option - and in addition, it's a manual process, meaning people will put it off because they've got better things to do.
Memtest is an overnight run even on 32-64GB of memory, which is not a major downtime inconvenience for a home NAS. And if your home server has 128GB or more I'd assume it was ECC anyways because this is clearly a :homebrew: situation.


It's true that some people don't do maintenance, and I guess those people don't love their data.

neurotech
Apr 22, 2004

Deep in my dreams and I still hear her callin'
If you're alone, I'll come home.

BlankSystemDaemon posted:

If the disk has some special S.M.A.R.T or NVMe attributes exported that Samsung Magician knows about because they're vendor-defined, then yes. If not, then no.
The only way you can find out is probably to install it and check against something like gsmartcontrol or HDD guardian, which are open variants of tools to probe the same data, then monitor both/all for differences.

Tornhelm posted:

It's how you update the firmware on Samsung SSDs. Unless you're using something like Samsung Rapid (not even sure if that's still a thing anymore), you should probably install it at least long enough to do that.

Thanks for your help :)

BlankSystemDaemon
Mar 13, 2009




Klyith posted:

Ok, but this puts a price on loving your data. Should people who cannot afford a good ECC platform just not have data at all?

I left that part off not to weaken the case, but because it's a useless statement when talking about a cost-benefit question. Love that calls for ignoring the costs is for people and maybe pets. If you love your data enough to write it a blank check, well, that's kinda sad.

It is apparently the source of many of the ZFS-ECC untruths that I was seeing & deluded by while looking into the subject. Then you accused me of making up hypotheticals. Thanks, btw, hugs to you too.

Memory is pretty much memory. The silicon is made by 3 companies, and the chips are no different between ECC and normal modules - an ECC stick just carries an extra one for the check bits. My desktop memory is no better than Google's, but it's also not worse, other than not having ECC.

(Google, the company who made "buy cheap generic commodity hardware and use it for servers" into a thing, is probably not buying super-special memory. Especially at the date that study happened -- heck their higher rate of observed errors on DDR1 platforms was probably because they were buying worse stuff than the average enthusiast.)

Yeah, in fact that was the main post that made me take interest. Because for the DIYer building home NAS, that's a price increase much larger than just buying ECC modules to put on a cheap Ryzen platform. The cost of moving up to an asrock rack or other server board is non-trivial (or involves an awful lot of ebay time).

That information made me look into more about exactly how much value ECC has, both for NAS and in general. I got some bad info from the internet about ZFS, which you've corrected me on. But in general, reading that google paper and thinking about why their results were the way they were, makes me think that regular memory is not a terrible risk if tested regularly.

Memtest is an overnight run even on 32-64GB of memory, which is not a major downtime inconvenience for a home NAS. And if your home server has 128GB or more I'd assume it was ECC anyways because this is clearly a :homebrew: situation.


It's true that some people don't do maintenance, and I guess those people don't love their data.
I'm unemployed because my health is hosed, and on the lowest form of benefits in Denmark - and I've managed to scrape together enough money to get not just one but several servers that use ECC and have some form of validation of the ECC behaviour.
All but one are used, and the one that isn't was bought at a ridiculously low price because I'd kept an eye out for it for several years.
You absolutely do not need to spend a lot of money on ECC, you just need to be smart about shopping.

:glomp:

That was what I was getting at; with how common memory with persistent errors is (especially the type that develops after a little while and can't easily be provoked with the first burn-in), wouldn't it be nice to know about it?

Again, see the initial part of this post; it doesn't have to be expensive, you just have to be smart.

Do you know why regular PCs don't have ECC memory? The CPU itself has it (check with dmidecode or any tool like it; you'll see that the L1, L2, and L3 caches, if present, all use ECC).
PCs don't have ECC memory because the PC clone companies left it out. The original IBM PC shipped with parity-checked memory, because IBM had been designing mainframes with not just ECC, but ECC capable of correcting multibit errors, as well as memory mirroring/distributed parity with sparing.
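
Along the lines of what's described above - the sample output here is invented for a typical non-ECC desktop and will vary per machine, but this is roughly where dmidecode reports it:

code:
$ sudo dmidecode -t cache | grep 'Error Correction Type'
        Error Correction Type: Single-bit ECC
        Error Correction Type: Single-bit ECC
        Error Correction Type: Multi-bit ECC
$ sudo dmidecode -t memory | grep 'Error Correction Type'
        Error Correction Type: None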

Three-pass memtest86+ is generally regarded as a good amount of testing, and it takes a little more time than will fit into an overnight run.
More importantly, you can't do it without rebooting the computer, meaning it can't be automated (leastwise, not without something out-of-band, at which point you might as well have gone out of your way to get ECC) - so it will not get done.

Klyith
Aug 3, 2007

GBS Pledge Week

BlankSystemDaemon posted:

Three-pass memtest86+ is generally regarded as a good amount of testing, and it takes a little more time than will fit into an overnight run.
More importantly, you can't do it without rebooting the computer, meaning it can't be automated (leastwise, not without something out-of-band, at which point you might as well have gone out of your way to get ECC) - so it will not get done.

Memtest86+ is incredibly out of date; it's both slow and ineffective compared to the commercial MemTest86. Seriously, don't use it. Get the free version of MemTest86 - it will complete a more intensive 4-pass test quicker than 86+ will do 3 passes, and in some cases it will find errors that 86+ can't. No matter how much of an open-source fanatic someone is, in this situation using the worse software is cutting off your nose to spite your face.


In the home user context, plugging in a USB stick and rebooting a machine is not difficult. Obviously this is a non-starter for enterprise, but I've only been talking about it in the context of DIY. As I say, anyone who can't put some manual tasks in their calendar doesn't seem like they love their data very much. You need to test your backups periodically too, for most people that's not an automated background thing.

Keito
Jul 21, 2005

WHAT DO I CHOOSE ?
I'll just use ECC instead of all this hassle, thanks.

Comatoast
Aug 1, 2003

by Fluffdaddy
If you've got a tested and long-lasting backup solution, and a filesystem that does checksumming, then I can't see the need for ECC at home. My NAS runs on a commodity HP shitbox that cost $100, with a single drive using ZFS. Presumably, if there is ever a problem, it will be caught in the monthly ZFS scrub.
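
A monthly scrub is typically just a cron entry or the distro's stock periodic/systemd-timer job - something like this, with the pool name and binary path as placeholders:

code:
# root's crontab: start a scrub at 03:00 on the 1st of every month
0 3 1 * * /sbin/zpool scrub tank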

I know it's not perfect, but ECC doesn't make for a perfect system either. I'd rather keep the money and teach myself to not worry.

Comatoast fucked around with this message at 17:22 on Jun 16, 2022

CopperHound
Feb 14, 2012

File systems don't know that you are writing corrupt data from memory.

movax
Aug 30, 2008

Just to sidestep ECC chat for a second, I'm actually new to having NVMe drive(s) on my desktop (have only had them in my laptops, have been running a 840 / 850 SATA SSD to date) and was playing around with them yesterday.

Samsung PM1735, PM1733 and a SK Hynix PE6011 -- all 'enterprise' NVMe drives. First one is PCIe 4.0 x8, second is PCIe 4.0 x4 and last is PCIe 3.0 x4. I played around using Disk2Vhd to make a VHD of my old SSDs (256 and 512 GB in size) and copying the files between the NVMe drives to see what kind of stupid speeds I would get, in copying single large contiguous files.

They all capped at almost exactly 1.8GB/s... is this expected / am I missing a fundamental limit somewhere? As 'enterprise' drives I'd imagine they'd also be able to maintain that speed with lots of other IO going on, so I tried copying bunches of little files from drive to drive simultaneously, thinking that speed would stay consistent -- it did not. CrystalDiskInfo confirmed they are all linked up at the speeds I'd expect.

Also, I've flipped on the 'write caching' option in Device Manager in the past for my desktop SSDs/storage drives (it's on a UPS), but these 3 drives won't let me -- is that a specific driver thing where I need to hunt down drivers for these on some weird *.ru website?

Klyith
Aug 3, 2007

GBS Pledge Week

Keito posted:

I'll just use ECC instead of all this hassle, thanks.

It's a totally justifiable expense for those who want a higher guarantee of safety and less maintenance. Not saying otherwise!

CopperHound posted:

File systems don't know that you are writing corrupt data from memory.

Bad memory that produces regular errors will also make scrubs & checksums fail, independently of invisibly corrupt writes. Say you took a zfs pool that was 100% good and then swapped in bad memory on the machine: it'd start detecting spurious errors. Not because anything is wrong with the data on the drives. Data read from the drives into memory will fail checksum calculations when it lands in the bad address zone.

If your usage pattern on the NAS is heavy on reads and light on writes (a mainly-media-server / Plex box), or a large majority of the data is unimportant (torrents), that type of spurious error is statistically far more likely than corruption of important data. Reads and read caching will be using far more of the memory than important writes.

I would hope that this would at least produce a lot of logged errors. I don't know if it would trigger zfs into calling the pool degraded or faulted -- a second read of the same data would almost certainly not have errors, so they'd be cleared?
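
A tiny demonstration of that failure mode - SHA-256 here stands in for ZFS's block checksums (fletcher4 by default), and the flipped bit stands in for the bad address zone:

code:
import hashlib

# Good data is written and checksummed; later, the copy read back into RAM
# picks up a bit flip. The checksum mismatch is real; the on-disk data is fine.
block = bytearray(b"good data, correctly written and stored on disk")
on_disk_checksum = hashlib.sha256(block).hexdigest()

block[7] ^= 0x01  # simulate one stuck bit in the bad RAM region

assert hashlib.sha256(block).hexdigest() != on_disk_checksum
print("spurious checksum error: RAM corrupted the copy, not the disk")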

Klyith
Aug 3, 2007

GBS Pledge Week

movax posted:

Samsung PM1735, PM1733 and a SK Hynix PE6011 -- all 'enterprise' NVMe drives.

Samsung are very well known for having special drivers for their server / enterprise NVMe drives.

I have mucked with a much older PM9xx drive, where there were some of the real Samsung drivers available. That one didn't have a huge difference between the Windows generic NVMe driver and the Samsung driver, but it was a much slower drive in the first place. I could see the new drives needing the real driver.


movax posted:

Also, I've flipped on the 'write caching' option in Device Manager in the past for my desktop SSDs/storage drives (it's on a UPS), but these 3 drives won't let me -- is that a specific driver thing where I need to hunt down drivers for these on some weird *.ru website?

Some enterprise drives disable that control in Device Manager because they report to the OS that they don't have the option, and Windows reads that as "you can't turn caching on". What's actually happening is that they're self-managed and won't turn caching off - they have nice big capacitor banks on the drive, enough to write their whole cache to flash on power loss.

If you had the real Samsung drivers, they wouldn't even show the option in the panel.

Combat Pretzel
Jun 23, 2004

No, seriously... what kurds?!
Neat, going from a Xeon 1230V2 4C4T to a Ryzen 5750G 8C16T dropped my idle power usage from 42W to 34W :v

--edit: And back up to 39W, because I had to add some more airflow to keep this X570 in check.

Combat Pretzel fucked around with this message at 18:11 on Jun 22, 2022

Korean Boomhauer
Sep 4, 2008
Alright! I think I have some parts picked out. I kinda had to piece things together from random google searches. I think I went a little overboard on the CPU but I decided im gonna host most of the internal services on that rather than the beelink. Feel free to dumpster on this, I really have no idea what I'm doing or if this is going to work lol. I hope its okay to post this here:

CPU: Ryzen 5750g PRO
RAM: Kingston KSM32ED8/32ME DDR4 3200MHz ECC DIMM
Motherboard: asrock X570 Phantom Gaming-ITX/TB3
Case: Jonsbo n1
PSU: Corsair SF750
DRIVES: 5x 16tb WD Gold drives whenever they go on sale again
Some kind of SATA expansion board

BlankSystemDaemon
Mar 13, 2009




Korean Boomhauer posted:

Alright! I think I have some parts picked out. I kinda had to piece things together from random google searches. I think I went a little overboard on the CPU but I decided im gonna host most of the internal services on that rather than the beelink. Feel free to dumpster on this, I really have no idea what I'm doing or if this is going to work lol. I hope its okay to post this here:

CPU: Ryzen 5750g PRO
RAM: Kingston KSM32ED8/32ME DDR4 3200MHz ECC DIMM
Motherboard: asrock X570 Phantom Gaming-ITX/TB3
Case: Jonsbo n1
PSU: Corsair SF750
DRIVES: 5x 16tb WD Gold drives whenever they go on sale again
Some kind of SATA expansion board
If you want ECC memory for system stability and all the other reasons for using it, you should really consider using a board that's known to generate non-maskable interrupts on ECC errors.

Do you know if the chassis will fit a deep Mini-ITX motherboard?
If so, this should be a good option, and if not you can go with this - although the latter will require SO-DIMM ECC memory.

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb

BlankSystemDaemon posted:

If you want ECC memory for system stability and all the other reasons for using it, you should really consider using a board that's known to generate non-maskable interrupts on ECC errors.

Do you know if the chassis will fit a deep Mini-ITX motherboard?
If so, this should be a good option, and if not you can go with this - although the latter will require SO-DIMM ECC memory.

The ASRock Rack also has IPMI which is handy, giving you a web based "always on" interface to manage some things remotely, like getting to the BIOS during bootup. Those motherboards also eliminate your need for a SATA expansion board.

That Jonsbo n1 looks nice but I wonder about the drive temperatures, that's definitely something I'd be keeping an eye on during your initial testing. Especially once you get a GPU in there pumping out more heat. From the pics it doesn't seem like the Jonsbo n1 would support a deep Mini-ITX.

withoutclass
Nov 6, 2007

Resist the siren call of rhinocerosness

College Slice
+1 for IPMI. It's saved me enough times that I wouldn't consider a NAS board without it, especially since I don't have a compatible monitor and would otherwise have to lug one down to the basement to fix something.

Korean Boomhauer
Sep 4, 2008

BlankSystemDaemon posted:

If you want ECC memory for system stability and all the other reasons for using it, you should really consider using a board that's known to generate non-maskable interrupts on ECC errors.

Do you know if the chassis will fit a deep Mini-ITX motherboard?
If so, this should be a good option, and if not you can go with this - although the latter will require SO-DIMM ECC memory.

I think you linked the same one twice.

also oh poo poo, IPMI. I for sure wanna get in on that. I do have a monitor I could use, but it would be super nice to not have to use it.

Combat Pretzel
Jun 23, 2004

No, seriously... what kurds?!
The X570D4U (--edit: yeah U not I) apparently supports ECC error detection and reporting, based on feedback I've read online.

I actually recently acquired an X570D4U with a 5750G and two of these exact Kingston DIMMs, but I haven't dicked around with purposely overclocking the RAM past spec to force the issue.

And yeah, it has IPMI, too, which is nice.

--edit: Never mind, the D4U is too large for your case. I have my NAS in a Fractal Design Define 7 XL.

Combat Pretzel fucked around with this message at 17:14 on Jun 30, 2022

Korean Boomhauer
Sep 4, 2008
Dang. I was hoping to avoid having to get a gigantic case for a NAS. I don't exactly have a lot of space for a huge computer case at the moment.

EVIL Gibson
Mar 23, 2001

Internet of Things is just someone else's computer that people can't help attaching cameras and door locks to!
:vapes:
Switchblade Switcharoo
To be honest, I've never seen an opinion as strange as requiring ECC on all clients for ECC to make sense on the server.

movax
Aug 30, 2008

Korean Boomhauer posted:

Dang. I was hoping to avoid having to get a gigantic case for a NAS. I don't exactly have a lot of space for a huge computer case at the moment.

Node 804 is solid, if you can do mATX.

Korean Boomhauer
Sep 4, 2008

movax posted:

Node 804 is solid, if you can do mATX.

I was actually looking at the 304 too, and if it can hold a deep Mini-ITX board, that would be fine as well. I could also make a Node 804 work, but I'd have to move a lot of stuff around first.

Keito
Jul 21, 2005

WHAT DO I CHOOSE ?
I built my NAS last year using an X570D4I-2T board. It's been fantastic once up and running, but potential buyers need to be aware that these boards come with idiosyncrasies, like requiring an LGA115x CPU cooler despite being an AM4 board, or using OCuLink for hooking up the HDDs. Better read the fine print if you're planning an ASRock Rack system.

The Node 804, however, I would really not recommend, as it's like an acoustic resonance superconductor of hard drive noise, and the hard drive cages are really awkward to work with too. It's the part of my NAS that I'm least happy with, for sure.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
My Node 804 is in my basement and I can't hear it over my dehumidifier running near it. Once I'm solely on the Supermicro CSE836 I might have a chance to hear what those hard drives sound like.


fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb

withoutclass posted:

+1 for IPMI. It's saved me enough times that I wouldn't consider a NAS board without it, especially since I don't have a compatible monitor and would otherwise have to lug one down to the basement to fix something.

I was describing how handy IPMI is to my friend, and he asked a fair question: how often are you really going into the BIOS that it even matters?

It's not very often at all that I have to use IPMI, but despite the infrequent use, it's just so handy when you do need it. Keeping the computer janitoring at home as painless and hassle-free as possible goes a long way when you have to do very similar tasks for your day job. I find it's well worth the additional up-front investment, especially since I'll probably end up using the motherboard for 10+ years at home.

Also, the Node 804 makes an excellent NAS case!

  • Reply