Rescue Toaster
Mar 13, 2003

Munkeymon posted:

I just found https://github.com/davestephens/ansible-nas the other night and will be giving it a try this time around

Thanks, there's some interesting stuff in here. Though I know literally nothing about ansible.


Saukkis
May 16, 2003

Unless I'm on the inside curve pointing straight at oncoming traffic the high beams stay on and I laugh at your puny protest flashes.
I am Most Important Man. Most Important Man in the World.

BlankSystemDaemon posted:

Well, it's enabled by default - it's sort of hinted at by the workaround being to disable it, but I wish they'd spell it out explicitly.

Also, excuse me while I'm over here, shaking my head at implementing mdoc/groff in XML.

Yes, I read that, but that's not what it means. If you have the VFS object "fruit" enabled, then AAPL is also enabled by default. But fruit itself isn't enabled by default - there isn't even a "disable" switch for it, because the module is only loaded if your configuration has a "vfs objects" stanza that includes "fruit". If that stanza is missing, fruit isn't enabled.

quote:

As a workaround remove the "fruit" VFS module from the list of
configured VFS objects in any "vfs objects" line in the Samba
configuration smb.conf
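To make the advisory concrete: fruit is only active on shares where it's listed, so a hypothetical affected share (share name and path made up) looks like this, and the workaround is simply deleting "fruit" from that line:

```ini
[share]
    path = /tank/share
    ; The fruit module (and with it AAPL) is only loaded because it's
    ; named here - there is no separate "disable fruit" option; absence
    ; from this list is the off state.
    vfs objects = catia fruit streams_xattr
```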

BlankSystemDaemon
Mar 13, 2009



Saukkis posted:

Yes, I read that, but that's not what it means. If you have the VFS object "fruit" enabled, then AAPL is also enabled by default. But fruit itself isn't enabled by default - there isn't even a "disable" switch for it, because the module is only loaded if your configuration has a "vfs objects" stanza that includes "fruit". If that stanza is missing, fruit isn't enabled.
Even if the default smb4.conf file doesn't include the directive, if you can find me a NAS appliance which doesn't include it, I'll be impressed.

Chumbawumba4ever97
Dec 31, 2000

by Fluffdaddy
I got a couple of those 18TB Western Digital drives from the last deal.

Before I shuck my drives I like to make sure they are 100% good to go so I don't have to deal with explaining FTC rulings to WD tech support to get them to fix a shucked drive. So what I have been doing is the slow format option in Windows which I have read is a great way to test a hard drive since the long format checks every bit on the drive or something.

Unfortunately on two different computers with two different drives, the formatting just gives up after like 5 days. The drive seems to be fine but I am thinking it's simply too big to do a slow format in Windows or something.

What's the best way to test brand new drives for defects in Windows? I also tried the WD Diagnostics program but it looks like it was made in the Windows 95 days and it currently says it will take 6 days to fully check the drive.

Any other recommended methods out there?

e.pilot
Nov 20, 2011

sometimes maybe good
sometimes maybe shit

Smashing Link posted:

What is everyone's favorite versioned backup software + platform?

Depends entirely on what you're backing up. I use a super convoluted script some dude made, plus rsync to Google Drive, as well as an "offsite" local backup to a cheap Celeron mini PC and a couple of external drives that live in the shed in my backyard.

Smashing Link
Jul 8, 2003

I'll keep chucking bombs at you til you fall off that ledge!
Grimey Drawer

e.pilot posted:

Depends entirely on what you're backing up. I use a super convoluted script some dude made, plus rsync to Google Drive, as well as an "offsite" local backup to a cheap Celeron mini PC and a couple of external drives that live in the shed in my backyard.

Ideally something I can run with a local HDD destination that I can put in a 2nd location and a cloud target (Gsuite for now). I like rsync for backups to a local HDD but want something that has versioning akin to HyperBackup by Synology.

e.pilot
Nov 20, 2011

sometimes maybe good
sometimes maybe shit

Smashing Link posted:

Ideally something I can run with a local HDD destination that I can put in a 2nd location and a cloud target (Gsuite for now). I like rsync for backups to a local HDD but want something that has versioning akin to HyperBackup by Synology.

This is the rsync script I’m using in Unraid, it does versioning. Not sure if it’ll work on other platforms but it’s just linux so I’d assume so.

https://forums.unraid.net/topic/97958-rsync-incremental-backup/

Smashing Link
Jul 8, 2003

I'll keep chucking bombs at you til you fall off that ledge!
Grimey Drawer

e.pilot posted:

This is the rsync script I’m using in Unraid, it does versioning. Not sure if it’ll work on other platforms but it’s just linux so I’d assume so.

https://forums.unraid.net/topic/97958-rsync-incremental-backup/

Thanks, that looks promising. Will definitely give it a try, local HDD first.

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



Rescue Toaster posted:

Thanks, there's some interesting stuff in here. Though I know literally nothing about ansible.

I should probably have mentioned that I'm using this as an opportunity to learn it and also have a thing I can run to stand up a new NAS without re-learning a bunch of stuff every time.

Klyith
Aug 3, 2007

GBS Pledge Week

Chumbawumba4ever97 posted:

So what I have been doing is the slow format option in Windows which I have read is a great way to test a hard drive since the long format checks every bit on the drive or something.

Full format writes zeros and does a sector read check, so it takes approximately drive full write + drive full read time. That's thorough, but I don't know that on a new drive it has much value over just a normal chkdsk /R surface scan. If the drive has some sort of major return-worthy defect on the platter it should get picked up with just that.

Chumbawumba4ever97 posted:

Unfortunately on two different computers with two different drives, the formatting just gives up after like 5 days. The drive seems to be fine but I am thinking it's simply too big to do a slow format in Windows or something.

Are you using a USB 2 connection? That limits you to 50MB/s. At that speed a full format might take 200 hours. :lmao: My guess would be that it's failing because USB connections aren't perfect, and the format process is not resilient to a small interruption.

Even with USB 3 a full format of those drives might take 24-36 hours. They have sustained write at something over 200 MB/s, or at least the HC550s they are apparently based on do.
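Back-of-the-envelope, using the assumed speeds above:

```shell
# Full format = one full write pass + one full read-verify pass.
# Hypothetical 18 TB drive over USB 2 at ~50 MB/s effective throughput.
bytes=$((18 * 10**12))
usb2_speed=$((50 * 10**6))              # bytes/sec
hours=$((2 * bytes / usb2_speed / 3600))
echo "~${hours} hours"
```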

CerealKilla420
Jan 3, 2014

"I need a handle man..."

Chumbawumba4ever97 posted:

I got a couple of those 18TB Western Digital drives from the last deal.

Before I shuck my drives I like to make sure they are 100% good to go so I don't have to deal with explaining FTC rulings to WD tech support to get them to fix a shucked drive. So what I have been doing is the slow format option in Windows which I have read is a great way to test a hard drive since the long format checks every bit on the drive or something.

Unfortunately on two different computers with two different drives, the formatting just gives up after like 5 days. The drive seems to be fine but I am thinking it's simply too big to do a slow format in Windows or something.

What's the best way to test brand new drives for defects in Windows? I also tried the WD Diagnostics program but it looks like it was made in the Windows 95 days and it currently says it will take 6 days to fully check the drive.

Any other recommended methods out there?

This probably isn't much help but if you use the 'credit card' method of shucking the drives, you could actually put the external enclosure back together and RMA the drive as if it had never been shucked...

Not exactly ethical, but I imagine you would be able to get away with it. Fortunately none of the drives I received were defective...

Boner Wad
Nov 16, 2003
I'm running samba on my Ubuntu 20.04 "NAS" and it's slow as gently caress to copy files. For example, my Time Machine recovery transfer is between 1.5-10MBit/sec. I tried the `socket options = TCP_NODELAY` option but it's still really slow. Data is stored on zfs.

I tried iperf and it's around 300MBit/sec so I don't think it's the network.
A file copy from SSD to zfs is 120MBit/sec (does this seem slow?)

I'm not sure what else I should look at.

BlankSystemDaemon
Mar 13, 2009



CerealKilla420 posted:

This probably isn't much help but if you use the 'credit card' method of shucking the drives, you could actually put the external enclosure back together and RMA the drive as if it had never been shucked...

Not exactly ethical, but I imagine you would be able to get away with it. Fortunately none of the drives I received were defective...
One thing to note about this is that if you do RMA it and the enclosure doesn't match the disk, WD won't accept the RMA.
So either you keep the enclosure with a label indicating which disk serial number it belongs to, or you throw out the enclosure completely.

Boner Wad posted:

I'm running samba on my Ubuntu 20.04 "NAS" and it's slow as gently caress to copy files. For example, my Time Machine recovery transfer is between 1.5-10MBit/sec. I tried the `socket options = TCP_NODELAY` option but it's still really slow. Data is stored on zfs.

I tried iperf and it's around 300MBit/sec so I don't think it's the network.
A file copy from SSD to zfs is 120MBit/sec (does this seem slow?)

I'm not sure what else I should look at.
According to the documentation, TCP_NODELAY is the default.
Do note, too, that SO_SNDBUF and SO_RCVBUF require you to know the TCP and UDP send and receive buffer sizes that are specific to the OS you're using.

And you should also be very careful of any guides that tell you to "always use X option" without explaining exactly why, because there's a shitload of cargo-culting where someone just copies the recommendations of other people, or thinks that seeing a number go up in iperf over a loopback is meaningful to actual network traffic.

EDIT: MBit/sec isn't a unit. It's either MBps or Mbps.
In either case, I'd say you need to figure out what's wrong with your storage before you start working on the network.
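One way to check what's actually in effect (rather than trusting guides) is testparm, which with -v prints defaulted options too; a sketch, assuming Samba's tooling is installed on the server:

```
# Dump the effective configuration, including defaults, and pick out
# the socket options line.
testparm -sv 2>/dev/null | grep -i "socket options"
```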

BlankSystemDaemon fucked around with this message at 23:07 on Feb 2, 2022

Rescue Toaster
Mar 13, 2003

BlankSystemDaemon posted:

One thing to note about this is that if you do RMA it and the enclosure doesn't match the disk, WD won't accept the RMA.
So either you keep the enclosure with a label indicating which disk serial number it belongs to, or you throw out the enclosure completely.

I shucked a bunch of 8TB WD EasyStores, and marked which drive went with which enclosure. One failed almost immediately on badblocks and I felt no guilt at all re-assembling it and returning.

I guess technically I could have run badblocks over the USB 3.0 connection; it would have just been a bit of a hassle to hook all of them up at once. Also, I was within Amazon's return/swap window, so I doubt it ever even made it back to WD anyway. I figure once I load a drive up with data I'm probably not going to return it within the warranty period, so I hammer them during the return window and then just accept the loss after that.

Boner Wad
Nov 16, 2003

BlankSystemDaemon posted:

One thing to note about this is that if you do RMA it and the enclosure doesn't match the disk, WD won't accept the RMA.
So either you keep the enclosure with a label indicating which disk serial number it belongs to, or you throw out the enclosure completely.

According to the documentation, TCP_NODELAY is the default.
Do note, too, that SO_SNDBUF and SO_RCVBUF require you to know the TCP and UDP send and receive buffer sizes that are specific to the OS you're using.

And you should also be very careful of any guides that tell you to "always use X option" without explaining exactly why, because there's a shitload of cargo-culting where someone just copies the recommendations of other people, or thinks that seeing a number go up in iperf over a loopback is meaningful to actual network traffic.

EDIT: MBit/sec isn't a unit. It's either MBps or Mbps.
In either case, I'd say you need to figure out what's wrong with your storage before you start working on the network.

I only enabled TCP_NODELAY, didn't mess with the buffer sizes. Makes sense why it didn't help.
The file copy was 120MB/s. The iperf was 300 Mbits/sec.

I'd say both the network and storage are adequate and it's just a samba problem unless I'm missing something else here.

Saukkis
May 16, 2003

Unless I'm on the inside curve pointing straight at oncoming traffic the high beams stay on and I laugh at your puny protest flashes.
I am Most Important Man. Most Important Man in the World.

Boner Wad posted:

I only enabled TCP_NODELAY, didn't mess with the buffer sizes. Makes sense why it didn't help.
The file copy was 120MB/s. The iperf was 300 Mbits/sec.

I'd say both the network and storage are adequate and it's just a samba problem unless I'm missing something else here.

Is that 300Mbps over wireless or wired gigabit connection? It would seem quite slow for gigabit. And please use accurate units. 120MB/s means 120 megabytes per second, or almost a gigabit per second.
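The conversion is just a factor of eight (ignoring protocol overhead):

```shell
# 120 megabytes/sec expressed in megabits/sec - almost a full gigabit.
mbps=$((120 * 8))
echo "120 MB/s = ${mbps} Mbps"
```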

the spyder
Feb 18, 2011
I have seven 72TB (8TB drives) Quanta-based storage nodes cycling out of production.
Each has 4x SSD plus 2x SSD boot drives, and 10Gb NICs. Any fun ideas?

Corb3t
Jun 7, 2003

Is there anything fun I could do with a 4-core/8-thread Intel i5 from 2015 or so? I've exclusively used Unraid with my 72 TB server, but maybe I'll try my hand at a smaller FreeNAS build? Thoughts?

BlankSystemDaemon
Mar 13, 2009



Rescue Toaster posted:

I shucked a bunch of 8TB WD EasyStores, and marked which drive went with which enclosure. One failed almost immediately on badblocks and I felt no guilt at all re-assembling it and returning.

I guess technically I could have run badblocks on the USB 3.0 it would have just been a bit of a hassle to hook up all of them at once. Also I was within Amazon's return/swap window so I doubt it ever even made it back to WD anyway. I feel like once I load it up with data I'm probably not going to return it in the warranty period anyway so I hammer them during the return window and then just accept the loss after that.
There's no reason why you should feel guilt.
So far as I understand it, both EU and US laws make it possible for consumers to partially disassemble something in order to check that it's working correctly.
On top of that, as anyone who's worked telephone helldesk can tell you, that's often a required step as part of rootcausing before an engineer can be sent out or the device will be taken in for servicing.

Boner Wad posted:

I only enabled TCP_NODELAY, didn't mess with the buffer sizes. Makes sense why it didn't help.
The file copy was 120MB/s. The iperf was 300 Mbits/sec.

I'd say both the network and storage are adequate and it's just a samba problem unless I'm missing something else here.
120MBps still seems slow even for spinning rust, depending on what kind of array you're using and what the disks are - it's also, incidentally, above where 1000BaseT Ethernet tops out when using Samba (because it transfers files via TCP).

300Mbps is where 802.11n tops out (without MIMO), and is around 36MBps - so that's your theoretical maximum.
Did you check the documentation for the other socket options? I linked it partly to show that the option you enabled is enabled by default, but also because it shows a bunch of other options that you can play with.
I'd recommend you test one at a time, and then afterwards try combining whichever options seem to make a difference.
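Same arithmetic for the wireless link, assuming the 300 Mbps figure:

```shell
# Raw conversion; real-world 802.11n throughput lands a bit lower
# (~36 MB/s) once protocol overhead is taken into account.
mbytes=$(awk 'BEGIN { printf "%.1f", 300 / 8 }')
echo "300 Mbps = ${mbytes} MB/s at most"
```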

the spyder posted:

I have seven 72TB (8TB drives) Quanta-based storage nodes cycling out of production.
Each has 4x SSD plus 2x SSD boot drives, and 10Gb NICs. Any fun ideas?
Are those SFP+ NICs?
If so, I'd wanna look into distributed hypervisor and storage clustering - because presumably there's at least some half-decent CPU and memory in the boxes?
Although at that point, you're veering dangerously close to the homelab thread.

Corb3t posted:

Is there anything fun I could do with a 4-core/8-thread Intel i5 from 2015 or so? I've exclusively used Unraid with my 72 TB server, but maybe I'll try my hand at a smaller FreeNAS build? Thoughts?
Seems like an excellent system to set up as an online offsite backup for your primary storage, because it shouldn't take much in the way of power while idle.

BlankSystemDaemon fucked around with this message at 09:17 on Feb 3, 2022

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
I can confirm that WD rejected my RMA when I messed up, got lazy, and mixed up not just the drive and enclosure but the controller as well. So if you ever want to RMA an EasyStore you need to hold onto the inner clamshell (the outer one only needs to match the branding variant, whether it's EasyStore or Elements) and the matching controller. The controller board reports a serial number as well, and if it doesn't match the drive label the RMA tech will pack it up and send it right back.

The alternative is to keep only the drives and not bother with the shells at all, but it's not 100% clear whether that will work. So far, nobody has said their RMA was rejected after putting everything back together basically as it came from the factory.

PitViper
May 25, 2003

Welcome and thank you for shopping at Wal-Mart!
I love you!
I didn't keep any enclosures for my shucked 12TB drives, because at $200 each vs. $279 for the normal Red Plus, I figured I'd come out ahead even if I couldn't do an RMA on the bare disks.

I'm almost 2 years on that set of 4 disks without a hiccup, and soon it'll be time to swap the last 3 4TB disks from the other vdev that have about 7.5 years of runtime on them.

Raymond T. Racing
Jun 11, 2019

I'm in a Discord with people who have successfully RMA'd bare drives, but they accidentally mixed up a controller and a drive one time, and WD got pissed at them and accused them of defrauding WD.

Motronic
Nov 6, 2009

Chumbawumba4ever97 posted:

What's the best way to test brand new drives for defects in Windows? I also tried the WD Diagnostics program but it looks like it was made in the Windows 95 days and it currently says it will take 6 days to fully check the drive.

Can we get back to this specific question? Because I have a couple of drives that are new and seemingly poo poo. I can most easily put them in a USB sled and run something on windows or OSX, I can connect them for real on a windows or OSX machine or I can roll something else up if I really need to. It's just all differing levels of pain and inconvenience.

I'm still on my WD60EFZX NAS replacement epic that I've been slacking on. I need to do something soon and get this crap done.

IOwnCalculus
Apr 2, 2003





Will nwipe run on OSX? That's what I use on my server to stress test any disk I'm not certain of, but I'm doing so on my Ubuntu box.

Wild EEPROM
Jul 29, 2011


oh, my, god. Becky, look at her bitrate.
Why are you swapping controllers in the first place?

Thanks Ants
May 21, 2004

#essereFerrari


I don't think they're controllers as such, but the SATA to USB board.

BlankSystemDaemon
Mar 13, 2009



Motronic posted:

Can we get back to this specific question? Because I have a couple of drives that are new and seemingly poo poo. I can most easily put them in a USB sled and run something on windows or OSX, I can connect them for real on a windows or OSX machine or I can roll something else up if I really need to. It's just all differing levels of pain and inconvenience.

I'm still on my WD60EFZX NAS replacement epic that I've slacking on. I need to do something soon and get this crap done.
Part of the problem is that on Windows, there's not the same programmatic low-level access to the hardware that there is on any Unix-like.
The tools du jour like badblocks, while they work if you use WSL (I don't know if it works with WSL2, as that uses Hyper-V and a custom Linux kernel), often make it hard to detect the errors that drives might throw when they're being burned in.

So there isn't really any way to do it right, and it's further complicated by whether the USB controller permits you to issue ATA FORMAT, unless you have USB attached SCSI in which case you can issue a SCSI FORMAT to get the true low-level format (that takes forever, as has already been observed) - but even then, you don't get the error information easily.

Wild EEPROM posted:

Why are you swapping controllers in the first place
The trick with shucking is to avoid the drives where WD has replaced the SATA controller on the drive with a USB controller - because those ones you can only use in the enclosure.

Motronic
Nov 6, 2009

BlankSystemDaemon posted:

Part of the problem is that on Windows, there's not the same programmatic low-level access to the hardware that there is on any Unix-like.
The tools du jour like badblocks, while they work if you use WSL (I don't know if it works with WSL2, as that uses Hyper-V and a custom Linux kernel), often make it hard to detect the errors that drives might throw when they're being burned in.

So there isn't really any way to do it right, and it's further complicated by whether the USB controller permits you to issue ATA FORMAT, unless you have USB attached SCSI in which case you can issue a SCSI FORMAT to get the true low-level format (that takes forever, as has already been observed) - but even then, you don't get the error information easily.

Okay, I guess I'll have to throw something together to do it right unless I can do this while they are actually in the NAS, which seems like a bad idea.


Wild EEPROM posted:

Why are you swapping controllers in the first place

If this is being directed at me I'm not swapping controllers. This was a menu of "stuff I have laying around that would be easy to use" excluding my actual TrueNAS box and my ESXi box.

Rescue Toaster
Mar 13, 2003

Motronic posted:

Okay, I guess I'll have to throw something together to do it right unless I can do this while they are actually in the NAS, which seems like a bad idea.

Last time I had to break in drives I just used an Ubuntu live USB to run smartctl & badblocks.
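For reference, that amounts to something like the following (destructive; /dev/sdX is a placeholder for the new, empty disk):

```
# Start the drive's built-in extended self-test (runs in the firmware)
smartctl -t long /dev/sdX

# Destructive burn-in: write and verify four patterns across every sector
badblocks -wsv -b 4096 /dev/sdX

# Afterwards, check reallocated/pending sector counts and the self-test log
smartctl -a /dev/sdX
```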

Klyith
Aug 3, 2007

GBS Pledge Week

BlankSystemDaemon posted:

Part of the problem is that on Windows, there's not the same programmatic low-level access to the hardware that there is on any Unix-like.
The tools du jour like badblocks, while they work if you do WSL (I don't know if it works with WSL2, as that's uses Hyper-V and a custom Linux kernel), often make it hard to detect the errors that drives might throw when they're being burned in.

So there isn't really any way to do it right, and it's further complicated by whether the USB controller permits you to issue ATA FORMAT, unless you have USB attached SCSI in which case you can issue a SCSI FORMAT to get the true low-level format (that takes forever, as has already been observed) - but even then, you don't get the error information easily.

The question I'd ask is whether a low-level format is any better at predicting drive health than a high-level one?

Also, I'm not seeing where badblocks is doing anything more "low level" than windows format. Badblocks writes 4 different data patterns to every sector of the drive, and verifies. There's an optional switch /P: for format that writes additional passes of random data after the first zeroing. So format /P:3 would seem to me to be highly equivalent to badblocks (and just as slow). Both are targeting the standard LBA sectors on the drive. Why is one better?
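As a concrete example of that switch (hypothetical drive letter):

```
:: Full format of E: with three additional random-data passes after the
:: initial zeroing - roughly the same coverage as a badblocks -w run,
:: and comparably slow.
format E: /FS:NTFS /P:3
```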



...

In the youtuber thread BSD linked the backblaze reliability results. One thing that's been pretty established by most HDD studies is that they've never found a perfect diagnostic for predicting failure. Even among failed drives, a significant number don't give warning with SMART errors. Bad sectors, independent of anything else, are not a good predictor of failure.

The second thing is that HDDs in the past 10 years don't have much of a bathtub curve -- the initial failure rate isn't really elevated over the year 2 rate. So to me the general trends for HDD reliability say there is no magic bullet to predict whether a brand-new drive will be long term good. Install it and write a bunch of data, do standard data-integrity stuff at the high level. Maybe when they get close to their warranty expiration check the SMART errors, maybe run the OEM's utility that judges if it counts for an in-warranty failure?

I dunno, this seems like the inherent tradeoff for buying & shucking cheap externals that have less warranty. That money you saved is your cold spare.

Klyith fucked around with this message at 16:46 on Feb 4, 2022

Fancy_Lad
May 15, 2003
Would you like to buy a monkey?
When I was using Windows + DrivePool, I had a copy of Hard Disk Sentinel that I'd use for burn-in on a new drive. The UI looked straight out of w2k, but it had the option to do various types of writes with read verification. This is from memory, but I believe I used to do a quick SMART test, an extended SMART test, a full sequential random-data write/read, a butterfly write/read, then an all-zeros-in-random-order write/read. On 3ish TB drives those tests would take over a week if memory serves, but I specifically remember that one of the last drives I tested, a 6TB Toshiba, died during the butterfly test, and I was able to return it to the store.

Just looked, and it looks like the software got an update at the start of this year, although the UI still looks like the same lovable dumpster fire :D
https://www.hdsentinel.com/

Edit: If anyone cares, these days I'm on UnRAID and typically do a 3 cycle pre-clear with post validation to burn in - and about half the time I get annoyed at how long it is taking and stop it sometime in the middle of the second cycle :)

Fancy_Lad fucked around with this message at 16:54 on Feb 4, 2022

BlankSystemDaemon
Mar 13, 2009



Klyith posted:

The question I'd ask is whether a low-level format is any better at predicting drive health than a high-level one?

Also, I'm not seeing where badblocks is doing anything more "low level" than windows format. Badblocks writes 4 different data patterns to every sector of the drive, and verifies. There's an optional switch /P: for format that writes additional passes of random data after the first zeroing. So format /P:3 would seem to me to be highly equivalent to badblocks (and just as slow). Both are targeting the standard LBA sectors on the drive. Why is one better?



...

In the youtuber thread BSD linked the backblaze reliability results. One thing that's been pretty established by most HDD studies is that they've never found a perfect diagnostic for predicting failure. Even among failed drives, a significant number don't give warning with SMART errors. Bad sectors, independent of anything else, are not a good predictor of failure.

The second thing is that HDDs in the past 10 years don't have much of a bathtub curve -- the initial failure rate isn't really elevated over the year 2 rate. So to me the general trends for HDD reliability say there is no magic bullet to predict whether a brand-new drive will be long term good. Install it and write a bunch of data, do standard data-integrity stuff at the high level. Maybe when they get close to their warranty expiration check the SMART errors, maybe run the OEM's utility that judges if it counts for an in-warranty failure?

I dunno, this seems like the inherent tradeoff for buying & shucking cheap externals that have less warranty. That money you saved is your cold spare.
What do you mean by a high-level format? A quick format in Windows just consists of wiping the MBR or GPT from the beginning and end of the LBA range while trying to avoid the host-protected LBAs, as well as getting rid of the Master File Table if you're using NTFS.

The low-level actions I'm talking about are things like SCSI FORMAT UNIT or SCSI SANITIZE or ATA SECURE ERASE or ATA SANITIZE, whereas my concept of a high-level is something equivalent to DBAN where some pattern (whether it's zeroes, ones, random, or multiple iterations of any or a combination of those) is written using the WRITE operations.

The main point, though, is ensuring that every single sector of the disk gets written to at least once, and that you have some way of correlating that action with an error that occurs.
FreeBSD does this by having no block devices (which are a kind of device that always writes data in blocks), but on Linux you have to ask for that explicitly with raw(8) and even then I don't know that Linux has anything like GEOM which will always report where an error happens (rather than you having to probe S.M.A.R.T for the data, in the hopes that the firmware actually reports it). On Windows, there's no way I know of to avoid all the caching that Windows tries to do to speed up things.

The main advantage of badblocks is that that part about verification. Outside of a filesystem with checksums and a scrub command to verify all copies of a block, I don't know of a way to get that.
The low-level format, while it does cover the whole disk, doesn't verify anything.

And no, there's no real way to empirically know when a disk is going to die - because even if the firmware doesn't lie (including in the S.M.A.R.T data), you're still fundamentally relying on statistical correlation - and one always has to account for statistical outliers.

I'm a little more skeptical of BackBlaze's claims that there's less of a bathtub curve now than there ever was - I can see how their use of the disks, which isn't how most people use disks, could impact that more than it impacts the raw statistical data for annualized failure rates.

Fancy_Lad posted:

When I was using Windows + DrivePool, I had a copy of Hard Disk Sentinel that I'd use for burn-in on a new drive. The UI looked straight out of w2k, but it had the option to do various types of writes with read verification. This is from memory, but I believe I used to do a quick smart test, extended smart test, full sequential random data write/read, a butterfly write/read, then an all zeros in random order write/read. On 3ish TB drives those tests would take over a week if memory serves, but I do specifically remember one of the last drives I tested, a 6TB Toshiba, died during the butterfly test and I was able to do a return with the store on it.

Just looked, and it looks like the software got an update at the start of this year, although the UI still looks like the same lovable dumpster fire :D
https://www.hdsentinel.com/

Edit: If anyone cares, these days I'm on UnRAID and typically do a 3 cycle pre-clear with post validation to burn in - and about half the time I get annoyed at how long it is taking and stop it sometime in the middle of the second cycle :)
I really hate to harp on about this, but S.M.A.R.T data is only as good as the manufacturer thinks they can get away with, because they've got a monetary incentive to avoid having you return a drive before it's outside its warranty.

wolrah
May 8, 2006
what?
How long is normal for a badblocks run?

I've never done it before and just YOLO'd hard drives in to my system since I usually have a space crunch to resolve, but for a change I wasn't in a squeeze so I decided to give it a shot before adding another four drives to my pool.

72 hours later I'm here:

code:
➜  ~ sudo badblocks -wsv -b 4096 /dev/sda
Checking for bad blocks in read-write mode
From block 0 to 2929721343
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing:  47.36% done, 71:05:37 elapsed. (0/0/0 errors)
This is on a 12TB shucked WD.

I'm certainly glad I ran this in a screen session so I could at least detach from it, but at this rate it's looking like it's going to take a solid week to finish this one drive while three more have yet to run. I guess I can run the other three in parallel but I don't know how much that'll impact the performance of the other vdev that shares a SAS controller.
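Running the remaining three in parallel is mostly a question of controller bandwidth, since each badblocks run is single-threaded per device. A minimal sketch of one detached screen session per disk - device names and log paths are made up, and it defaults to a dry run that only prints what would be launched:

```shell
# Sketch: one detached screen session per disk so several badblocks runs can
# go in parallel. Device names/log paths are placeholders - adjust for your
# system. DRY_RUN=1 (the default here) prints instead of launching.
DRY_RUN=${DRY_RUN:-1}
for dev in sdb sdc sdd; do
  cmd="badblocks -wsv -b 4096 -o /root/badblocks-$dev.log /dev/$dev"
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: screen -dmS bb-$dev $cmd"
  else
    screen -dmS "bb-$dev" $cmd   # detached session named bb-sdb, bb-sdc, ...
  fi
done
```

The `-o` log file means you still get the bad-block list even if the session dies; `screen -r bb-sdb` reattaches to check progress.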

Klyith
Aug 3, 2007

GBS Pledge Week
Sorry I'm chopping your post up a bit to handle two different topics:

BlankSystemDaemon posted:

What do you mean by a high-level format? A quick format in Windows just consists of wiping the MBR or GPT from the beginning and end of the LBA range while trying to avoid the host-protected LBAs, as well as getting rid of the Master File Table if you're using NTFS.

The main point, though, is ensuring that every single sector of the disk gets written to at least once, and that you have some way of correlating that action with an error that occurs.

The main advantage of badblocks is that verification step. Outside of a filesystem with checksums and a scrub command to verify all copies of a block, I don't know of another way to get that.

Um, I'm explicitly talking about the non-quick format? Which does, since windows 7, write & read data to the entire partition you are formatting, which will verify each sector. It reports bad sector errors if any, though you don't get badblock's exact sector #### report. (Though my belief is modern drives transparently remap sectors if they fail on write -- so using either windows format or badblocks you'd want to also look at smart attributes and see those stats.)
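The attributes worth watching for that remap check are the reallocation counters. A sketch of the filter, with sample `smartctl -A` output inlined as a variable since smartctl needs real hardware - the attribute values here are illustrative, not from any particular drive:

```shell
# After a full-surface write, the interesting SMART attributes are the ones
# tracking remapped/pending sectors. On a live system: smartctl -A /dev/sdX
# Sample output inlined here; values are illustrative.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Always       -       0'
hits=$(printf '%s\n' "$sample" | grep -cE 'Reallocated|Pending|Uncorrectable')
printf '%s\n' "$sample" | grep -E 'Reallocated|Pending|Uncorrectable'
echo "checked $hits attributes"
```

Non-zero raw values in any of those after a burn-in are a reasonable cue to return the drive, regardless of whether the format or badblocks run itself reported an error.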

BlankSystemDaemon posted:

The low-level actions I'm talking about are things like SCSI FORMAT UNIT or SCSI SANITIZE or ATA SECURE ERASE or ATA SANITIZE, whereas my concept of a high-level is something equivalent to DBAN where some pattern (whether it's zeroes, ones, random, or multiple iterations of any or a combination of those) is written using the WRITE operations.

I'm pretty sure that stuff is accessible on Windows, though much less easily than a *nix of course. Ata Secure erase for example -- drive manufacturer software for windows can do that from within windows. I could use WD Dashboard or Samsung Magician to ata secure erase my SSDs right now. Windows doesn't stop it.

The software ecosystem is much more limited and less utilitarian than *nix, that's very true. But I don't think windows locks out anything low-level. You can issue these commands if you write software that interacts with the drivers at that level. You can avoid the caching and normal storage APIs if you want to. But again, ecosystem. Commercial software on windows is where that stuff mostly exists.

So yeah, someone with the knowledge to apply low-level commands to drives can do so more easily on *nix. I still question whether that is a particularly useful ability, or that doing so has any relationship to judging drive reliability.


wolrah posted:

How long is normal for a badblocks run?

It does 4 read&write patterns, and you've done about 1 and 3/4ths. So ~90 hours to go.

BlankSystemDaemon
Mar 13, 2009



wolrah posted:

How long is normal for a badblocks run?

I've never done it before and just YOLO'd hard drives into my system since I usually have a space crunch to resolve, but for a change I wasn't in a squeeze so I decided to give it a shot before adding another four drives to my pool.

72 hours later I'm here:

code:
➜  ~ sudo badblocks -wsv -b 4096 /dev/sda
Checking for bad blocks in read-write mode
From block 0 to 2929721343
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing:  47.36% done, 71:05:37 elapsed. (0/0/0 errors)
This is on a 12TB shucked WD.

I'm certainly glad I ran this in a screen session so I could at least detach from it, but at this rate it's looking like it's going to take a solid week to finish this one drive while three more have yet to run. I guess I can run the other three in parallel but I don't know how much that'll impact the performance of the other vdev that shares a SAS controller.
That seems entirely within reason for a 12TB disk.

While USB3 is still packet-based, it is full-duplex and has 128/130b encoding (as opposed to the half-duplex 8/10b encoding of USB2) - so if you've got three devices moving traffic at about 160MBps, that still puts you just shy of the ~550MBps that you can reasonably expect.

Klyith posted:

Sorry I'm chopping your post up a bit to handle two different topics:

Um, I'm explicitly talking about the non-quick format? Which does, since windows 7, write & read data to the entire partition you are formatting, which will verify each sector. It reports bad sector errors if any, though you don't get badblock's exact sector #### report. (Though my belief is modern drives transparently remap sectors if they fail on write -- so using either windows format or badblocks you'd want to also look at smart attributes and see those stats.)

I'm pretty sure that stuff is accessible on Windows, though much less easily than a *nix of course. Ata Secure erase for example -- drive manufacturer software for windows can do that from within windows. I could use WD Dashboard or Samsung Magician to ata secure erase my SSDs right now. Windows doesn't stop it.

The software ecosystem is much more limited and less utilitarian than *nix, that's very true. But I don't think windows locks out anything low-level. You can issue these commands if you write software that interacts with the drivers at that level. You can avoid the caching and normal storage APIs if you want to. But again, ecosystem. Commercial software on windows is where that stuff mostly exists.

So yeah, someone with the knowledge to apply low-level commands to drives can do so more easily on *nix. I still question whether that is a particularly useful ability, or that doing so has any relationship to judging drive reliability.

It does 4 read&write patterns, and you've done about 1 and 3/4ths. So ~90 hours to go.
I think we're still talking about different things; badblocks writes a pattern to the disk and then verifies it by reading it back, which is not the same as a Windows 7 full format, which writes a pattern to the disk but only verifies that each sector accepts the write.

ATA Secure Erase (and all the others, if memory serves) tells the drive firmware to perform the operation; it doesn't rely on anything the OS is doing, and as long as the drive stays powered (for example in an external enclosure), it'll continue even if you shut down the host system.
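On Linux the usual tool for kicking that off is hdparm. A heavily hedged sketch of the sequence - the device path is a placeholder, and the commands are echoed rather than executed because actually running them destroys all data on the drive:

```shell
# DESTRUCTIVE if actually run - commands are echoed here on purpose.
# Rough ATA Secure Erase sequence via hdparm; /dev/sdX is a placeholder.
dev=/dev/sdX
echo "hdparm -I $dev"                                     # Security section must say 'not frozen'
echo "hdparm --user-master u --security-set-pass p $dev"  # set a throwaway password
echo "hdparm --user-master u --security-erase p $dev"     # firmware performs the wipe
```

If the drive reports "frozen", a suspend/resume cycle often unfreezes it; the erase itself runs entirely in the drive's firmware, which is exactly why it keeps going without the host.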

Just because you can send those commands doesn't mean Windows gives you access to the device driver without some caching happening between the driver and wherever you're issuing the commands from.
That caching makes it impossible to correlate any single write operation with an error that identifies where on the physical disk the fault is.
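For reference, the checksum-and-scrub route mentioned up-thread looks like this on ZFS - the pool name is a placeholder, and the commands are echoed rather than run since a scrub needs a live pool:

```shell
# The checksums-plus-scrub verification referred to earlier, as it looks on
# ZFS. "tank" is a placeholder pool name; echoed since it needs a live pool.
pool=tank
echo "zpool scrub $pool"       # re-reads every allocated block, verifying checksums
echo "zpool status -v $pool"   # per-device READ/WRITE/CKSUM error counters
```

Because every block carries a checksum, a scrub tells you not just that a read failed but which device returned bad data - which is the write-to-error correlation that caching layers otherwise destroy.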

wolrah
May 8, 2006
what?

BlankSystemDaemon posted:

That seems entirely within reason for a 12TB disk.

While USB3 is still packet-based, it is full-duplex and has 128/130b encoding (as opposed to the half-duplex 8/10b encoding of USB2) - so if you've got three devices moving traffic at about 160MBps, that still puts you just shy of the ~550MBps that you can reasonably expect.

These drives are already shucked and sitting in my server connected to a LSI controller so USB isn't involved anymore, just wasn't sure whether it was normal for modern disks to take this long. I guess I should probably start the scan on the other drives.

BlankSystemDaemon
Mar 13, 2009



wolrah posted:

These drives are already shucked and sitting in my server connected to a LSI controller so USB isn't involved anymore, just wasn't sure whether it was normal for modern disks to take this long. I guess I should probably start the scan on the other drives.
You're running into the same thing that's the cause of the recommendation to avoid hardware RAID5: The bandwidth of spinning rust has not kept up with the incredible increase in capacity.

Remember that ~100MBps on spinning rust was not unheard of when disks were tens of GB. Now they're three orders of magnitude bigger, yet sequential throughput has barely doubled.
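The back-of-envelope arithmetic behind that, assuming a ~160MB/s average across the platter (outer tracks are faster, inner tracks slower):

```shell
# Why a full-disk pass takes so long on modern drives. Assumes ~160 MB/s
# average across the platter; capacity in decimal units, as drives are sold.
capacity_mb=$((12 * 1000 * 1000))   # 12 TB in MB
speed_mbps=160                      # assumed average sequential rate, MB/s
hours_per_pass=$((capacity_mb / speed_mbps / 3600))
echo "one full pass:                       ~${hours_per_pass} h"
echo "badblocks -w (4 patterns x write+read = 8 passes): ~$((hours_per_pass * 8)) h"
```

That lands right around a week for a 12TB disk, which matches what wolrah is seeing - and it's the same arithmetic that makes hardware RAID5 rebuilds on big disks so painful.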

Moey
Oct 22, 2010

I LIKE TO MOVE IT
I have used "WD Drive Utilities" before shucking for the past 8 Easystore drives - a full scan took maybe 28 hours per drive on my most recent 14TB disks.

Never found any errors, have not had any issues since shucking (knocks on wood).

Sir Bobert Fishbone
Jan 16, 2006

Beebort
I have what looks on its face to be the dumbest SMART error. This is the smartctl output for a 1TB Gigabyte nvme drive on my Proxmox server, which has begun emailing me every day that the drive is failing. The room that box is in is somewhat chilly, sure, but certainly not as cold as the drive seems to think it is.

Is there a way to suppress this error and keep smartd/smartctl from yelling at me?

code:
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- temperature is above or below threshold

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x02
Temperature:                        1 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    2%
Data Units Read:                    14,487,976 [7.41 TB]
Data Units Written:                 23,213,323 [11.8 TB]
Host Read Commands:                 232,577,969
Host Write Commands:                704,986,803
Controller Busy Time:               17,643
Power Cycles:                       74
Power On Hours:                     13,296
Unsafe Shutdowns:                   33
Media and Data Integrity Errors:    0
Error Information Log Entries:      123
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0


BlankSystemDaemon
Mar 13, 2009



Drive's definitely chilling out.

I'm pretty sure you can turn off S.M.A.R.T if all else fails, but I would hope that Proxmox has the option, somewhere, to disable individual warnings unless the state changes further.
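If Proxmox doesn't expose it, one possible approach at the smartd level - a sketch, not a drop-in, with placeholder device names and directives that should be checked against smartd.conf(5) - is to replace the catch-all DEVICESCAN line with explicit entries, leaving out `-H` (the overall health check, which is what reports FAILED!) for the cold drive:

```
# /etc/smartd.conf - sketch only; verify directives against smartd.conf(5).
# The cold NVMe: keep error-log monitoring and mail, but skip the overall
# health check (-H), which is what trips on the temperature warning bit.
/dev/nvme0 -l error -m root

# Other drives keep full monitoring (-a implies -H among others).
/dev/sda -a -m root
```

Restart smartd afterwards for the change to take effect. The trade-off is that a genuine health failure on that NVMe would also go unreported, so it's worth keeping the error-log monitoring on.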
