PRADA SLUT
Mar 14, 2006

Inexperienced,
heartless,
but even so
Does anyone know how I would diagnose stuttering on a Plex video from a Synology NAS?

It only applies to some videos so I think there's a change in how I encoded them, but I'm not sure exactly what parameters I should look for (e.g., if the server can't decode some video format natively and is trying to transcode it, if there's an audio stream transcoding, if the server is doing something else as well, etc).

I'm pretty sure it's not a bandwidth issue, but I could check that as well.

Thanks Ants
May 21, 2004

#essereFerrari


Look at the stats on whatever you're playing the video on - it will tell you if you're hitting CPU limits, running out of RAM etc.

It's unlikely to be bandwidth but I had a nightmare playing some content back on an old Raspberry Pi 3 over Wi-Fi because the wireless radio is so incredibly bad on that device.

IOwnCalculus
Apr 2, 2003





School of How posted:

I had planned on using motherboard RAID. Is it really that bad? My motherboard is a B550 motherboard that I bought in 2020, so the tech on it should be pretty recent. How do I go about setting up OS RAID on ubuntu?

If you're using Ubuntu there is no reason you should run whatever RAID implementation your motherboard has. The RAID1 your motherboard creates might be portable or readable on other systems, it might not. There's absolutely no standardization with that sort of thing.

The best option would be ZFS, since it's massively portable across most flavors of Linux as well as BSD, and it's also more reliable than other solutions. Especially if your use case is just a mirrored pair of disks.

PRADA SLUT
Mar 14, 2006

Inexperienced,
heartless,
but even so

Thanks Ants posted:

Look at the stats on whatever you're playing the video on - it will tell you if you're hitting CPU limits, running out of RAM etc.

It's unlikely to be bandwidth but I had a nightmare playing some content back on an old Raspberry Pi 3 over Wi-Fi because the wireless radio is so incredibly bad on that device.

Where do you see that? It happens with the Mac Plex app and the AppleTV Plex app, but the file itself is okay, even when opened as a file over the network in VLC.

Thanks Ants
May 21, 2004

#essereFerrari


This was in Kodi using their terrible SMB client so I was probably asking for the performance issues

Edit: Oh, you were asking where to see the stats. I think you're probably SOL for the Apple TV app. I really doubt you're having issues with your Mac not being quick enough either, so I would look at whether you're transcoding that video. Synology are really good at putting pretty anaemic hardware in their NAS boxes considering the price, so unless you specifically have one designed to transcode (and Plex can make use of that), you're probably running into the limits of what a six year old Atom is capable of.

Thanks Ants fucked around with this message at 23:57 on Feb 10, 2022

withoutclass
Nov 6, 2007

Resist the siren call of rhinocerosness

College Slice

PRADA SLUT posted:

Where do you see that? It happens with the Mac Plex app and the AppleTV Plex app, but the file itself is okay, even when opened as a file over the network in VLC.

Check out the dashboard on your Plex server when streaming.

PRADA SLUT
Mar 14, 2006

Inexperienced,
heartless,
but even so

withoutclass posted:

Check out the dashboard on your Plex server when streaming.

On the dashboard, it's listed as:

Video:
4K (HEVC Main 10 HDR)
Direct Play

Audio:
English (AC3 5.1)
Direct Play

The little box claims 31Mbps and I normally see about that much, but sometimes the bandwidth spikes to 350Mbps. That shouldn't matter, since I'm wired on a Gigabit connection, and I get 940Mbps to the internet (if that matters). Is it like a disk speed thing? I have two 10TB 7200RPM IronWolfs.

DS218+
2-core 2 GHz Intel Celeron J3355
10 GB RAM

e: bandwidth spikes don't correlate with lag / frame drops. They seem to happen independently (or not) of the video.
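
For a sense of scale, here is a quick back-of-the-envelope sketch in Python that converts the dashboard numbers to megabytes per second and to a share of the link. The 1000 Mbps default is the nominal gigabit line rate, an assumption that ignores protocol overhead, and the function names are made up for illustration.

```python
def mbps_to_mbytes_per_s(mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8


def link_share(mbps: float, link_mbps: float = 1000.0) -> float:
    """Fraction of the link in use; link_mbps defaults to nominal gigabit (assumed)."""
    return mbps / link_mbps


# The two figures quoted from the Plex dashboard above.
for label, rate in (("typical", 31), ("spike", 350)):
    print(f"{label}: {mbps_to_mbytes_per_s(rate):.2f} MB/s, "
          f"{link_share(rate):.1%} of the link")
```

Even the 350Mbps spike works out to under 44 MB/s, about a third of a gigabit link, so neither figure points at the network.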

PRADA SLUT fucked around with this message at 01:04 on Feb 11, 2022

Klyith
Aug 3, 2007

GBS Pledge Week

School of How posted:

I had planned on using motherboard RAID. Is it really that bad? My motherboard is a B550 motherboard that I bought in 2020, so the tech on it should be pretty recent. How do I go about setting up OS RAID on ubuntu?

It's not terribly bad, there's just zero advantage to it other than easy setup. Meanwhile it does have some downsides: lack of portability, lack of good feedback and error logging.

For example, let's say your mobo dies. You can move the drives to a new mobo, but the process for restoring a functional mirror is to pick one drive as the "good" one, turn on mirroring, and wait for the mobo to rebuild the RAID by copying everything from the "good" drive. So for one full drive write you just have to guess whether the drive you picked actually has errors on it.


Here are instructions for creating a ZFS mirror set on ubuntu. You should probably use that.

School of How
Jul 6, 2013

quite frankly I don't believe this talk about the market
The more I think about it, I think what I should go with is one of these: https://www.newegg.com/asustor-as1104t/p/N82E16822225067

I'd rather not have to always be loving with the command line to get my files working. I'd rather just have a device I put drives into, and it just sits there and works with little hassle on my part. Does anybody know of a NAS enclosure model or brand that stands out as the best? I already have an Nvidia Shield, so I don't need video streaming capabilities. I just need something that will serve files. It would be nice if there was a sleep feature, since I'm only going to be using the server a few hours a day. I'd prefer it to spin down the drives when they're not in use, so it isn't burning electricity while idle.

School of How fucked around with this message at 02:45 on Feb 11, 2022

Klyith
Aug 3, 2007

GBS Pledge Week

School of How posted:

The more I think about it, I think what I should go with is one of these: https://www.newegg.com/asustor-as1104t/p/N82E16822225067

I'd rather not have to always be loving with the command line to get my files working. I'd rather just have a device I put drives into, and it just sits there and works with little hassle on my part.

Yeah, since you said you were running a Linux machine I figured you were comfortable there. If not, and you have the extra money, just get a NAS box. Plug in drives and forget all this poo poo.


That Asustor one you linked is very cheap for a 4-bay NAS: that's because it's got 4 drive bays and no frills. Cheap ARM processor, 1GB RAM, etc. If all you care about is basic file storage it's just fine. If you want to do things like run a Plex server on your NAS box or other multipurpose things you should aim to spend more, or look at 2-bay devices like the Synology DS220+ that are about the same price.

Any commercial NAS box will have the ability to automatically spin down the disks.

mmkay
Oct 21, 2010

School of How posted:

edit: just found a guide: https://kifarunix.com/setup-software-raid-on-ubuntu-20-04/, holy crap that looks like too much work, and will likely break at some point in the future.

This looks like 4 commands, only half of which are specific to RAID (and one of them is just installing the RAID software)?

r u ready to WALK
Sep 29, 2001

i would not use the old md/dm mirroring in linux when zfs is available, even if Torvalds still says "don't use zfs" because of the licensing.
it's just so much easier to set up stuff like snapshots on zfs, you can even make them show up as previous versions on windows clients: https://github.com/zfsonlinux/zfs-auto-snapshot/wiki/Samba

IOwnCalculus
Apr 2, 2003





Agreed, it's 2022, don't use mdraid. Even if you never snapshot or anything like that, it's more reliable.

On a running Ubuntu server you're only looking at three commands to get your ZFS array up, and the second one is just to get the paths to your specific drives.

code:
sudo apt install zfsutils-linux
ls /dev/disk/by-id
sudo zpool create tank mirror /dev/disk/by-id/path-to-drive-1 /dev/disk/by-id/path-to-drive-2
That's it, you have a mirror mounted at /tank now. It would also be a good idea to set up regular scrubs of the pool by adding something like this to your root crontab:
code:
0 0 1 * * /sbin/zpool scrub tank
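
A scrub is only useful if someone looks at the result afterwards. Here is a rough Python sketch of one way to check, assuming Python 3.9+ on the server and the plain-text `pool:` / `state:` lines that `zpool status` prints. The function names are made up for illustration.

```python
import re
import subprocess


def pool_states(status_text: str) -> dict[str, str]:
    """Map pool name -> state, taken from the 'pool:' and 'state:' lines."""
    states = {}
    pool = None
    for line in status_text.splitlines():
        m = re.match(r"\s*pool:\s*(\S+)", line)
        if m:
            pool = m.group(1)
            continue
        m = re.match(r"\s*state:\s*(\S+)", line)
        if m and pool is not None:
            states[pool] = m.group(1)
            pool = None
    return states


def unhealthy_pools(status_text: str) -> list[str]:
    """Names of pools whose state is anything other than ONLINE."""
    return [name for name, state in pool_states(status_text).items()
            if state != "ONLINE"]


def read_status() -> str:
    """Run `zpool status` and return its output (needs ZFS installed)."""
    return subprocess.run(["zpool", "status"], capture_output=True,
                          text=True, check=True).stdout
```

Calling `unhealthy_pools(read_status())` from the same crontab and mailing yourself anything non-empty would do the job.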

Chumbawumba4ever97
Dec 31, 2000

by Fluffdaddy

PRADA SLUT posted:

Where do you see that? It happens with the Mac Plex app and the AppleTV Plex app, but the file itself is okay, even when opened as a file over the network in VLC.

AppleTV is known to be really lovely with Plex. AppleTV doesn't even support TrueHD/Atmos audio so there's always AppleTV users complaining that they don't hear audio in Plex.

bobfather
Sep 20, 2001

I will analyze your nervous system for beer money

Chumbawumba4ever97 posted:

AppleTV is known to be really lovely with Plex. AppleTV doesn't even support TrueHD/Atmos audio so there's always AppleTV users complaining that they don't hear audio in Plex.

That’s... untrue? In fact, ATV can flawlessly match color space and frame rate. Sure, it transcodes TrueHD to 7.1 PCM, but that still works fine for anyone without Atmos height channels.

BlackMK4
Aug 23, 2006

wat.
Megamarm
I've been really happy with my ATVs and plex for years and years. I do use Infuse (https://firecore.com/infuse) on my AppleTV 4k since super high bitrate 4k content can sometimes not play right in the Plex app.

e: quoting for visibility

PRADA SLUT posted:

Does anyone know how I would diagnose stuttering on a Plex video from a Synology NAS?

This is what mine did when I say "not play right"

BlackMK4 fucked around with this message at 21:56 on Feb 11, 2022

CopperHound
Feb 14, 2012

Is there a way I can tell if this is a physical problem with a drive vs bad cable or something?
code:
  pool: pool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 09:17:56 with 0 errors on Sun Jan 30 18:34:17 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        pool                                            DEGRADED     0     0 0
          raidz2-0                                      DEGRADED     0     0 0
            gptid/f705a0ef-5dce-11ec-bda2-002590decd63  ONLINE       0     0 0
            gptid/f88eb502-5dce-11ec-bda2-002590decd63  ONLINE       0     0 0
            gptid/d2b88424-5cbc-11ec-bda2-002590decd63  FAULTED    128   245 0  too many errors
            gptid/d4b42333-5cbc-11ec-bda2-002590decd63  ONLINE       0     0 0
            gptid/d6339699-5cbc-11ec-bda2-002590decd63  ONLINE       0     0 0
            gptid/d7221c2f-5cbc-11ec-bda2-002590decd63  ONLINE       0     0 0
            gptid/d819727c-5cbc-11ec-bda2-002590decd63  ONLINE       0     0 0
            gptid/d9215ba7-5cbc-11ec-bda2-002590decd63  ONLINE       0     0 0
        cache
          gptid/a8022ab5-69ed-11ec-877e-002590decd63    ONLINE       0     0 0
Here are my SMART results
code:
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5
Device Model:     ST10000DM0004-1ZC101
Serial Number:    ZA2E1RBL
LU WWN Device Id: 5 000c50 0c845465a
Firmware Version: DN01
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Feb 11 08:48:51 2022 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  567) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 835) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   064   044    Pre-fail  Always       -       65897984
  3 Spin_Up_Time            0x0003   089   089   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       71
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   088   060   045    Pre-fail  Always       -       564858735
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       3422 (165 204 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       40
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   084   084   000    Old_age   Always       -       16
190 Airflow_Temperature_Cel 0x0022   073   052   040    Old_age   Always       -       27 (Min/Max 20/33)
191 G-Sense_Error_Rate      0x0032   099   099   000    Old_age   Always       -       3926
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       31
193 Load_Cycle_Count        0x0032   099   099   000    Old_age   Always       -       2099
194 Temperature_Celsius     0x0022   027   048   000    Old_age   Always       -       27 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   004   001   000    Old_age   Always       -       65897984
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2962h+51m+07.095s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       55898212113
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       189558020462

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3422         -
# 2  Extended offline    Completed without error       00%      3112         -
# 3  Short offline       Completed without error       00%      3098         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Methylethylaldehyde
Oct 23, 2004

BAKA BAKA

CopperHound posted:

Is there a way I can tell if this is a physical problem with a drive vs bad cable or something?


195 Hardware_ECC_Recovered 0x001a 004 001 000 Old_age Always - 65897984

Suggests that the drive itself might be having issues, since that's a number that would be independent of any cabling issues.

Klyith
Aug 3, 2007

GBS Pledge Week

CopperHound posted:

Is there a way I can tell if this is a physical problem with a drive vs bad cable or something?

Here are my SMART results

Of all the SMART stuff that backblaze thinks are relevant to drive death, you have zeros in all of them. The only non-zero thing that might be of note is 189 high fly writes, and that's only really bad if you get a cluster of them. So if those 16 high fly events were spread out over the whole drive's operating life, it's probably fine. Seagate's definitely not gonna replace it.
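
For reference, the five attributes Backblaze has said it watches are 5, 187, 188, 197 and 198. Below is a small Python sketch for pulling them out of a `smartctl -A` table, assuming the ten-column layout shown in the output above. The helper names are hypothetical.

```python
import re

# SMART attribute IDs Backblaze has reported as correlating with drive failure.
WATCHED_IDS = {5, 187, 188, 197, 198}


def watched_raw_values(smart_table: str) -> dict[int, str]:
    """Return {attribute id: RAW_VALUE text} for the watched attributes."""
    found = {}
    for line in smart_table.splitlines():
        # ID#, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header, blank line, or anything that is not an attribute row
        attr_id = int(fields[0])
        if attr_id in WATCHED_IDS:
            found[attr_id] = fields[9].strip()
    return found


def nonzero_watched(smart_table: str) -> list[int]:
    """IDs of watched attributes whose raw value contains any non-zero number."""
    return sorted(
        attr_id
        for attr_id, raw in watched_raw_values(smart_table).items()
        if any(int(n) != 0 for n in re.findall(r"\d+", raw))
    )
```

An empty list from `nonzero_watched` is the "zeros in all of them" case.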


I'd say drive is probably not the problem, so maybe replace the cable, mark it clear, and see if it stops producing errors? Or pull the drive and give it a full test on a separate machine, if you want to be sure?


Methylethylaldehyde posted:

195 Hardware_ECC_Recovered 0x001a 004 001 000 Old_age Always - 65897984

Suggests that the drive itself might be having issues, since that's a number that would be independent of any cabling issues.

Hardware_ECC_Recovered is not an incrementing number of events, it's the time between events. High is good.

CopperHound
Feb 14, 2012

Klyith posted:

I'd say drive is probably not the problem, so maybe replace the cable, mark it clear, and see if it stops producing errors? Or pull the drive and give it a full test on a separate machine, if you want to be sure?
In my haste I forgot to say what I tried so far. I marked it clear a couple weeks ago and it went through a resilver scrub twice without errors, but they just came back yesterday.

For now, if there isn't anything that looks like a definite hardware failure, I'll swap bays and see if it follows the drive before buying another sas to 4x sata cable.

CopperHound fucked around with this message at 19:17 on Feb 11, 2022

BlankSystemDaemon
Mar 13, 2009




This is explicitly marked as a pre-fail attribute:
pre:
Seek_Error_Rate         0x000f   088   060   045    Pre-fail  Always       -       564858735
Pre-fail means the manufacturer considers this attribute an indicator that a drive could be failing. On top of that, there's evidence that it isn't the cable: the errors show up in both the READ and WRITE columns, and the counts aren't in the low digits (unless they've been growing steadily over a long time). Then there's this, which is also a pre-fail attribute:
pre:
Raw_Read_Error_Rate     0x000f   078   064   044    Pre-fail  Always       -       65897984

Methylethylaldehyde posted:

195 Hardware_ECC_Recovered 0x001a 004 001 000 Old_age Always - 65897984

Suggests that the drive itself might be having issues, since that's a number that would be independent of any cabling issues.
Hardware ECC Recovered being that high is also worrying, yeah - but at least that part's actually working (in that the errors are being recovered, as opposed to not).

CopperHound posted:

In my haste I forgot to say what I tried so far. I marked it clear a couple weeks ago and it went through a resilver twice without errors, but they just came back yesterday.

For now, if there isn't anything that looks like a definite hardware failure, I'll swap bays and see if it follows the drive before buying another sas to 4x sata cable.
The drive is absolutely in the process of failing.

EDIT: Doesn't /var/log/messages contain tons of messages from CAM about READs and WRITEs having failed?

BlankSystemDaemon fucked around with this message at 18:54 on Feb 11, 2022

CopperHound
Feb 14, 2012

BlankSystemDaemon posted:

EDIT: Doesn't /var/log/messages contain tons of messages from CAM about READs and WRITEs having failed?
I'm not sure why most of the messages refer to da1 before kicking da0 out of the pool.
code:
Feb 11 09:36:49 nas mps0: Controller reported scsi ioc terminated tgt 1 SMID 1887 loginfo 31110d00
Feb 11 09:36:49 nas mps0: Controller reported scsi ioc terminated tgt 1 SMID 263 loginfo 31110d00
Feb 11 09:36:49 nas (da1:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 02 38 0c 1b e0 00 00 00 08 00 00
Feb 11 09:36:49 nas (da1:mps0:0:1:0): CAM status: CCB request completed with an error
Feb 11 09:36:49 nas (da1:mps0:0:1:0): Retrying command, 3 more tries remain
Feb 11 09:36:49 nas (da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 01 af 6c 2c 10 00 00 00 28 00 00
Feb 11 09:36:49 nas (da1:mps0:0:1:0): CAM status: CCB request completed with an error
Feb 11 09:36:49 nas (da1:mps0:0:1:0): Retrying command, 3 more tries remain
Feb 11 09:36:49 nas mps0: Controller reported scsi ioc terminated tgt 1 SMID 799 loginfo 31170000
Feb 11 09:36:49 nas mps0: Controller reported scsi ioc terminated tgt 1 SMID 1087 loginfo 31170000
Feb 11 09:36:49 nas (da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 01 af 6c 2c 10 00 00 00 28 00 00
Feb 11 09:36:49 nas (da1:mps0:0:1:0): CAM status: CCB request completed with an error
Feb 11 09:36:49 nas (da1:mps0:0:1:0): Retrying command, 2 more tries remain
Feb 11 09:36:49 nas (da1:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 02 38 0c 1b e0 00 00 00 08 00 00
Feb 11 09:36:49 nas (da1:mps0:0:1:0): CAM status: CCB request completed with an error
Feb 11 09:36:49 nas (da1:mps0:0:1:0): Retrying command, 2 more tries remain
Feb 11 09:36:49 nas mps0: mpssas_prepare_remove: Sending reset for target ID 1
Feb 11 09:36:49 nas da1 at mps0 bus 0 scbus0 target 1 lun 0
Feb 11 09:36:49 nas da1: <ATA WDC WD80EZAZ-11T 0A83>  s/n 7SJ7HVAW             detached
Feb 11 09:36:49 nas mps0: No pending commands: starting remove_device
Feb 11 09:36:49 nas (da1:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 02 38 0c 1b e0 00 00 00 08 00 00
Feb 11 09:36:49 nas (da1:mps0:0:1:0): CAM status: CCB request aborted by the host
Feb 11 09:36:49 nas (da1:mps0:0:1:0): Error 5, Periph was invalidated
Feb 11 09:36:49 nas (da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 01 af 6c 2c 10 00 00 00 28 00 00
Feb 11 09:36:49 nas (da1:mps0:0:1:0): CAM status: CCB request aborted by the host
Feb 11 09:36:49 nas mps0: (da1:mps0:0:1:0): Error 5, Periph was invalidated
Feb 11 09:36:49 nas Unfreezing devq for target ID 1
Feb 11 09:36:50 nas (da1:mps0:0:1:0): Periph destroyed
Feb 11 09:36:52 nas mps0: Controller reported scsi ioc terminated tgt 0 SMID 1681 loginfo 31110d00
Feb 11 09:36:52 nas mps0: Controller reported scsi ioc terminated tgt 0 SMID 1797 loginfo 31110d00
Feb 11 09:36:52 nas mps0: mpssas_prepare_remove: Sending reset for target ID 0
Feb 11 09:36:52 nas da0 at mps0 bus 0 scbus0 target 0 lun 0
Feb 11 09:36:52 nas da0: <ATA ST10000DM0004-1Z DN01>  s/n ZA2E1RBL detached
Feb 11 09:36:52 nas mps0: No pending commands: starting remove_device
Feb 11 09:36:52 nas mps0: Unfreezing devq for target ID 0
Feb 11 09:36:52 nas xptioctl: pass driver is not in the kernel
Feb 11 09:36:52 nas xptioctl: put "device pass" in your kernel config file
Feb 11 09:36:55 nas (da0:mps0:0:0:0): Periph destroyed
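
One way to see which device the errors belong to is to tally them per device. This is a rough Python sketch, assuming log lines shaped like the excerpt above. The regexes are guesses fitted to these lines, not a general parser for FreeBSD's log format, and `summarize` is a made-up name.

```python
import re
from collections import Counter

# "(da1:mps0:0:1:0): CAM status: ..." -> device name is the first field in the parens
CAM_STATUS = re.compile(r"\((\w+):\w+:\d+:\d+:\d+\): CAM status: (.*)")
# "da1: <ATA ...>  s/n XXXX   detached" -> device name before the colon
DETACHED = re.compile(r"\b(\w+): <[^>]*>\s+s/n\s+\S+\s+detached")


def summarize(log_text: str) -> tuple[Counter, list[str]]:
    """Count CAM status errors per device and list devices that detached, in order."""
    errors = Counter()
    detached = []
    for line in log_text.splitlines():
        m = CAM_STATUS.search(line)
        if m:
            errors[m.group(1)] += 1
            continue
        m = DETACHED.search(line)
        if m:
            detached.append(m.group(1))
    return errors, detached
```

Run against the excerpt above, this counts CAM errors only for da1, while both da1 and da0 show up as detached.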

Sir Bobert Fishbone
Jan 16, 2006

Beebort

Sir Bobert Fishbone posted:

I have what looks on its face to be the dumbest SMART error. This is the smartctl output for a 1TB Gigabyte nvme drive on my Proxmox server, which has begun emailing me every day that the drive is failing. The room that box is in is somewhat chilly, sure, but certainly not as cold as the drive seems to think it is.

Is there a way to suppress this error and keep smartd/smartctl from yelling at me?

code:
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- temperature is above or below threshold

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x02
Temperature:                        1 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    2%
Data Units Read:                    14,487,976 [7.41 TB]
Data Units Written:                 23,213,323 [11.8 TB]
Host Read Commands:                 232,577,969
Host Write Commands:                704,986,803
Controller Busy Time:               17,643
Power Cycles:                       74
Power On Hours:                     13,296
Unsafe Shutdowns:                   33
Media and Data Integrity Errors:    0
Error Information Log Entries:      123
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

I migrated my entire VM stack from Proxmox to ESXi free edition this week, and this drive is absolutely failing in weird and wonderful ways. It lasts all of 2-10 minutes after reboot before VMWare decides it's inaccessible and shuts it down. I installed Windows on that box bare-metal just so I could run Gigabyte's special SSD tool, and that tool reports everything as being totally fine, including normal temperature readings. Going to have to call it a loss and move on, I think. Just wish it behaved consistently.

ComWalk
Mar 4, 2007

BlankSystemDaemon posted:

This is explicitly marked as a pre-fail attribute:
pre:
Seek_Error_Rate         0x000f   088   060   045    Pre-fail  Always       -       564858735
Pre-fail means the manufacturer considers this attribute an indicator that a drive could be failing. On top of that, there's evidence that it isn't the cable: the errors show up in both the READ and WRITE columns, and the counts aren't in the low digits (unless they've been growing steadily over a long time). Then there's this, which is also a pre-fail attribute:
pre:
Raw_Read_Error_Rate     0x000f   078   064   044    Pre-fail  Always       -       65897984
Hardware ECC Recovered being that high is also worrying, yeah - but at least that part's actually working (in that the errors are being recovered, as opposed to not).

The drive is absolutely in the process of failing.

EDIT: Doesn't /var/log/messages contain tons of messages from CAM about READs and WRITEs having failed?

If it is failing, I don't think it has anything to do with those counters. If this is anything to go by, they actually show no errors at all, since there's nothing past 32 bits in any of those counts. In fact, briefly looking at every Seagate drive I was able to find in anything I could SSH to, they're all like that: large values, but since none of bits 32-47 are set, no actual errors. Either that, or that dude is wrong and every Seagate drive is failing at about the same time.

So it's probably worth chasing down controller/cable issues at this point still since the smart stuff seems fine.
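
To make the bit-splitting concrete, here is a minimal Python sketch. It assumes the layout under discussion: the upper 16 bits of the 48-bit raw value are the error count and the lower 32 bits are an operation count. That layout is the unofficial interpretation, not anything Seagate documents, and the function name is made up.

```python
def split_seagate_raw(raw: int) -> tuple[int, int]:
    """Split a 48-bit raw value into (upper 16 bits, lower 32 bits).

    Under the assumed layout, the first element is the error count and the
    second is the number of operations counted so far.
    """
    if not 0 <= raw < 1 << 48:
        raise ValueError("raw value does not fit in 48 bits")
    return raw >> 32, raw & 0xFFFFFFFF


# Raw values copied from the smartctl output posted earlier in the thread.
for name, raw in (("Seek_Error_Rate", 564858735),
                  ("Raw_Read_Error_Rate", 65897984)):
    errors, operations = split_seagate_raw(raw)
    print(f"{name}: errors={errors}, operations={operations}")
```

Both raw values are below 2^32 (4294967296), so the upper half comes out as zero for each.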

BlankSystemDaemon
Mar 13, 2009




ComWalk posted:

If it is failing, I don't think it has anything to do with those counters. If this is anything to go by, they actually show no errors at all, since there's nothing past 32 bits in any of those counts. In fact, briefly looking at every Seagate drive I was able to find in anything I could SSH to, they're all like that: large values, but since none of bits 32-47 are set, no actual errors. Either that, or that dude is wrong and every Seagate drive is failing at about the same time.

So it's probably worth chasing down controller/cable issues at this point still since the smart stuff seems fine.
I never look at the normalized values from smartmontools, because they're never the same across manufacturers.
The source for all of these claims comes exclusively from this Frank Zabkar, and all around the same time (early February 2009), based on references to threads on Usenet that he participated in - so I'm not really sure that they're worth trusting.
If Seagate goes out of the way to obfuscate raw information values, then the next best thing we can do is compare it with similar drives in the same array (assuming there are any).

There's definitely an issue and I still suspect the drive, irrespective of the S.M.A.R.T attributes - because not only is CAM (part of the storage subsystem) producing the exact errors I'd expect to see when failing to issue WRITE or READ operations on FreeBSD, but ZFS wouldn't be FAULTing the drive if that wasn't happening.
A cable coming loose after running for about 100 days isn't that likely, unless there's been a recent maintenance window.

Combat Pretzel
Jun 23, 2004

No, seriously... what kurds?!
Regarding Seagate's SMART data. Things like error rates and whatnot are a combination of two values, one 16-bit and another 32-bit, leading to some 48-bit value that reads insane but may just mean nothing.

Those weird rear end seek and read error rates also irritated me, after installing my disks. I think the upper 16 bits count the actual errors, so I guess if the value is below 2 or 4 billion (gently caress if I know whether they're signed or not), you're fine.

--e:f,b

phosdex
Dec 16, 2005

Sir Bobert Fishbone posted:

I migrated my entire VM stack from Proxmox to ESXi free edition this week, and this drive is absolutely failing in weird and wonderful ways. It lasts all of 2-10 minutes after reboot before VMWare decides it's inaccessible and shuts it down. I installed Windows on that box bare-metal just so I could run Gigabyte's special SSD tool, and that tool reports everything as being totally fine, including normal temperature readings. Going to have to call it a loss and move on, I think. Just wish it behaved consistently.

What version of ESXi are you trying? It's probably unlikely if you only just installed it, but VMware pushed out a patch to vSphere 7 last year that they actually had to pull and tell people to revert; IIRC it had drive-killing problems. If you are on 7.0 U3 A or B, then you need to patch to C.

Klyith
Aug 3, 2007

GBS Pledge Week

BlankSystemDaemon posted:

I never look at the normalized values from smartmontools, because they're never the same across manufacturers.
If Seagate goes out of the way to obfuscate raw information values

Raw values are explicitly for the manufacturer to store information in whichever way they want to. Storing two values in 48 bits isn't obfuscation, it's efficient use of an ancient standard that has very limited space.

Seeing massive numbers on various raw values in smart is super duper common, I don't know how this can be new information to anyone who knows hard drives. There are decades of Qs on stackexchange and slashdot of people asking if this means their drive is dying.


If you always look at raw values and never reflect on how WORST is well above THRESH, and how treating that as a straight value would mean a 4:1 ratio of errors versus blocks read in the drive's entire lifetime, you must replace a lot of drives!

ComWalk
Mar 4, 2007

BlankSystemDaemon posted:

I never look at the normalized values from smartmontools, because they're never the same across manufacturers.
The source for all of these claims comes exclusively from this Frank Zabkar, and all around the same time (early February 2009) based on references to threads on Usenet that he participated in - so I'm not really sure that they're worth trusting.
If Seagate goes out of the way to obfuscate raw information values, then the next best thing we can do is compare it with similar drives in the same array (assuming there are any).

There's definitely an issue and I still suspect the drive, irrespective of the S.M.A.R.T attributes - because not only is CAM (part of the storage subsystem) producing the exact errors I'd expect to see when failing to issue WRITE or READ operations on FreeBSD, but ZFS wouldn't be FAULTing the drive if that wasn't happening.
A cable coming loose after running for about 100 days isn't that likely, unless there's been a recent maintenance window.

Yeah I was only looking at the raw value there which doesn't seem to be indicating errors at all. I get that random dude isn't super confidence inspiring but for what it's worth that method does result in matching normalized values on the first six drives I tested, so it seems really likely that the 16/32 bit values are what he suggests they are, and thus that the drive itself thinks that it is fine. Recent drives too, 2018-2021 manufacture dates. I don't trust drives to recognize that they're failing or anything but it doesn't look like there's a smoking gun there is all.

I'd be less worried about a cable coming loose and more about an SFF-8087 cable that got damaged during installation and just took days/weeks/months of vibration to finally come to a head. Some of them are annoyingly fragile.

Plus with that controller log showing da1 also giving the controller issues I'd be even more skeptical of the controller/cable. Two drives falling out within seconds of each other seems like there are bigger problems to chase down than just a single drive.

fake edit: actually, looking closer, those CAM errors were from the WDC drive? that does not look like a happy system

ComWalk fucked around with this message at 20:50 on Feb 11, 2022

Sir Bobert Fishbone
Jan 16, 2006

Beebort

phosdex posted:

What version of ESXi are you trying? If you just did it, probably unlikely but VMware pushed out a patch last year to vsphere 7 that they actually had to pull and tell people to revert, iirc it had drive killing problems. If you are on 7.0 U3 A or B, then you need to patch to C.

I tried 7.0 U3C, as well as the most recent 6.7. I'll have to look into that failed patch, though, because I was running ESXi late last year on this hardware. Thanks for the heads up.

CopperHound
Feb 14, 2012

ComWalk posted:

fake edit: actually, looking closer, those CAM errors were from the WDC drive? that does not look like a happy system

Okay, I'm dumb. All those errors are from me swapping drive bays. I thought my logs were on UTC instead of local time. I have no CAM messages from the past several days other than that.

Cantide
Jun 13, 2001
Pillbug

IOwnCalculus posted:

Agreed, it's 2022, don't use mdraid. Even if you never snapshot or anything like that, it's more reliable.

On a running Ubuntu server you're only looking at three commands to get your ZFS array up, and the second one is just to get the paths to your specific drives.

code:
sudo apt install zfsutils-linux
ls /dev/disk/by-id
sudo zpool create tank mirror /dev/disk/by-id/path-to-drive-1 /dev/disk/by-id/path-to-drive-2
That's it, you have a mirror mounted at /tank now. It would also be a good idea to set up regular scrubs of the pool by adding something like this to your root crontab:
code:
0 0 1 * * /sbin/zpool scrub tank

Is it also possible to retrofit my current OS drive into a bootable mirrored zpool?
I went with mdadm because I read doing that with zfs would be hell. (this is roughly the guide I followed: https://feeding.cloud.geek.nz/posts/setting-up-raid-on-existing/)

SuperiorColliculus
Oct 31, 2011

So I’ve moved countries, and have set up a temporary Plex/file server on an old computer running Windows I had kicking around my parent’s place while I wait for my actual server to get here via the slow boat from China (almost, but not literally).

I have an NTFS formatted 14TB drive I brought with me almost full, with a view to having something to watch during quarantine and then to build an ad-hoc server. Everything was going fine until I noticed writes/copies by Sonarr/Radarr from the 256GB SSD OS drive to the 14TB drive were failing. Reading from the data already on the drive is fine. In windows, the partition (all one partition, GPT, basic) shows up fine in computer management, but various “partition” programs show the space as raw or unpartitioned.

I’m assuming the partition table is messed in a non-obvious way. Is there a piece of software I’m overlooking that will let me non-destructively rebuild the partition table? I don’t have another 14TB drive (here - I have like 40TB in my server which will be here in two months) to copy all this data to, and I’d rather continue being able to watch the movies/tv I already have on there. Obviously the easy way is duplicate the data, reformat the drive, but I just don’t have that capability right here right now.

Klyith
Aug 3, 2007

GBS Pledge Week

SuperiorColliculus posted:

So I’ve moved countries, and have set up a temporary Plex/file server on an old computer running Windows I had kicking around my parent’s place while I wait for my actual server to get here via the slow boat from China (almost, but not literally).

I have an NTFS formatted 14TB drive I brought with me almost full, with a view to having something to watch during quarantine and then to build an ad-hoc server. Everything was going fine until I noticed writes/copies by Sonarr/Radarr from the 256GB SSD OS drive to the 14TB drive were failing. Reading from the data already on the drive is fine. In windows, the partition (all one partition, GPT, basic) shows up fine in computer management, but various “partition” programs show the space as raw or unpartitioned.

I’m assuming the partition table is messed in a non-obvious way. Is there a piece of software I’m overlooking that will let me non-destructively rebuild the partition table?

:chloe: Do all writes to the 14TB drive fail? Like if you copy a small 10-20KB file over in explorer, does that work?

How old is the windows PC that you put this drive into? I'm wondering if it's old enough that you've got compatibility problems with a 4k native drive or something like that. If the PC hardware is really old, that maybe could be the problem? This is a bit of a stretch, I'd have thought a problem like that would result in the controller just not being able to recognize the drive... But if this is at all possible I would avoid doing anything to the drive in that PC, especially writes to the GPT area. Any writes could just be barfing bad data on it.

Aside from that, have you already done a full chkdsk /R scan? On a 14TB drive that's gonna take a while, but it's the most non-destructive starting point. Theoretically resizing an NTFS partition in Windows Disk Management should be non-destructive. That would re-write the relevant GPT info. Feels like playing with fire in this case.

The Linux tool gdisk can recover GPT by copying the backup GPT from the end of the drive to the "normal" beginning-of-drive table. But I don't think this makes sense as a solution -- modern partition programs should look in both places as well.
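
The gdisk recovery mentioned above relies on GPT keeping a second copy of its header in the last sector of the disk. Before letting any tool write to the drive, a read-only look at both headers can show whether either one is actually damaged. A minimal sketch, assuming 512-byte logical sectors (a 4Kn drive would need sector_size=4096) and a path Python can open for reading, such as a disk image or a Linux block device. The function names are purely illustrative, and only the header CRC is checked, not the partition entry array.

```python
import struct
import zlib

GPT_SIGNATURE = b"EFI PART"
MIN_HEADER_SIZE = 92  # size of the defined GPT header fields


def check_gpt_header(sector: bytes) -> dict:
    """Validate one GPT header sector and return a few parsed fields."""
    if len(sector) < MIN_HEADER_SIZE or sector[:8] != GPT_SIGNATURE:
        return {"valid": False}
    # Offsets 12 and 16: header size and header CRC32 (little-endian).
    header_size, stored_crc = struct.unpack_from("<II", sector, 12)
    # Offsets 24 and 32: LBA of this header and of the other copy.
    my_lba, alternate_lba = struct.unpack_from("<QQ", sector, 24)
    if not MIN_HEADER_SIZE <= header_size <= len(sector):
        return {"valid": False}
    # The CRC is computed over the header with its own CRC field zeroed.
    header = bytearray(sector[:header_size])
    header[16:20] = b"\x00" * 4
    crc_ok = (zlib.crc32(bytes(header)) & 0xFFFFFFFF) == stored_crc
    return {"valid": crc_ok, "my_lba": my_lba, "alternate_lba": alternate_lba}


def read_gpt_headers(path: str, sector_size: int = 512):
    """Read the primary (LBA 1) and backup (last sector) headers, read-only."""
    with open(path, "rb") as disk:
        disk.seek(sector_size)        # primary header lives at LBA 1
        primary = check_gpt_header(disk.read(sector_size))
        disk.seek(-sector_size, 2)    # backup header is the last sector
        backup = check_gpt_header(disk.read(sector_size))
    return primary, backup
```

If the backup header validates and the primary does not, that would be the situation gdisk's recovery is meant for. If both validate, the partition table is probably not the problem.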

CopperHound
Feb 14, 2012

CopperHound posted:

Okay, I'm dumb. All those errors are from me swapping drive bays. I thought my logs were on UTC instead of local time. I have no cam messages from past several days other than that.

Okay, the last errors in messages were from Feb 2nd even though I could have sworn I saw the activity light on that drive going the day before yesterday.
code:
Feb  2 19:21:54 nas     (da0:mps0:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 282 Command timeout on target 0(0x0010) 60000 set, 60.6313204 elapsed
Feb  2 19:21:54 nas mps0: Sending abort to target 0 for SMID 282
Feb  2 19:21:54 nas     (da0:mps0:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 282 Aborting command 0xfffffe00e9499af0
Feb  2 19:21:54 nas mps0: mpssas_action_scsiio: Freezing devq for target ID 0
Feb  2 19:21:54 nas (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 d7 e3 49 c8 00 00 00 08 00 00
Feb  2 19:21:54 nas (da0:mps0:0:0:0): CAM status: CAM subsystem is busy
Feb  2 19:21:54 nas (da0:mps0:0:0:0): Retrying command, 3 more tries remain
Feb  2 19:21:54 nas mps0: Controller reported scsi ioc terminated tgt 0 SMID 1595 loginfo 31130000
Feb  2 19:21:54 nas mps0: Controller reported scsi ioc terminated tgt 0 SMID 559 loginfo 31130000
Feb  2 19:21:54 nas mps0: Controller reported scsi ioc terminated tgt 0 SMID 545 loginfo 31130000
Feb  2 19:21:54 nas mps0: Controller reported scsi ioc terminated tgt 0 SMID 1192 loginfo 31130000
Feb  2 19:21:54 nas (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 6c 2d ff 80 00 00 00 18 00 00
Feb  2 19:21:54 nas mps0: (da0:mps0:0:0:0): CAM status: CCB request completed with an error
Feb  2 19:21:54 nas (da0:mps0:0:0:0): Retrying command, 3 more tries remain
Feb  2 19:21:54 nas (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 50 22 73 20 00 00 00 18 00 00
Feb  2 19:21:54 nas (da0:mps0:0:0:0): CAM status: CCB request completed with an error
Feb  2 19:21:54 nas (da0:mps0:0:0:0): Retrying command, 3 more tries remain
Feb  2 19:21:54 nas (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 50 22 73 08 00 00 00 18 00 00
Feb  2 19:21:54 nas Finished abort recovery for target 0
Feb  2 19:21:54 nas (da0:mps0:0:0:0): CAM status: CCB request completed with an error
Feb  2 19:21:54 nas (da0:mps0:0:0:0): Retrying command, 3 more tries remain
Feb  2 19:21:54 nas (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 50 22 72 f0 00 00 00 18 00 00
Feb  2 19:21:54 nas mps0: (da0:mps0:0:0:0): CAM status: CCB request completed with an error
Feb  2 19:21:54 nas (da0:mps0:0:0:0): Retrying command, 3 more tries remain
Feb  2 19:21:54 nas Unfreezing devq for target ID 0
Feb  2 19:21:54 nas (da0:mps0:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Feb  2 19:21:54 nas (da0:mps0:0:0:0): CAM status: Command timeout
Feb  2 19:21:54 nas (da0:mps0:0:0:0): Retrying command, 0 more tries remain
Feb  2 19:21:54 nas (da0:mps0:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Feb  2 19:21:54 nas (da0:mps0:0:0:0): CAM status: SCSI Status Error
Feb  2 19:21:54 nas (da0:mps0:0:0:0): SCSI status: Check Condition
Feb  2 19:21:54 nas (da0:mps0:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb  2 19:21:54 nas (da0:mps0:0:0:0): Error 6, Retries exhausted
Feb  2 19:21:54 nas (da0:mps0:0:0:0): Invalidating pack
also, the truenas core web terminal sucks for copying text.
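
If it helps with reading the log above, the READ(16) CDB bytes can be decoded to see which blocks the failed reads were aimed at. A rough sketch, assuming the standard READ(16) layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2-9, a 4-byte transfer length in bytes 10-13). The function name is made up, and whitespace is ignored because pasted logs do not always keep the byte spacing intact.

```python
from typing import Tuple


def decode_read16(cdb_text: str) -> Tuple[int, int]:
    """Decode a READ(16) CDB hex dump into (lba, block_count)."""
    # Drop all whitespace so "49 c8" and "49c8" parse the same way.
    cdb = bytes.fromhex("".join(cdb_text.split()))
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # starting logical block
    blocks = int.from_bytes(cdb[10:14], "big")  # number of blocks requested
    return lba, blocks


# First READ(16) from the log above
lba, blocks = decode_read16("88 00 00 00 00 01 d7 e3 49 c8 00 00 00 08 00 00")
print(hex(lba), blocks)
```

Failed reads scattered across very different LBAs would fit a link or controller problem a bit better than one bad patch of media, though a handful of log lines is not much to go on.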

Motronic
Nov 6, 2009

CopperHound posted:

also, the truenas core web terminal sucks for copying text.

Copying text is the least of it. I can't imagine trying to use that for anything.

Get a decent ssh client.

Klyith
Aug 3, 2007

GBS Pledge Week

Motronic posted:

Get a decent ssh client.

just type ssh on the windows command line, no need to get anything in win10 & 11

MS ported the whole thing natively

Motronic
Nov 6, 2009

Klyith posted:

just type ssh on the windows command line, no need to get anything in win10 & 11

MS ported the whole thing natively

Wait what?

<hang on>

What the hell? I'm still using a 1990s version of Reflection because that's what I'm used to. I had no idea that was actually on the CLI.

(will continue to use my old man client....and to be fair I don't use windows all that much other than on the laptop I happen to be on right now)

Klyith
Aug 3, 2007

GBS Pledge Week
Also if you get windows terminal you can put SSH sessions directly on the tab-launcher thing. poo poo's tight.

(As you can see I only have a Pi and a router at home, anyone with home servers and NAS stuff should definitely do this)

Motronic posted:

(will continue to use my old man client....and to be fair I don't use windows all that much other than on the laptop I happen to be on right now)

Between WSL2 and new poo poo like terminal, MS is really taking the "embrace and extend" thing to a whole new level... and that level is "what if windows, but also linux?"

I don't know if that will work out on the "extinguish" front for them, 'cause for me it's done more to make me want to move to linux full time.

Motronic
Nov 6, 2009

Klyith posted:

Also if you get windows terminal you can put SSH sessions directly on the tab-launcher thing. poo poo's tight.

Holy crap, it's an iTerm rip off. I love it. Thank you.
