bacon!
Dec 10, 2003

The fierce urgency of now
Problem description: I built a fileserver in 2013. It ran CentOS 6 and had two logical storage units:

1x 250 GB SSD (/boot, /, swap)
5x 3 TB drives, arranged in an mdadm RAID 5

The assembled mdadm array was tagged as a PV for LVM2, with a single VG on top of that, and finally a single LV spanning the entire ~11 TB of free space. I don't remember exactly why I chose to drop LVM on top of mdadm, but it was six years ago and I'm sure it felt like a good idea at the time.
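For reference, a stack like this would have been built roughly as follows. This is a hypothetical reconstruction of the layering, not the original commands: the device names (sdb through sdf, md0) and mkfs options are stand-ins, and these commands are destructive, so don't run them against disks holding data.

```shell
# Hypothetical reconstruction of the original mdadm + LVM + ext4 stack.
# sdb..sdf stand in for the five 3 TB WD Reds. DESTRUCTIVE - example only.
mdadm --create /dev/md0 --level=5 --raid-devices=5 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
pvcreate /dev/md0                     # tag the array as an LVM physical volume
vgcreate vg_nas /dev/md0              # single VG on top of the PV
lvcreate -l 100%FREE -n media vg_nas  # single LV spanning all free extents
mkfs.ext4 /dev/vg_nas/media           # ext4 filesystem on the LV
```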

Recently, this system began failing in strange ways: occasional reboots, consistent crashing of packages that previously worked fine, a corrupt RPM database, and some other problems. I ran checks on the RAID 5 array (using the mdadm scrub and e2fsck) and it appeared to be healthy, so my assumption is that some poor sysadmin skills let the OS get into a bunk state. That, combined with Plex ending support for CentOS 6, convinced me to just upgrade the OS to CentOS 7.

I backed up my LVM & mdadm configs before the install, hoping to get the RAID array back online quickly after a reinstall. The mdadm array re-assembled instantly with no issues. The physical volume struggled: there were fewer sectors available on the mdadm array after the OS reinstall than before, so the extent calculations in the backed-up LVM config were off by a few sectors. Original sector count: 23441080320. After reinstall: 23441072128. Weird. After tweaking the extent size of the PV, it came back online, and the VG & LV shortly followed.
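That missing-sector gap is suspicious on its own. Plugging the two counts from above into a quick shell calculation shows the difference is exactly 8192 sectors (4 MiB), a nice round number, and with 4 data members in a 5-disk RAID 5 that is a clean 1 MiB per disk. A shift like that is the sort of thing a changed mdadm data offset or superblock placement could produce after a reinstall (that last part is my speculation, not something the logs confirm):

```shell
# Sector counts taken from the post above (512-byte sectors).
old=23441080320   # array size reported under CentOS 6
new=23441072128   # array size reported after the CentOS 7 reinstall
diff=$((old - new))
echo "difference: $diff sectors ($((diff * 512 / 1048576)) MiB)"
# RAID 5 across 5 drives has 4 data members, so the per-disk shift is:
echo "per data disk: $((diff / 4)) sectors ($((diff / 4 * 512 / 1048576)) MiB)"
# -> difference: 8192 sectors (4 MiB); per data disk: 2048 sectors (1 MiB)
```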

*The big problem*: the OS now won't recognize that there is a valid ext4 filesystem on the LV, and it won't mount it (details of the PV issues are in the top part of this pastebin; details about trying to mount the LV are in the bottom):

https://pastebin.com/mLADR7xx
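One cheap, read-only check worth doing before heavier tools: an ext4 filesystem keeps its superblock 1024 bytes into the device, and the magic number 0xEF53 sits 0x38 bytes into the superblock, i.e. at absolute byte offset 1080. If `od` shows `53 ef` there on /dev/vg_nas/media, the filesystem starts where LVM thinks it does; if not, the LV is probably mapped at the wrong offset. The sketch below demonstrates the probe on a scratch file standing in for the device (swap in the real path for the actual check):

```shell
# Demonstrate the ext4 magic probe on a scratch file standing in for
# /dev/vg_nas/media. Magic bytes 0x53 0xef (0xEF53 little-endian) sit
# at absolute offset 1080 when the superblock is where it should be.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1 count=1080 2>/dev/null  # pad out to offset 1080
printf '\123\357' >> "$img"                            # write 0x53 0xef (octal)
magic=$(od -An -tx1 -j 1080 -N 2 "$img" | tr -d ' \n')
echo "$magic"   # 53ef -> ext4 superblock magic present at the expected spot
rm -f "$img"
```

On the real LV the probe is just `od -An -tx1 -j 1080 -N 2 /dev/vg_nas/media`, which reads two bytes and modifies nothing.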

Recent changes:

My next attempt was to run TestDisk and see if it could find the partition information somewhere on the device. However, about a minute in, the system went completely out of control: other processes started segfaulting, the CPU hung, and I kept getting kernel soft-lockup warnings. I was unable to even kill the process and had to perform a hard reboot:

Disk /dev/vg_nas/media - 12 TB / 10 TiB - 23441055744 sectors
Analyse sector 19136512/23441055743: 00%
Message from syslogd@illmatic at Aug 25 14:18:32 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [testdisk:15054]



Operating system: CentOS 6 / CentOS 7 (completely up to date, no packages outside of CentOS base & EPEL)

System specs:
ECS Elitegroup Micro ATX DDR3 1600 LGA 1150 Motherboards H87H3-M
Intel I3-4130T 2.90 3 LGA 1150 Processor BX80646I34130T
Crucial 240GB M500
Antec EarthWatts EA-380D Green 380 Watt 80 PLUS BRONZE Power Supply
5x WD RED 3TB

Location: US

I have Googled and read the FAQ: substantially.

1) Is there a better option than TestDisk for scanning the partition and getting the filesystem working again? I'm not super familiar with those forensic USB distros, but if any of them are better, they'd need to support LVM & mdadm.
2) Is it some kind of weird red flag that just scanning the disk caused a total system meltdown? Could something else be going on?
3) If you don't know how to help, at least put this handy LVM recovery presentation in your bookmarks, it's quite good and helped get me through the earlier steps. https://mbroz.fedorapeople.org/talks/LinuxAlt2009_2/lvmrecovery.pdf
Thanks in advance!


bacon!
Dec 10, 2003

The fierce urgency of now
A bit of an update - this is even more baffling.

I grabbed a recovery ISO with a bunch of test & recovery tools on it. Booted up and first ran an error check on each disk independently. No problems. Then I re-assembled the RAID array, enabled the LV, and tried to mount it. I got an error (expected), but this time the OS prompted me to run e2fsck. So I did; it fixed a handful of errors, and then I was able to mount it successfully, all files present and accounted for.

Afterwards, I booted back into CentOS 7, rebuilt the array, brought up the LV... and same crap!

[root@illmatic ~]# mount /dev/vg_nas/media /illmatic
mount: /dev/mapper/vg_nas-media is write-protected, mounting read-only
mount: unknown filesystem type '(null)'

This is pretty weird.
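Since the rescue environment mounts the filesystem cleanly, the ext4 data itself looks intact; the `unknown filesystem type '(null)'` error means libblkid found no recognizable signature at the start of the LV, which points at CentOS 7 mapping the stack at a different offset than the rescue ISO does. A hedged set of read-only diagnostics to run in both environments and diff (the md device name /dev/md127 is an assumption; substitute whatever each environment actually assembles):

```shell
# Read-only diagnostics - run in BOTH environments, save output, and diff.
blockdev --getsz /dev/md127            # raw array size in 512-byte sectors
mdadm --examine /dev/sdb | grep -i offset  # per-member data offset, if shown
pvs -o pv_name,pe_start --units s      # where LVM thinks extents begin
dmsetup table vg_nas-media             # the exact sector mapping of the LV
blkid /dev/vg_nas/media                # does this kernel see any signature?
```

If the `dmsetup table` or `pe_start` lines differ between the two environments, that difference is the number of sectors the ext4 superblock has been shifted by.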
