User Tools

Site Tools


unix:lvm_recovery

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

unix:lvm_recovery [2013/05/27 13:56]
robm [... 2 weeks later]
unix:lvm_recovery [2013/08/20 22:54]
Line 1: Line 1:
-====== Recovering my lost data: LVM and RAID ====== 
  
-So I upgraded Ubuntu 9.10 to Ubuntu 11.10. When the system boots it says it cannot mount ''/store'', my 960GB RAID+LVM file-system. That the one that holds over 10 years of personal photographs and such. :-( 
- 
-===== About the file-system: ''store'' ===== 
- 
-There are many layers of indirection between the file-system and the physical storage when using LVM or RAID. When using both, the number of layers can seem excessive. Here's a diagram of the layers involved in my (lost) setup: 
- 
-<graphviz> 
-digraph G { 
-node [shape=box] 
- 
-md0 [label="RAID5 block device: md0"] 
-sdb1 [label="AutoRAID partition: sdb1"] 
-sdc1 [label="AutoRAID partition: sdc1"] 
-sdd1 [label="AutoRAID partition: sdd1"] 
-sde1 [label="AutoRAID partition: sde1"] 
-sdf1 [label="AutoRAID partition: sdf1"] 
-sdb [label="Disk: 320GB SATA: sdb"] 
-sdc [label="Disk: 320GB SATA: sdc"] 
-sdd [label="Disk: 320GB SATA: sdd"] 
-sde [label="Disk: 320GB SATA: sde"] 
-sdf [label="Disk: 320GB SATA: sdf"] 
- 
-fs_store [label="EXT4 file-system: store"] 
-lv_store [label="LVM logical volume: lv_store"] 
-vg_store [label="LVM volume group: vg_store"] 
-pv_store [label="LVM physical volume: pv_store"] 
- 
-sdb1 -> sdb 
-sdc1 -> sdc 
-sdd1 -> sdd 
-sde1 -> sde 
-sdf1 -> sdf 
-md0 -> {sdb1 sdc1 sdd1 sde1 sdf1} 
-pv_store -> md0 
-vg_store -> pv_store 
-lv_store -> vg_store 
-fs_store -> lv_store 
-} 
-</graphviz> 
- 
-Note that the RAID block device, ''md0'', is not partitioned. I believe that was a mistake on my part, and a likely reason why Ubuntu 11.10 cannot auto-detect it. 
- 
-===== Problem statement ===== 
- 
-Since upgrading system boot is interrupted with an error screen to the affect of "Cannot mount /store" and a prompt to enter a root shell or skip mounting. From what I can see, the RAID array is detected without problems, and is functioning correctly. So the system looks like this: 
- 
-<graphviz> 
-digraph G { 
-node [shape=box] 
- 
-md0 [label="RAID5 block device: md0 (appears to be unformatted)"] 
-sdb1 [label="AutoRAID partition: sdb1"] 
-sdc1 [label="AutoRAID partition: sdc1"] 
-sdd1 [label="AutoRAID partition: sdd1"] 
-sde1 [label="AutoRAID partition: sde1"] 
-sdf1 [label="AutoRAID partition: sdf1"] 
-sdb [label="Disk: 320GB SATA: sdb"] 
-sdc [label="Disk: 320GB SATA: sdc"] 
-sdd [label="Disk: 320GB SATA: sdd"] 
-sde [label="Disk: 320GB SATA: sde"] 
-sdf [label="Disk: 320GB SATA: sdf"] 
- 
-sdb1 -> sdb 
-sdc1 -> sdc 
-sdd1 -> sdd 
-sde1 -> sde 
-sdf1 -> sdf 
-md0 -> {sdb1 sdc1 sdd1 sde1 sdf1} 
-} 
-</graphviz> 
- 
-The RAID (multi-disk) status looks fine to me: 
- 
-<code> 
-root@ikari:~# cat /proc/mdstat  
-Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]  
-md127 : active raid5 sdb[1] sde[3] sdc[0] sdd[2] sdf[4](S) 
-      937713408 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] 
-       
-unused devices: <none> 
-</code> 
- 
-but the resulting 960.2 GB block device is partitioned as a "Linux RAID autodetect" - which would suggest that it is a *member* of some other multi-disk setup. This, I believe, is human error on my part when I created the thing... 
- 
-<code> 
-root@ikari:~# fdisk -l /dev/md127 
- 
-Disk /dev/md127: 960.2 GB, 960218529792 bytes 
-255 heads, 63 sectors/track, 116739 cylinders, total 1875426816 sectors 
-Units = sectors of 1 * 512 = 512 bytes 
-Sector size (logical/physical): 512 bytes / 512 bytes 
-I/O size (minimum/optimal): 65536 bytes / 196608 bytes 
-Disk identifier: 0xd71c877b 
- 
-      Device Boot      Start         End      Blocks   Id  System 
-/dev/md127p1              63   625137344   312568641   fd  Linux RAID autodetect 
-Partition 1 does not start on physical sector boundary. 
-</code> 
- 
-====== Recovery strategy ====== 
- 
-  - Create a disk-image of ''md127'', partition table and all (required buying an external 2TB USB drive) 
-  - Use LVM snapshots with this disk-image to make (and quickly roll-back) experimental changes 
- 
-===== Formatting the external USB drive ===== 
- 
-So I created a single "Linux LVM" partition on the 2TB disk, created a single 1.8TB physical volume and a single 1.8TB volume group containing it. On this I created a 1TB logical volume called ''lv_scratch'' and copied the contents of  ''md127'' to it (e.g. ''dd if=/dev/md127 of=/dev/mapper/lv_scratch''). Once the copy was made, I created a snapshot of ''lv_scratch'' which I imaginatively called ''snap''. 
- 
-LVM snapshots are interesting creatures. As the name suggests, the snapshot (named ''snap'') holds the state of ''lv_scratch'' as it was when I created it. I can still read and write to ''lv_scratch'', but the contents of ''snap'' will not change. This is ideal for making consistent backups. The snapshot works by deferring any writes to ''lv_scratch'' and placing them instead in some temporary copy-on-write (COW) volume. All access to ''lv_scratch'' consults the COW volume - when there is a hit it is returned, otherwise the original (unchanged) ''lv_scratch'' is read. When the snapshot is deleted, the deferred changes stored in ''snap'' are written to ''lv_scratch'' and become permanent. Makes sense if you are used to copy-on-write behaviour. 
- 
-Now here is where things get interesting. The snapshot, ''snap'', does not have to be read-only: you can create it read-write. Doing so gives you a very cheap copy of ''lv_scratch'', and any changes you make to the snapshot are stored in a COW table. You can discard the changes by deleting the snapshot. Ideal for my situation: I want to experiment with the partition table and various file-system recovery tools etc. I let these manipulate the snapshot, and if things go bad I delete and recreate the snapshot and try over. 
- 
-<graphviz> 
-digraph G { 
-sdj [label="2TB USB disk: sdj"] 
-sdj1 [label="Linux LVM partition: sdj1"] 
-pv_scratch [label="LVM physical volume"] 
-vg_scratch [label="LVM volume group: vg_scratch"] 
-lv_scratch [label="LVM logical volume: lv_scratch"] 
-snap [label="LVM logical volume: snap"] 
- 
-lv_scratch -> vg_scratch -> pv_scratch -> sdj1 -> sdj 
-snap -> vg_scratch 
- 
-snap -> lv_scratch [style="dashed",arrowhead="none"] 
-} 
-</graphviz> 
- 
-<code> 
-root@ikari:~# fdisk -l /dev/sdj 
- 
-Disk /dev/sdj: 2000.4 GB, 2000398934016 bytes 
-255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors 
-Units = sectors of 1 * 512 = 512 bytes 
-Sector size (logical/physical): 512 bytes / 512 bytes 
-I/O size (minimum/optimal): 512 bytes / 512 bytes 
-Disk identifier: 0x000f0222 
- 
-   Device Boot      Start         End      Blocks   Id  System 
-/dev/sdj1            2048  3907028991  1953513472   8e  Linux LVM 
-</code> 
- 
-<code> 
-root@ikari:~# pvdisplay  
-  /dev/dm-0: read failed after 0 of 4096 at 0: Input/output error 
-  /dev/dm-1: read failed after 0 of 4096 at 0: Input/output error 
-  /dev/dm-2: read failed after 0 of 4096 at 0: Input/output error 
-  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error 
-  --- Physical volume --- 
-  PV Name               /dev/sdj1 
-  VG Name               vg_scratch 
-  PV Size               1.82 TiB / not usable 4.00 MiB 
-  Allocatable           yes  
-  PE Size               4.00 MiB 
-  Total PE              476931 
-  Free PE               86787 
-  Allocated PE          390144 
-  PV UUID               nrf9cQ-Asfz-Y2x2-SDoT-3ppu-mpEC-Fnuf8Z 
-    
-root@ikari:~# vgdisplay 
-  /dev/dm-0: read failed after 0 of 4096 at 0: Input/output error 
-  /dev/dm-1: read failed after 0 of 4096 at 0: Input/output error 
-  /dev/dm-2: read failed after 0 of 4096 at 0: Input/output error 
-  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error 
-  --- Volume group --- 
-  VG Name               vg_scratch 
-  System ID              
-  Format                lvm2 
-  Metadata Areas        1 
-  Metadata Sequence No  13 
-  VG Access             read/write 
-  VG Status             resizable 
-  MAX LV                0 
-  Cur LV                2 
-  Open LV               0 
-  Max PV                0 
-  Cur PV                1 
-  Act PV                1 
-  VG Size               1.82 TiB 
-  PE Size               4.00 MiB 
-  Total PE              476931 
-  Alloc PE / Size       390144 / 1.49 TiB 
-  Free  PE / Size       86787 / 339.01 GiB 
-  VG UUID               Lk7UZP-48xF-vBPi-6g8F-sXlF-qyzy-pQNKgq 
-    
-root@ikari:~# lvdisplay 
-  /dev/dm-0: read failed after 0 of 4096 at 0: Input/output error 
-  /dev/dm-1: read failed after 0 of 4096 at 0: Input/output error 
-  /dev/dm-2: read failed after 0 of 4096 at 0: Input/output error 
-  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error 
-  --- Logical volume --- 
-  LV Name                /dev/vg_scratch/lv_scratch 
-  VG Name                vg_scratch 
-  LV UUID                aFBpgv-gqcd-jjLU-c7xO-Jyeb-2R0t-HpEF84 
-  LV Write Access        read/write 
-  LV snapshot status     source of 
-                         /dev/vg_scratch/snap [active] 
-  LV Status              available 
-  # open                 0 
-  LV Size                1.00 TiB 
-  Current LE             262144 
-  Segments               1 
-  Allocation             inherit 
-  Read ahead sectors     auto 
-  - currently set to     256 
-  Block device           253:1 
-    
-  --- Logical volume --- 
-  LV Name                /dev/vg_scratch/snap 
-  VG Name                vg_scratch 
-  LV UUID                OvOsQ7-uACi-xJVZ-vseu-fKEc-F73h-CmSalH 
-  LV Write Access        read/write 
-  LV snapshot status     active destination for /dev/vg_scratch/lv_scratch 
-  LV Status              available 
-  # open                 0 
-  LV Size                1.00 TiB 
-  Current LE             262144 
-  COW-table size         500.00 GiB 
-  COW-table LE           128000 
-  Allocated to snapshot  0.00%  
-  Snapshot chunk size    4.00 KiB 
-  Segments               1 
-  Allocation             inherit 
-  Read ahead sectors     auto 
-  - currently set to     256 
-  Block device           253:3 
-</code> 
- 
-So here's the goal I'm aiming for on my external storage: 
- 
-<graphviz> 
-digraph G { 
-sdj [label="2TB USB disk: sdj"] 
-sdj1 [label="Linux LVM partition: sdj1"] 
-pv_scratch [label="LVM physical volume"] 
-vg_scratch [label="LVM volume group: vg_scratch"] 
-lv_scratch [label="LVM logical volume: lv_scratch"] 
-snap [label="LVM logical volume: snap"] 
- 
-lv_scratch -> vg_scratch -> pv_scratch -> sdj1 -> sdj 
-snap -> vg_scratch 
-snap -> lv_scratch [style="dashed",arrowhead="none"] 
- 
-node [shape=box] 
-fs_store [label="EXT4 file-system: store"] 
-lv_store [label="LVM logical volume: lv_store"] 
-vg_store [label="LVM volume group: vg_store"] 
-pv_store [label="LVM physical volume: pv_store"] 
- 
- 
-pv_store -> snap 
-vg_store -> pv_store 
-lv_store -> vg_store 
-fs_store -> lv_store 
- 
- 
-} 
-</graphviz> 
- 
-====== Recognising the nested LVM volumes ====== 
- 
-Since it's been a few days and reboots since I last worked on this, I'll start by plugging the USB drive it. 
- 
-<code> 
-root@ikari:~# dmesg 
-[  479.180019] usb 2-5: new high speed USB device number 7 using ehci_hcd 
-[  479.313228] scsi13 : usb-storage 2-5:1.0 
-[  480.312605] scsi 13:0:0:0: Direct-Access     Seagate  Desktop          0130 PQ: 0 ANSI: 4 
-[  480.336633] sd 13:0:0:0: Attached scsi generic sg10 type 0 
-[  480.337029] sd 13:0:0:0: [sdi] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB) 
-[  480.337671] sd 13:0:0:0: [sdi] Write Protect is off 
-[  480.337671] sd 13:0:0:0: [sdi] Mode Sense: 2f 08 00 00 
-[  480.340027] sd 13:0:0:0: [sdi] No Caching mode page present 
-[  480.340027] sd 13:0:0:0: [sdi] Assuming drive cache: write through 
-[  480.341806] sd 13:0:0:0: [sdi] No Caching mode page present 
-[  480.341811] sd 13:0:0:0: [sdi] Assuming drive cache: write through 
-[  480.357290]  sdi: sdi1 
-[  480.359346] sd 13:0:0:0: [sdi] No Caching mode page present 
-[  480.359350] sd 13:0:0:0: [sdi] Assuming drive cache: write through 
-[  480.359354] sd 13:0:0:0: [sdi] Attached SCSI disk 
-</code> 
- 
-The (outer) LVM PVs are automatically detects, and their VGs + LVs are subsequently detected: 
- 
-<code> 
-root@ikari:~# pvs 
-  PV         VG         Fmt  Attr PSize PFree   
-  /dev/sdi1  vg_scratch lvm2 a-   1.82t 339.01g 
- 
-root@ikari:~# vgs 
-  VG         #PV #LV #SN Attr   VSize VFree   
-  vg_scratch       1 wz--n- 1.82t 339.01g 
- 
-root@ikari:~# lvs 
-  LV         VG         Attr   LSize   Origin     Snap%  Move Log Copy%  Convert 
-  lv_scratch vg_scratch owi-a-   1.00t                                           
-  snap       vg_scratch swi-a- 500.00g lv_scratch   0.00    
-</code> 
- 
-Somewhere on the ''snap'' logcial volume is my nested LVM. I used ''xxd /dev/vg_scratch/snap | less'' and searched for ''LVM2''. The first hit was a false-positive (appeared to have stripes of NULLs written across it), but the second hit looked plausible: 
- 
-<code> 
-8018600: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018610: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018620: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018630: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018640: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018650: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018660: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018670: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018680: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018690: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-80186a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-80186b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-80186c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-80186d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-80186e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-80186f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018700: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018710: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018720: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018730: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018740: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018750: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018760: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018770: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018780: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018790: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-80187a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-80187b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-80187c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-80187d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-80187e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-80187f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018800: 4c41 4245 4c4f 4e45 0100 0000 0000 0000  LABELONE........ 
-8018810: 9148 4053 2000 0000 4c56 4d32 2030 3031  .H@S ...LVM2 001 
-8018820: 5341 7536 6e32 7578 474c 5148 6743 5351  SAu6n2uxGLQHgCSQ 
-8018830: 6b56 6b5a 655a 4c78 7874 314b 7652 6a31  kVkZeZLxxt1KvRj1 
-8018840: 00f8 0391 df00 0000 0000 0300 0000 0000  ................ 
-8018850: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018860: 0000 0000 0000 0000 0010 0000 0000 0000  ................ 
-8018870: 00f0 0200 0000 0000 0000 0000 0000 0000  ................ 
-8018880: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-8018890: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-80188a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-</code> 
- 
-I know from using ''xxd'' to examine a correct and functioning LVM2 partition (the PV behind ''vg_scratch'' as it happens) that the "LVM2" text should appear at 0x210. So I'll create a loopback device with an appropriate offset to make that happen: 
- 
-<code> 
-root@ikari:~# losetup /dev/loop0 /dev/vg_scratch/snap --offset $((0x8018600)) 
- 
-root@ikari:~# lvmdiskscan  
-  /dev/ram0  [      64.00 MiB]  
-  /dev/loop0 [     894.15 GiB] LVM physical volume 
-  /dev/dm-0  [     186.27 GiB]  
-  /dev/ram1  [      64.00 MiB]  
-  /dev/sda1  [     294.09 GiB]  
-  /dev/dm-1  [     894.27 GiB]  
-  /dev/ram2  [      64.00 MiB]  
-  /dev/dm-2  [     894.27 GiB]  
-  /dev/ram3  [      64.00 MiB]  
-  /dev/dm-3  [     894.27 GiB]  
-  /dev/ram4  [      64.00 MiB]  
-  /dev/dm-4  [     782.47 GiB]  
-  /dev/ram5  [      64.00 MiB]  
-  /dev/sda5  [       4.00 GiB]  
-  /dev/dm-5  [     715.38 GiB]  
-  /dev/ram6  [      64.00 MiB]  
-  /dev/ram7  [      64.00 MiB]  
-  /dev/ram8  [      64.00 MiB]  
-  /dev/ram9  [      64.00 MiB]  
-  /dev/ram10 [      64.00 MiB]  
-  /dev/ram11 [      64.00 MiB]  
-  /dev/ram12 [      64.00 MiB]  
-  /dev/ram13 [      64.00 MiB]  
-  /dev/ram14 [      64.00 MiB]  
-  /dev/ram15 [      64.00 MiB]  
-  /dev/sdb1  [       1.82 TiB] LVM physical volume 
-  0 disks 
-  24 partitions 
-  0 LVM physical volume whole disks 
-  2 LVM physical volumes 
- 
-root@ikari:~# pvs 
-  PV         VG         Fmt  Attr PSize   PFree   
-  /dev/loop0 store_vg   lvm2 a-   894.25g 178.88g 
-  /dev/sdb1  vg_scratch lvm2 a-     1.82t      0  
-</code> 
- 
-If ''lvmdiskscan'' doesn work you could try using ''partprobe'' to tell the Kernel to rescan partition tables and do what it does. 
- 
-<code> 
-root@ikari:~# lvs 
-  LV         VG         Attr   LSize   Origin     Snap%  Move Log Copy%  Convert 
-  store_lv   store_vg   -wi-a- 715.38g                                           
-  home_zfs   vg_scratch -wi-a- 186.27g                                           
-  lv_scratch vg_scratch owi-a- 894.27g                                           
-  snap       vg_scratch swi-ao 782.47g lv_scratch   0.00     
-</code> 
- 
-====== Finding the ext4 file-system ====== 
- 
-OK, so now I can read/write my ''store'' LVM that housed my ''ext4'' file-system. But the file-system isn't where it's supposed to be, apparently: 
- 
-<code> 
-root@ikari:~# file -Ls /dev/store_vg/store_lv  
-/dev/store_vg/store_lv: data 
-</code> 
- 
-Time to play with a hex editor again to look at a valid ''ext4'' file-system header (from my current Ubuntu installation, ''/dev/sda1''). 
- 
-After a bit of searching I find the characteristic ''53ef'' marker in the right column, followed by the name of my file-system: ''store''. Looks good. 
- 
-<code> 
-# xxd /dev/store_vg/store_lv | less 
-05c9db0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-05c9dc0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-05c9dd0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-05c9de0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-05c9df0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-05c9e00: 00c0 fc06 c017 f90d c9da b200 0c2a d00a  .............*.. 
-05c9e10: c604 f906 0000 0000 0200 0000 0200 0000  ................ 
-05c9e20: 0080 0000 0080 0000 0040 0000 9d17 aa4d  .........@.....M 
-05c9e30: 9d17 aa4d 0200 1b00 53ef 0100 0100 0000  ...M....S....... 
-05c9e40: 4c2c a24d 004e ed00 0000 0000 0100 0000  L,.M.N.......... 
-05c9e50: 0000 0000 0b00 0000 8000 0000 3400 0000  ............4... 
-05c9e60: 0600 0000 0300 0000 9495 5e9b 7d7e 41da  ..........^.}~A. 
-05c9e70: a595 6af4 4848 b5c6 7374 6f72 6500 0000  ..j.HH..store... 
-05c9e80: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-05c9e90: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-05c9ea0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-05c9eb0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-05c9ec0: 0000 0000 0000 0000 0000 0000 0000 c803  ................ 
-05c9ed0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-05c9ee0: 0800 0000 0000 0000 0000 0000 9a4a cd41  .............J.A 
-05c9ef0: 88ff 4759 ac42 d083 8b3f 3b0d 0201 0000  ..GY.B...?;..... 
-05c9f00: 0000 0000 0000 0000 cf4a 9f46 0906 0000  .........J.F.... 
-05c9f10: 0a06 0000 0b06 0000 0c06 0000 0d06 0000  ................ 
-</code> 
- 
-By comparing this to my valid file-system, I can see that the ''53ef'' line should be at offset 0x430: 
- 
-<code> 
-# xxd /dev/sda1 | less 
-0000380: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-0000390: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-00003a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-00003b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-00003c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-00003d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-00003e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-00003f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-0000400: 0020 2601 005d 9804 73d1 3a00 8366 bb02  . &..]..s.:..f.. 
-0000410: d8da 2001 0000 0000 0200 0000 0200 0000  .. ............. 
-0000420: 0080 0000 0080 0000 0020 0000 b153 264f  ......... ...S&O 
-0000430: d9bf f44e 0d00 2200 53ef 0100 0100 0000  ...N..".S....... 
-0000440: 98b4 f44e 004e ed00 0000 0000 0100 0000  ...N.N.......... 
-0000450: 0000 0000 0b00 0000 0001 0000 3c00 0000  ............<... 
-0000460: 4602 0000 7b00 0000 89bb eae1 b864 492d  F...{........dI- 
-0000470: a3d6 2bc9 5336 151e 0000 0000 0000 0000  ..+.S6.......... 
-0000480: 0000 0000 0000 0000 2f00 1eae 7c13 0000  ......../...|... 
-0000490: 0000 c099 98a1 0188 ffff 985d 698f 0188  ...........]i... 
-00004a0: ffff 307a 87a2 0188 ffff 307a 87a2 0188  ..0z......0z.... 
-00004b0: ffff 585c 698f 0188 ffff 645d 1681 ffff  ..X\i.....d].... 
-00004c0: ffff c000 588f 0188 0000 0000 0000 ed03  ....X........... 
-00004d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-00004e0: 0800 0000 0000 0000 b60a 4000 5547 eae9  ..........@.UG.. 
-00004f0: 2f3c 45ee 9913 70fb 20b7 7395 0101 0000  /<E...p. .s..... 
-0000500: 0000 0000 0000 0000 98b4 f44e 0af3 0200  ...........N.... 
-0000510: 0400 0000 0000 0000 0000 0000 ff7f 0000  ................ 
-0000520: 0080 4802 ff7f 0000 0100 0000 ffff 4802  ..H...........H. 
-0000530: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-0000540: 0000 0000 0000 0000 0000 0000 0000 0008  ................ 
-0000550: 0000 0000 0000 0000 0000 0000 1c00 1c00  ................ 
-0000560: 0100 0000 0000 0000 0000 0000 0000 0000  ................ 
-0000570: 0000 0000 0400 0000 b6b0 300b 0000 0000  ..........0..... 
-0000580: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-0000590: 0000 0000 0000 0000 0000 0000 0000 0000  ................ 
-</code> 
- 
-So let's create another loopback device with an offset: 
- 
-<code> 
-# losetup /dev/loop1 /dev/store_vg/store_lv --offset $((0x5C9A00)) 
-</code> 
- 
-For the record, that makes the current overall loopback settings: 
- 
-<code> 
-# losetup -a 
-/dev/loop0: [0005]:12783 (/dev/mapper/vg_scratch-snap), offset 134317568 
-/dev/loop1: [0005]:40851 (/dev/mapper/store_vg-store_lv), offset 6068736 
-</code> 
- 
-So is the ''ext4'' file-system visible? 
- 
-<code> 
-# file -Ls /dev/loop1 
-/dev/loop1: Linux rev 1.0 ext3 filesystem data, UUID=94955e9b-7d7e-41da-a595-6af44848b5c6, volume name "store" (needs journal recovery) (large files) 
-</code> 
- 
-Success! 
- 
-**Update**: I ended up writing a Python script, [[https://github.com/meermanr/ext3_recovery|find_ext3.py]], to help me locate ''ext3'' superblocks, and automatically check their validity using ''dumpe2fs''. 
- 
-====== Getting my data back ====== 
- 
-Obviously I tried mounting it first, to no avail: 
- 
-<code> 
-root@ikari:/tmp# mount /dev/loop1 /tmp/store 
-mount: wrong fs type, bad option, bad superblock on /dev/loop1, 
-       missing codepage or helper program, or other error 
-       In some cases useful info is found in syslog - try 
-       dmesg | tail  or so 
- 
-</code> 
- 
-Since this is all sitting on top of the LVM snapshot I made earlier (''/dev/vg_scratch/snap'') I am happy to try various tools that modify the drive contents, such as ''fsck''. 
- 
-First attempt, no joy: 
- 
-<code> 
-# fsck.ext4 -y /dev/loop1 
-e2fsck 1.41.14 (22-Dec-2010) 
-fsck.ext4: Group descriptors look bad... trying backup blocks... 
-fsck.ext4: Bad magic number in super-block when using the backup blocks 
-fsck.ext4: going back to original superblock 
-Error reading block 3226742528 (Invalid argument).  Ignore error? yes 
- 
-Force rewrite? yes 
- 
-Superblock has an invalid journal (inode 8). 
-Clear? yes 
- 
-*** ext3 journal has been deleted - filesystem is now ext2 only *** 
- 
-The filesystem size (according to the superblock) is 234428352 blocks 
-The physical size of the device is 187529782 blocks 
-Either the superblock or the partition table is likely to be corrupt! 
-Abort? yes 
- 
-Error writing block 3226742528 (Invalid argument).  Ignore error? yes 
-</code> 
- 
-Using my [[https://github.com/meermanr/ext3_recovery|find_ext3.py]] script I mapped out the location of all superblocks on my disk in the form of a log file as follows: 
- 
-<code> 
-# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) 
-# head vg_scratch-snap.log  
-OK  /dev/vg_scratch/snap 1024 store #0 4096 kB 32768 bpg, origin 0 
-BAD /dev/vg_scratch/snap 6358016 store #0 4096 kB 32768 bpg, origin 6356992 
-BAD /dev/vg_scratch/snap 6403072 store #0 4096 kB 32768 bpg, origin 6402048 
-BAD /dev/vg_scratch/snap 6444032 store #0 4096 kB 32768 bpg, origin 6443008 
-BAD /dev/vg_scratch/snap 6456320 store #0 4096 kB 32768 bpg, origin 6455296 
-BAD /dev/vg_scratch/snap 6525952 store #0 4096 kB 32768 bpg, origin 6524928 
-BAD /dev/vg_scratch/snap 6562816 store #0 4096 kB 32768 bpg, origin 6561792 
-BAD /dev/vg_scratch/snap 6640640 store #0 4096 kB 32768 bpg, origin 6639616 
-BAD /dev/vg_scratch/snap 6652928 store #0 4096 kB 32768 bpg, origin 6651904 
-BAD /dev/vg_scratch/snap 6722560 store #0 4096 kB 32768 bpg, origin 6721536 
-</code> 
- 
-Obviously the "OK" superblock at origin 0 isn't actually valid, according to ''fsck''. I suspect I've reformatted this drive a number of times, and unfortunately used the same label (''store'') each time, which doesn't help. So let's take a statistical approach: tally how often each origin offset is mentioned, and investigate the most frequent: 
- 
-<code> 
-# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) 
-# cut -d' ' -f9- ./vg_scratch-snap.log | sort | uniq -c | sort -rn | head 
-     22 origin 231936 
-     19 origin 0 
-      1 origin 880602744320 
-      1 origin 8802304 
-      1 origin 8790016 
-      1 origin 8679424 
-      1 origin 8667136 
-      1 origin 8556544 
-      1 origin 8544256 
-      1 origin 8433664 
-</code> 
- 
-As above, I mounted this on a loop back device and ran ''fsck'' on it. This got a lot further than previous attempts: more text scrolled past me, and it sat there for a while writing out a new journal. Ultimately it gave up with the same error as last time. 
- 
-On the advice of [[http://forums.gentoo.org/viewtopic-p-3778374.html#3778374|Gentoo forum post]] I ran 
- 
-<code> 
-mke2fs -S /dev/loop0 
-</code> 
- 
-followed by ''fsck''. That was last night. This morning it's still running, pinning one of my CPUs at 100%, consuming so much memory it has caused my system to nearly exhaust its swap file and nearly consume all of the snapshot volumes Copy-On-Write table! 
- 
-===== Lowering IO and CPU scheduling priority of fsck ===== 
- 
-<code> 
-# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) 
-# ps fo pid,pmem,pcpu,cmd -t 13 
-  PID %MEM %CPU CMD 
-19540  0.0  0.0 -bash 
-19666  0.0  0.0  \_ sudo su 
-19667  0.0  0.0      \_ su 
-19675  0.0  0.0          \_ bash 
-20181  0.0  0.0              \_ fsck /dev/loop0 -y 
-20182 49.4 86.8                  \_ fsck.ext2 -y /dev/loop0 
-</code> 
- 
-To keep my system usable I lowered the IO and CPU priority of ''fsck''. First change the IO scheduling class to "idle" (3) for the hungry process: 
- 
-<code> 
-root@Ikari:~# ionice -c3 -p 20182 
-</code> 
- 
-Then raise the "niceness" of the process. Higher values make processes nicer, which means they are more likely to "give way" to other processes. Really it just means the kernel will pre-empty nice processes more often: 
- 
-<code> 
-root@Ikari:~# renice 10 20182 
-20182 (process ID) old priority 0, new priority 10 
-</code> 
- 
-===== Adding more swap to my system ===== 
- 
-<code> 
-# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) 
-# free -m 
-             total       used       free     shared    buffers     cached 
-Mem:          5969       5523        445          0        325        117 
--/+ buffers/cache:       5080        888 
-Swap:         6234       3477       2757 
-</code> 
- 
-I was concerned to see that 50% of my swap was in use. I don't know how long ''fsck'' will take, so adding more swap seems prudent. As luck would have it, I recently added an solid state drive (SSD) to my system, so I have an unused spinning disk which I'm pretty sure has a swap partition on it that isn't doing anything. 
- 
-<code> 
-# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) 
-# lsblk 
-NAME                                  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT 
-loop0                                   7:   0 894.3G  0 loop  
-sda                                     8:   0  93.2G  0 disk  
-├─sda1                                  8:1    0  87.2G  0 part / 
-├─sda2                                  8:2    0     1K  0 part  
-└─sda5                                  8:5    0     6G  0 part [SWAP] 
-sdb                                     8:16   0 298.1G  0 disk  
-├─sdb1                                  8:17   0 294.1G  0 part  
-├─sdb2                                  8:18       1K  0 part  
-└─sdb5                                  8:21       4G  0 part 
-sr0                                    11:0    1  1024M  0 rom   
-sr1                                    11:1    1  1024M  0 rom   
-sde                                     8:64     1.8T  0 disk  
-└─sde1                                  8:65     1.8T  0 part  
-  ├─vg_scratch-home_zfs (dm-0)        252:0    0 186.3G  0 lvm   
-  ├─vg_scratch-lv_scratch-real (dm-3) 252:3    0 894.3G  0 lvm   
-  │ ├─vg_scratch-lv_scratch (dm-2)    252:2    0 894.3G  0 lvm   
-  │ └─vg_scratch-snap (dm-1)          252:1    0 894.3G  0 lvm   
-  ├─vg_scratch-snap-cow (dm-4)        252:4    0  18.6G  0 lvm   
-  │ └─vg_scratch-snap (dm-1)          252:1    0 894.3G  0 lvm   
-  └─vg_scratch-photorec (dm-5)        252:5    0 763.9G  0 lvm   
-</code> 
- 
-It's probably ''/dev/sdb5'', let's check with ''file'': 
- 
-<code> 
-# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) 
-# file -Ls /dev/sdb5 
-/dev/sdb5: no read permission 
- 
-# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) 
-# sudo !! 
-sudo file -Ls /dev/sdb5 
-[sudo] password for meermanr:  
-/dev/sdb5: Linux/i386 swap file (new style), version 1 (4K pages), size 1048063 pages, no label, UUID=d0bbff73-a09a-47f6-8387-e27268cdc9fc 
-</code> 
- 
-Great! Let's enable it! 
- 
-<code> 
-# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) 
-# sudo swapon /dev/sdb5 
-</code> 
- 
-And verify: 
- 
-<code> 
-# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) 
-# lsblk | grep SWAP 
-└─sda5                                  8:5    0     6G  0 part [SWAP] 
-└─sdb5                                  8:21       4G  0 part [SWAP] 
-</code> 
- 
-<code> 
-# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) 
-# free -m 
-             total       used       free     shared    buffers     cached 
-Mem:          5969       5672        297          0        435        138 
--/+ buffers/cache:       5098        871 
-Swap:        10234       3476       6758 
-</code> 
- 
-===== Extending the snapshot volume ===== 
- 
-<code> 
-# lvdisplay /dev/vg_scratch/snap 
- 
-  --- Logical volume --- 
-  LV Name                /dev/vg_scratch/snap 
-  VG Name                vg_scratch 
-  LV UUID                4EFJ8Y-bzWT-aif4-MlT9-4234-aS1d-qcipq0 
-  LV Write Access        read/write 
-  LV snapshot status     active destination for /dev/vg_scratch/lv_scratch 
-  LV Status              available 
-  # open                 1 
-  LV Size                894.27 GiB 
-  Current LE             228934 
-  COW-table size         9.63 GiB 
-  COW-table LE           2335 
-  Allocated to snapshot  74.40%       <-- Do not want! 
-  Snapshot chunk size    4.00 KiB 
-  Segments               1 
-  Allocation             inherit 
-  Read ahead sectors     auto 
-  - currently set to     256 
-  Block device           252:1 
-</code> 
- 
-As it happens, I had not allocated all of the volume group: 
- 
-<code> 
-root@Ikari:~# vgs 
-  VG         #PV #LV #SN Attr   VSize VFree 
-  vg_scratch       1 wz--n- 1.73t    9.3g 
-</code> 
- 
-So extending the snapshot is easy: 
- 
-<code> 
-root@Ikari:~# lvextend /dev/vg_scratch/snap --extents +100%FREE 
-</code> 
- 
-Verify: 
- 
-<code> 
-# lvdisplay /dev/vg_scratch/snap 
-  --- Logical volume --- 
-  LV Name                /dev/vg_scratch/snap 
-  VG Name                vg_scratch 
-  LV UUID                4EFJ8Y-bzWT-aif4-MlT9-4234-aS1d-qcipq0 
-  LV Write Access        read/write 
-  LV snapshot status     active destination for /dev/vg_scratch/lv_scratch 
-  LV Status              available 
-  # open                 1 
-  LV Size                894.27 GiB 
-  Current LE             228934 
-  COW-table size         18.63 GiB      <-- Has increased 
-  COW-table LE           4769           <-- (Same thing, but measured in logical extents) 
-  Allocated to snapshot  39.18%         <-- Much better! 
-  Snapshot chunk size    4.00 KiB 
-  Segments               1 
-  Allocation             inherit 
-  Read ahead sectors     auto 
-  - currently set to     256 
-  Block device           252:1 
-</code> 
- 
-===== ... 2 weeks later ===== 
- 
-It has been two weeks since I started ''fsck'', and it is still running. During this time I've not been able to use my desktop PC for gaming, and so I've decided to hit ^C and move it to another machine. Here are the forensics: 
- 
-Now: 
- 
-<code> 
-# meermanr@Ikari:/home/meermanr (master *) 
-# date 
-Mon May 27 14:53:34 BST 2013 
-</code> 
- 
-Size of block device: 
- 
-<code> 
-root@Ikari:/home/meermanr/projects/find_ext3# python 
-Python 2.7.3 (default, Aug  1 2012, 05:14:39)  
-[GCC 4.6.3] on linux2 
-Type "help", "copyright", "credits" or "license" for more information. 
->>> f = open("/dev/loop0") 
->>> f.seek(0, 2) 
->>> f.tell() 
-960218560000 
->>> hex(f.tell()) 
-'0xdf917c7600' 
->>>  
-</code> 
- 
-Offsets of ''fsck'' and ''python'': 
- 
-<code> 
-Every 2.0s: lsof /dev/loop0                                                       Mon May 27 14:56:18 2013 
- 
-COMMAND     PID USER   FD   TYPE DEVICE     SIZE/OFF NODE NAME 
-fsck.ext2 20182 root    4u   BLK    7,0 0x3f903ef000 5941 /dev/loop0 
-python    23598 root    3r   BLK    7,0 0xdf917c7600 5941 /dev/loop0 
-</code> 
- 
-That's approximately 28%. :-( 
- 
-<code> 
- STARTED %CPU %MEM   RSS CMD 
-  May 13  0.0  0.0   364 su 
-  May 13  0.0  0.0   528  \_ bash 
-  May 13  0.0  0.0  1320      \_ watch lvdisplay /dev/vg_scratch/snap 
-  May 13  0.0  0.0   364 su 
-  May 13  0.0  0.0   536  \_ bash 
-  May 13  0.0  0.0   464      \_ fsck /dev/loop0 -y 
-  May 13 99.1 24.3 1488468          \_ fsck.ext2 -y /dev/loop0 
-  May 14  0.0  0.0  1456 watch lsof /dev/loop0 
-</code> 
- 
-So ''fsck'' has about 1.4GiB of resident memory (and 4696MiB of virtual memory according to top): 
- 
-<code> 
-  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                      
-20182 root      30  10 4696m 1.4g  784 R   94 24.3  19564:51 fsck.ext2 -y /dev/loop0   
-</code> 
- 
-Total system memory: 
- 
-<code> 
-# meermanr@Ikari:/home/meermanr (master *) 
-# free -m 
-             total       used       free     shared    buffers     cached 
-Mem:          5969       5036        933          0         87        157 
--/+ buffers/cache:       4791       1178 
-Swap:        10234       4764       5470 
-</code> 
- 
-<code> 
-Every 2.0s: lvdisplay /dev/vg_scratch/snap                                       Mon May 27 14:54:06 2013 
- 
-File descriptor 4 (pipe:[72594953]) leaked on lvdisplay invocation. Parent PID 23517: sh 
-  --- Logical volume --- 
-  LV Name                /dev/vg_scratch/snap 
-  VG Name                vg_scratch 
-  LV UUID                4EFJ8Y-bzWT-aif4-MlT9-4234-aS1d-qcipq0 
-  LV Write Access        read/write 
-  LV snapshot status     active destination for /dev/vg_scratch/lv_scratch 
-  LV Status              available 
-  # open                 1 
-  LV Size                894.27 GiB 
-  Current LE             228934 
-  COW-table size         84.89 GiB 
-  COW-table LE           21733 
-  Allocated to snapshot  43.48% 
-  Snapshot chunk size    4.00 KiB 
-  Segments               2 
-  Allocation             inherit 
-  Read ahead sectors     auto 
-  - currently set to     256 
-  Block device           252:1 
-</code> 
- 
-Output from ''fsck'' itself: 
- 
-<code> 
-File ... (inode #9791282, mod time Thu Oct  5 01:40:26 2006)  
-  has 11143 multiply-claimed block(s), shared with 5 file(s): 
- <filesystem metadata> 
- ... (inode #9791794, mod time Thu Oct  5 01:40:26 2006) 
- ... (inode #4115835, mod time Thu Aug 20 03:31:06 2009) 
- ... (inode #4130006, mod time Mon Nov 29 16:38:10 2010) 
- ... (inode #4784754, mod time Tue Jul 26 06:01:10 2005) 
-Clone multiply-claimed blocks? yes 
-</code> 
unix/lvm_recovery.txt · Last modified: 2013/08/20 22:54 (external edit)