26 October, 2015

NAS migration and checking with md5

Recently I migrated from my home-made NAS server (a no-name desktop PC running Ubuntu 14.04 Server) to a sweet Synology DiskStation DS414. In this post I will write about the migration process: how I migrated from a 10 TB LVM configuration to a 12 TB RAID5 configuration, and how I used md5 hashes to check that all files were copied without any error.

In the old configuration I had 2×4 TB hard disks and one 2 TB hard disk; in the new configuration I planned to have 4×4 TB HDDs, reusing the 2×4 TB drives from the old configuration. I was very afraid of losing data, so the obvious approach did not work for me: install the 2 new 4 TB hard disks into the new NAS in RAID5 configuration, copy 4 TB of data from the old NAS, move one 4 TB HDD over from the old machine, copy another 4 TB, and finally move the last 4 TB HDD to the new machine. I was afraid that during the RAID expansion or the LVM reconfiguration I would do something wrong and lose data. Therefore I got an additional 10 TB of storage on 4 hard disks, copied all the content from the old NAS to these disks, then moved the 2×4 TB disks to the new NAS and copied all the data from the additional disks to the new NAS.

Before starting the copy I generated md5 hashes on the old NAS; after copying all the data to the additional disks I generated md5 hashes again, and at the end I did the same on the new NAS. To compare the hashes I used Microsoft Access.

EDIT: I used cp -r to copy the files, but this sets the file times to the current date; I should have used cp -r --preserve=timestamps.
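For reference, a minimal runnable sketch of the corrected command, using a throw-away demo tree instead of the real NAS mount points:

```shell
# Create a small demo tree standing in for the real NAS mounts.
src=$(mktemp -d); dst=$(mktemp -d)
echo demo > "$src/file.txt"
touch -d '2015-01-01' "$src/file.txt"

# Recursive copy that keeps the original modification times
# (plain `cp -r` would stamp the copies with the current date).
cp -r --preserve=timestamps "$src" "$dst/data"
```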

And here are the lessons learned:

  1. I used du -h -d 1 directory or du --si -d 1 directory to check the size of the directories when I needed to split the 10 TB of data across 4 different HDDs. The second form is better because the size of HDDs is always given in base-10 gigabytes or terabytes.
  2. I used find "directory" -type f -print0 | xargs -0 md5sum -b >md5file.txt to create the md5 sums for a directory structure. It creates a text file with the md5 hashes and the file names. In my case I had altogether more than 1.5 million files, and the size of this file was about 300 megabytes.
  3. I made the hashes for the original NAS, and also for all four additional HDDs where the backups were stored.
  4. Access has a file size limitation of 2 gigabytes, and with this project I repeatedly ran into it. If you get some very strange errors in Access, it is most probably because you have hit this limitation.
  5. When importing the md5 files, cut out the part of the file name that differs between the locations, so that the file names can be compared more easily.
  6. When importing, set the character set to UTF-8 and check that the file name field is long enough. Access tends to guess the length based on the beginning of the file and cuts off the end of longer file names. I always set it manually to 255.
  7. My process was to create a separate table with the names of the main directories I copied to the different disks. Then I added this directory name to the table of md5 hashes and file names to see which files belong where (if a file belonged to multiple nested directories, multiple records were created for the same file name), and finally I compared these tables from the different drives and summarized the files by directory. There were 3 possible results: -1, 0 and nothing. -1 means a matching md5, 0 means a non-matching md5, and empty means a missing file. So for each directory I got 3 figures: files which were copied OK, files where the copy was not successful, and files I forgot to copy.
  8. I did not do it, but instead of storing the directory names in the database, I should have stored only the id of the directory; this would have reduced the database size significantly.
  9. I did not use it, but Unicode compression was switched off in Access; enabling it would also have reduced the size of the database.
  10. I used linked Access databases to work around the 2 gigabyte file limit, and very often, if a complicated query was used, I ran it in make-table mode to speed up the subsequent queries using its result.
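As an aside, the same three-way classification (matching, mismatching, missing) can be sketched with standard Unix tools instead of Access. The demo hash lists below are placeholders for the real md5 files; the field-based approach assumes file names without spaces:

```shell
# Demo hash lists in md5sum -b format ("hash *name"); in the real
# migration these came from `find ... | xargs -0 md5sum -b`.
printf 'aaa *f1\nbbb *f2\nccc *f3\n' > old.md5
printf 'aaa *f1\nddd *f2\n' > new.md5

# Sort both lists by file name (field 2) so join can pair them up.
sort -k 2 old.md5 > old.sorted
sort -k 2 new.md5 > new.sorted

# For every name present in both lists: "name hash_old hash_new".
join -j 2 old.sorted new.sorted > both.txt

awk '$2 == $3' both.txt > ok.txt                     # copied OK
awk '$2 != $3' both.txt > corrupted.txt              # hash mismatch
join -v 1 -j 2 old.sorted new.sorted > missing.txt   # forgot to copy
```

This gives the same three per-directory figures, at the cost of losing Access's convenient grouping and reporting.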
And finally, it was worth doing: I found some files where errors were made during copying.


Steve8x8 said...

This is an impressive Todo list. Would "rsync" have made it shorter? I guess, yes.
Did you identify the reasons why files got corrupted during transfer?

Lacó said...

Hi Steve8x8!

You are probably right about rsync; I had read somewhere that it is relatively slow, so I used plain cp to copy the contents. However, counting the time I spent generating the md5 hashes and the many hours spent wrestling with Access, using cp was definitely slower. But having done this md5 checking, I am sure that the contents are the same.

There were two types of transfer errors I found:

1. On the original NAS there were 2 corrupted files, probably because of an HDD error; cp was able to read them, but the result differed from the original.
2. As copying several terabytes takes several days, I accidentally interrupted the copy in some cases and then continued with cp -n. The files whose copy was interrupted got corrupted, because cp -n skips files that already exist, so the truncated partial files were never re-copied.


Steve8x8 said...

OK, (2) would have been covered by rsync - it would create temporary file names and only rename those on success. You might have some additional files in your copy, but rsync --delete would have gotten rid of them.
(1) still leaves me confused. If a file was (partially) unreadable, that would perhaps explain a difference in md5 sums. I'd expect cp to stop at some point...

Lacó said...

I have checked the speed of cp vs rsync:

I had a directory with photos and videos: 55 GB, 8392 files in 107 folders. I mounted the NAS over SMB to a local mount point and got the following results:

cp: 1033 seconds, ~53.2 MB/s
rsync: 1198 seconds, ~45.9 MB/s
md5 on NAS: 688 seconds, ~79.9 MB/s
md5 local: 489 seconds, ~112.5 MB/s

My conclusion from this is that rsync is slightly (about 15%) slower than cp, but not significantly, so if there is some benefit to using rsync, speed is not a limiting factor.