Home NAS setup with Thecus N5550
(tl;dr: use the latest rdup for backups. Thecus has problems with some drives, even ones on the compatibility list; if you run into that, try updating the drive firmware.)
So I recently saw a reasonably good deal on the Thecus N5550, a 5-bay hot-swappable NAS machine. This seemed as good a time as any to finally clean up my backups and Do Them Right.
So the plan is to see how much the already excellent rdup has progressed (last time I used it was at version 0.5, it is now at 1.1.13, the version from the repositories on my systems was 1.0.5).
Initial setup for the N5550 is quite OK. I set up NTP, NFS access, Samba access, and enabled SSH login to have a low-level look at what’s going on inside.
Setting up the RAID array was a breeze. For my first experiment I chose a RAID-5 array of 3 Seagate 3TB disks, totalling about 5.5TB of space.
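For the curious, here is where the 5.5 comes from. This is a minimal sketch, assuming the disks are sold in decimal terabytes and the NAS reports binary units; the function name is just for illustration, and filesystem overhead is ignored.

```python
TB = 10**12   # decimal terabyte, as printed on the disk label
TIB = 2**40   # binary tebibyte, as most tools report it


def raid5_usable_bytes(disk_count, disk_bytes):
    """RAID-5 spends one disk's worth of space on parity."""
    if disk_count < 3:
        raise ValueError("RAID-5 needs at least three disks")
    return (disk_count - 1) * disk_bytes


usable = raid5_usable_bytes(3, 3 * TB)
print(f"{usable / TB:.1f} TB = {usable / TIB:.2f} TiB")
# prints "6.0 TB = 5.46 TiB"
```

So 6 TB on the label ends up as roughly 5.5 in binary units.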
I installed rdup on my VPS and on a local always-on machine (a low-power fanless box acting as a fancy gateway), and initially I mounted the big drive as a Samba mount (there were some unidentified problems with NFSv4, see below). Read and write speeds were not as high as I would have thought, about 1-3 MB/s. But that was a problem for later, and I don’t really need that much speed anyway.
I only needed simple incremental backups, so the rdup-simple script should be fine.
Then the first problem arose: there were a lot of ‘no such file or directory’ errors. After manually retracing the steps rdup-simple takes, I found out that rdup-up sometimes failed to create a directory, and then of course failed when it tried to write files into it. This happened for different directories on every run, so I suspect some form of race condition.
I asked around, and a friend who uses it said he had never seen this with his version (1.1.11 IIRC).
Time to upgrade.
After getting rid of the repository versions and building a fresh rdup from the latest release (long live apt-get build-dep!), rdup-simple proceeded without a hitch.
I also had a number of media files I wanted to copy, so I simply scp’d them to the mounted share.
And then the real trouble started.
While it was copying one of the video files, scp aborted with an I/O error. I tried the copy again, and this time the file that had failed went through. But shortly after that, while copying another big file, the Thecus started beeping continuously.
Looking at the diagnostics, it reported a failing disk (drive 2). Since I had already run the SMART checks on this brand new disk, I didn’t think it was actually dying on me, so I tried to find out what was really happening. Good thing I had that SSH access; dmesg showed lots and lots of problems:
Periodic exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen from sata drive hard resetting link
followed by a number of messages:
device reported invalid CHS sector 0
And of course the RAID was in degraded mode. Telling it to fix itself started a rebuild. As you may know, RAID rebuilds take a long, long time, and after about 10 hours (at 92 percent), disks 1 and 3 dropped with similar errors. Bye bye RAID.
This seems to be a problem with the driver, as similar problems on other systems are all over the web. It is often blamed on bad cabling, but there are no cables in this machine (the SATA connections are wired directly onto a board, and if that board were bad, I’d expect to see many, many more complaints on the Thecus forum).
Another thing I noticed is that these are 6 Gb/s disks, and the Thecus should handle them fine (they are on the compatibility list), but they are connected at 3.0 Gb/s, not 6.0.
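If you want to keep an eye on these errors during a long rebuild, counting them is easy enough. This is a rough sketch and not something I ran on the Thecus itself. The patterns are just the fragments from the dmesg output quoted above, the sample text is stitched together from those same fragments, and the function and pattern names are made up for illustration.

```python
import re
from collections import Counter

# Fragments copied from the dmesg messages quoted in this post.
PATTERNS = {
    "frozen": re.compile(r"exception Emask .* frozen"),
    "link_reset": re.compile(r"hard resetting link"),
    "invalid_chs": re.compile(r"invalid CHS sector"),
}


def count_sata_errors(log_text):
    """Count how many lines of log_text match each known error pattern."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Sample input built only from the messages quoted above.
sample = "\n".join([
    "Periodic exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen "
    "from sata drive hard resetting link",
    "device reported invalid CHS sector 0",
    "device reported invalid CHS sector 0",
])
counts = count_sata_errors(sample)
print(dict(counts))
```

In practice you would feed it the saved output of dmesg instead of the sample string.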
This might be related.
So I seriously contemplated getting rid of the firmware and trying to install a real OS on the system. But there was one more thing to try first.
The version of the disk firmware was CC4B, and there were some reports with other Thecus machines that depending on the firmware version, some disks of the same model worked and some did not.
I put my disks in another machine and updated the firmware to CC4H. Fun!
Steps for updating the Seagate disks:
- Boot my main Linux machine into Windows for the Seagate firmware update tool
- The firmware update tool reboots the machine into a mini Linux version
- which updates the disk
- then reboots the machine (defaulting to the main Linux)
- Repeat for the other disks
Shoving the disks back into the Thecus, I got a small disappointment: the disks were still negotiated at 3 Gb/s. Oh well.
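To put the link speed in perspective: SATA uses 8b/10b encoding on the wire, so every data byte costs 10 line bits. A quick back-of-the-envelope sketch (the function name is mine, purely for illustration):

```python
def sata_payload_mb_per_s(line_rate_gbit):
    """Usable bandwidth in MB/s of a SATA link with 8b/10b encoding."""
    line_bits_per_s = line_rate_gbit * 10**9
    data_bits_per_s = line_bits_per_s * 8 / 10  # 8 data bits per 10 line bits
    return data_bits_per_s / 8 / 10**6          # bits -> bytes -> MB


for rate in (3.0, 6.0):
    print(f"{rate} Gb/s link -> {sata_payload_mb_per_s(rate):.0f} MB/s")
# prints:
# 3.0 Gb/s link -> 300 MB/s
# 6.0 Gb/s link -> 600 MB/s
```

Even the slower link leaves far more headroom than the transfer speeds I was seeing, so the negotiated speed is probably more of a cosmetic issue than a bottleneck here.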
I rebuilt the RAID array (this time as RAID-1; I care more about redundancy than write speed and size). This time I also tracked down the NFS problem, or at least a workaround: when mounting with NFSv4, the shares got a weird UID, and chown failed with Invalid Argument. Mounting with NFSv3 makes it work.
Using RAID-1 and NFSv3 gave me 10 MB/s write speed and 2 GB/s read speed. Now that is more like it! (Again, I don’t care too much about write speed, but 10 MB/s is a lot better than 1.) (EDIT: hmm, that was with a zeroed test file I had just created; for normal files, read speed appears to be about 10 MB/s as well.)
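That bogus read figure is a good reminder to test with realistic data. My guess is that the freshly written file was simply served from the local cache, though I have not verified that. Below is a minimal sketch of a write test that uses random data and waits for fsync before stopping the clock. The function name and sizes are arbitrary, and the demo writes to the system temp directory, which you would swap for the mount point of the share.

```python
import os
import tempfile
import time


def write_throughput_mib_per_s(directory, size_mib=64, chunk_mib=4):
    """Write random data into `directory` and return the rate in MiB/s."""
    chunk = os.urandom(chunk_mib * 1024 * 1024)  # random, so not all zeroes
    chunks = size_mib // chunk_mib
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(chunks):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # make sure the data actually left the cache
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the test file
    return (chunks * chunk_mib) / elapsed


# Demo against the local temp directory; point this at the NFS mount instead.
rate = write_throughput_mib_per_s(tempfile.gettempdir(), size_mib=16)
print(f"{rate:.1f} MiB/s")
```

Measuring reads honestly is harder, since the client cache has to be out of the picture first, so this only covers the write side.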
I have been stress-testing it all day now, and have just started the backups again. No disk I/O errors so far.
So if you get those, check for disk firmware updates before doing anything drastic.