Deciphering continuing mpt2sas syslog messages

Chris Smith asked:

Summary

I have been getting these cryptic messages in syslog since I installed some new hardware and I can’t figure out what the problem is, if it’s serious, or what to do about it.

They’re from the new SATA HBA and they follow a pattern. I will get several of the first message followed by several of the second message 5-30 seconds later. They come in bursts that are all logged in the same second, and the exact number of each varies between about 2 and 35. It can be minutes or hours between appearances of the entries.

Example of the two messages:

Jul 13 06:06:23 durandal kernel: [366918.435596] mpt2sas0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Jul 13 06:06:28 durandal kernel: [366923.145524] mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)

It is always 0x31120303 followed by 0x31110d01.
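To put numbers on the burst pattern described above, the syslog lines can be grouped by timestamp and log_info value. The following is a minimal sketch, assuming the line format shown in the two example messages; the function name count_bursts is hypothetical, and the sample list just repeats those two lines.

```python
import re
from collections import Counter

# Matches lines in the format quoted in the question, e.g.
# "Jul 13 06:06:23 durandal kernel: [366918.435596] mpt2sas0: log_info(0x31120303): ..."
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:.*?"
    r"(?P<dev>mpt2sas\d+): log_info\((?P<info>0x[0-9a-fA-F]{8})\)"
)

def count_bursts(lines):
    """Count entries per (timestamp, device, log_info value)."""
    bursts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            bursts[(m["stamp"], m["dev"], m["info"].lower())] += 1
    return bursts

# The two example lines from the question.
sample = [
    "Jul 13 06:06:23 durandal kernel: [366918.435596] mpt2sas0: "
    "log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)",
    "Jul 13 06:06:28 durandal kernel: [366923.145524] mpt2sas0: "
    "log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)",
]

for (stamp, dev, info), n in sorted(count_bursts(sample).items()):
    print(stamp, dev, info, n)
```

Feeding it the real log (for example, the lines of /var/log/syslog) instead of sample would show how large each burst is and how the two codes pair up over time.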

mpt2sas is the driver for the SATA host bus adapter I’m using but the error content is overly cryptic. It doesn’t tell me what the problem is, which disk or port is affected, or how severe it is.

Hardware

Supermicro X9SCL with a Xeon E3-1220 and 8GB of RAM.

LSI SAS2008 based Supermicro AOC-USAS2-L8I SAS/SATA HBA connected to a Supermicro CSE-M35T-1B disk tray set. It has three Western Digital WD30EZRX and two Seagate ST3000DM001 plugged into it. All 3TB drives (exact same number of sectors actually). No port expanders in use.

The HBA, disk trays and 4 of the drives are new. One of the WD30EZRXes has been in for months, had no problems with it. Had it connected to the integrated Intel SATA controller previously, moved it into the drive bays with this new setup.

Had problems with the HBA needing to reset frequently and getting really awful performance. Updated the firmware/bios to “Phase 12”, the latest release available from Supermicro and changed the type to IT (i.e. passthrough, from IR for integrated raid since I was going to use all software raid): 2008IT12.FW. That update cleared up all the early issues and I didn’t start getting the above messages until later (see below).

The first four disks I added are all on the first SFF-8087 port (split to 4 SATA cables). The latest disk I added is on the other port, if that matters.

The only other disk on the system contains the OS, and is an older Intel 80GB SSD plugged into the integrated SATA controller.

Software

Ubuntu 11.10 (oneiric). Linux 3.0.0-14-server x86_64. Using the mpt2sas driver that comes with the OS.

Trying to build a RAID6 array using Linux md with those five disks. Started with a degenerate array of 3 disks, the two Seagates and one of the new WD drives. This was fast and went very well, no messages in the logs after I did the firmware update. Meanwhile, I am still using the old WD disk on port 0 of the same controller.

Added the other new WD disk to the array. Rebuild started and I am now getting those messages in syslog periodically. I’m not sure how long it’s supposed to take to add a disk to the array but the estimated time (cat /proc/mdstat) ranges from thousands to tens of thousands of minutes, much longer than it took the first 3 disks. I do understand that the WD disks are much slower; I got different models to cut down on the chances of multiple disk failure, and those were the two cheapest 3TB models.
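The estimate in /proc/mdstat is easier to track over time if the progress line is parsed rather than read by eye. This is a rough sketch, assuming the usual md progress line with "finish=...min" and "speed=...K/sec" fields; parse_progress is a hypothetical helper, and the sample line is made up for illustration, not taken from the system in question.

```python
import re

# Progress line of an md array that is rebuilding, resyncing, reshaping or checking.
PROGRESS_RE = re.compile(
    r"(?P<op>recovery|resync|reshape|check)\s*=\s*(?P<pct>[\d.]+)%"
    r".*?finish=(?P<finish>[\d.]+)min\s+speed=(?P<speed>\d+)K/sec"
)

def parse_progress(text):
    """Return progress details from mdstat text, or None if no operation is running."""
    m = PROGRESS_RE.search(text)
    if m is None:
        return None
    return {
        "operation": m["op"],
        "percent": float(m["pct"]),
        "finish_hours": float(m["finish"]) / 60.0,
        "speed_k_per_sec": int(m["speed"]),
    }

# Illustrative line only; the numbers are invented.
sample = ("      [>....................]  reshape =  1.2% "
          "(36000000/2930135040) finish=20000.0min speed=2400K/sec")

print(parse_progress(sample))
```

On the real machine the input would be the contents of /proc/mdstat. Logging the speed alongside the mpt2sas message timestamps could show whether the slowdowns line up with the bursts.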

Notes

SMART does not report any problems on any disks. There are no logged errors on any disks and none of the failure stats are anywhere near threshold.

The logged messages only started appearing after I added the last disk, which suggests that one may be having a problem but I have nothing else pointing to that.

I did find a header file that seems to correspond to the logging messages from this driver. The first message seems to be an abort (code 12) for a “sub code” 0303 that isn’t listed. The second message is a reset (code 11) for a reason that also isn’t clear. If I could determine what 0303 and 0d01 mean, that would be really helpful.
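For reference, the three fields the driver prints are just slices of the 32-bit log_info value. The sketch below assumes the layout implied by the messages themselves (top 4 bits bus type, next 4 bits originator, next 8 bits code, low 16 bits sub code) and an originator table of 0 = IOP, 1 = PL, 2 = IR; both are assumptions that happen to reproduce the logged fields, and decode_log_info is a hypothetical name. It does not say what sub codes 0303 or 0d01 mean.

```python
# Assumed originator numbering; PL = 1 matches the "originator(PL)" in the log.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}

def decode_log_info(value):
    """Split a 32-bit log_info value into its assumed bit fields."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }

for raw in (0x31120303, 0x31110D01):
    f = decode_log_info(raw)
    print("0x%08x: bus_type=%d originator(%s), code(0x%02x), sub_code(0x%04x)"
          % (raw, f["bus_type"], f["originator"], f["code"], f["sub_code"]))
```

Both values decode to the same originator, code and sub_code that the kernel printed, which at least confirms the header file is describing the same encoding.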

I know that 4 disks in a 5 disk RAID6 is an incomplete array. I’m planning to copy the contents of the old disk to the array once it finishes integrating the 4th disk and then add the old disk to the array as well.

My answer:


Wow, a tough one.

This seems to indicate that 0x31120303 is a bus reset due to one of your devices being under heavy load. It also says you don’t need to worry about it. (Haha, yeah right.)

This indicates that these log messages are happening because one of your devices is taking too long to respond to commands. This says the same thing, and also indicates it occurs under heavy load.

While this isn’t a complete answer, it hopefully will point you in a useful direction.


View the full question and answer on Server Fault.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.