xfs on lvm on hardware RAID: correct parameters?

sebschub asked:

I have 10 disks with 8 TB each in a hardware RAID6 (thus, 8 data disks + 2 parity). Following the answer of a very similar question, I hoped for an automatic detection of all necessary parameters. However, when creating the XFS file system at the end, I got

# mkfs.xfs /dev/vgdata/lvscratch 
meta-data=/dev/vgdata/lvscratch  isize=256    agcount=40, agsize=268435455 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=10737418200, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

It looks as if no striping information was detected (sunit=0, swidth=0). Because different sites use different terms (strip size, stripe size, stripe chunk, …), I would like to ask whether I got the manual parameters right.

The RAID 6 was set up with a strip size of 256KB:

# ./storcli64 /c0/v1 show all | grep Strip
Strip Size = 256 KB

Thus, the stripe size is 8*256KB = 2048KB = 2MB. Is this correct? According to this (and if I understand it correctly), the pvcreate has to use the strip (or chunk) size as argument to dataalignment:

# pvcreate --dataalignment 256K /dev/sdb
  Physical volume "/dev/sdb" successfully created

Note that I used the whole RAID device without partitions. Now a

# vgcreate vgdata /dev/sdb
  Volume group "vgdata" successfully created

with a default PE Size of 4MB should be fine because it is a multiple of the stripe size of 2MB. Correct?

Now, a part of the vgroup is assigned to a logical volume:

# lvcreate -L 40T vgdata -n lvscratch 
  Logical volume "lvscratch" created.

Finally, the file system is created but now with the correct arguments (stripe size of 2MB, stripe width of 8):

# mkfs.xfs -d su=2048k,sw=8 /dev/vgdata/lvscratch 
meta-data=/dev/vgdata/lvscratch  isize=256    agcount=41, agsize=268434944 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=10737418240, imaxpct=5
         =                       sunit=512    swidth=4096 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Is this approach correct? Is there anything to keep in mind when extending the logical volume or the volume group? I suppose that if the volume group were extended with another RAID6 system, its strip size should be equal to that of the present RAID6.

EDIT: My confusion seems to be mainly based on the different usage of terms connected to stripe. The manufacturer of my RAID controller, LSI or Avago, defines the terms in the following way:

Stripe Width

Stripe width is the number of drives involved in a drive
group where striping is implemented. For example, a four-disk drive
group with disk striping has a stripe width of four.

Stripe Size

The stripe size is the length of the interleaved data segments that the
RAID controller writes across multiple drives, not including parity
drives. For example, consider a stripe that contains 64 KB of disk
space and has 16 KB of data residing on each disk in the stripe. In
this case, the stripe size is 64 KB, and the strip size is 16 KB.

Strip Size

The strip size is the portion of a stripe that resides on a
single drive.

Wikipedia (and IBM) seem to use other definitions:

The segments of sequential data written to or read from a disk before
the operation continues on the next disk are usually called chunks,
strides or stripe units, while their logical groups forming single
striped operations are called strips or stripes. The amount of data in
one chunk (stripe unit), often denominated in bytes, is variously
referred to as the chunk size, stride size, stripe size, stripe depth
or stripe length. The number of data disks in the array is sometimes
called the stripe width, but it may also refer to the amount of data
within a stripe.

The amount of data in one stride multiplied by the number of data
disks in the array (i.e., stripe depth times stripe width, which in
the geometrical analogy would yield an area) is sometimes called the
stripe size or stripe width. Wide striping occurs when chunks of
data are spread across multiple arrays, possibly all the drives in the
system. Narrow striping occurs when the chunks of data are spread
across the drives in a single array.

Even in the Wikipedia text above, stripe size is used with two different meanings. However, I now suppose that, when creating the xfs file system, the size of a single chunk stored on a single drive has to be given as the argument to su. Thus, it should be mkfs.xfs -d su=256k,sw=8 in the command above. Correct?

My answer:


Rather than “strip size” and “stripe size”, the XFS man pages use the terms “stripe unit” and “stripe width” respectively.

This makes it possible to decode the otherwise confusing text in the mkfs.xfs(8) man page:

               sunit=value
                      This is used to specify the stripe unit for  a  RAID
                      device  or  a  logical  volume.  The value has to be
                      specified in 512-byte block units. Use the su subop‐
                      tion  to specify the stripe unit size in bytes. This
                      suboption ensures  that  data  allocations  will  be
                      stripe  unit aligned when the current end of file is
                      being extended and the  file  size  is  larger  than
                      512KiB.  Also inode allocations and the internal log
                      will be stripe unit aligned.

               su=value
                      This is an alternative to using sunit.  The su  sub‐
                      option is used to specify the stripe unit for a RAID
                      device or a striped logical volume. The value has to
                      be  specified  in  bytes,  (usually using the m or g
                      suffixes). This value must  be  a  multiple  of  the
                      filesystem block size.

So, with your array reporting a strip size of 256KiB, you would specify either su=256K or sunit=512 (because 512 blocks of 512 bytes equal 256KiB).
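
As a quick check of that arithmetic, here is a minimal shell sketch that converts a strip size into both spellings. The variable names are made up for illustration, and the 256 KiB value is simply the one reported by storcli64 in the question:

```shell
# Convert the controller's strip size into the two mkfs.xfs spellings.
strip_kib=256                     # "Strip Size = 256 KB" from the question
su_bytes=$((strip_kib * 1024))    # su= is given in bytes (or with a k/m/g suffix)
sunit=$((su_bytes / 512))         # sunit= is given in 512-byte units
echo "su=${strip_kib}k  sunit=${sunit}"
```

Either value describes the same thing: the amount of data written to one disk before the controller moves on to the next.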

               swidth=value
                      This  is used to specify the stripe width for a RAID
                      device or a striped logical volume. The value has to
                      be  specified  in  512-byte  block units. Use the sw
                      suboption to specify the stripe width size in bytes.
                      This  suboption  is  required  if  -d sunit has been
                      specified and it has to be  a  multiple  of  the  -d
                      sunit suboption.

               sw=value
                      suboption is an alternative to using swidth.  The sw
                      suboption is used to specify the stripe width for  a
                      RAID  device or striped logical volume. The value is
                      expressed as a multiplier of the stripe  unit,  usu‐
                      ally the same as the number of stripe members in the
                      logical volume configuration, or  data  disks  in  a
                      RAID device.

                      When  a  filesystem  is  created on a logical volume
                      device, mkfs.xfs will automatically query the  logi‐
                      cal volume for appropriate sunit and swidth values.

With 10 spindles (8 data, 2 parity) you would specify either sw=8 (the number of data spindles) or swidth=4096 (the 2 MiB full stripe, i.e. the strip size multiplied by the data spindles, expressed in 512-byte units).
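
The stripe width follows from the same numbers. The sketch below assumes the geometry from the question (256 KiB strips, 8 data disks) and uses illustrative variable names. Because the man page excerpt above gives swidth in 512-byte units, the 2 MiB full stripe comes out as a count of such units:

```shell
# Derive the stripe width from strip size and data-disk count.
strip_kib=256
data_disks=8                              # 10 drives minus 2 parity
sunit=$((strip_kib * 1024 / 512))         # stripe unit in 512-byte units
swidth=$((sunit * data_disks))            # stripe width in 512-byte units
stripe_kib=$((strip_kib * data_disks))    # full data stripe in KiB
echo "sw=${data_disks}  swidth=${swidth}  (full stripe: ${stripe_kib} KiB)"
```

Note that swidth is a whole multiple of sunit here, as the man page requires.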

TL;DR:

The easiest way to specify these is usually by strip size and data spindle count (excluding parity), thus su= strip size and sw= number of data spindles.
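
Put together for the array in the question, this could look like the sketch below. It only prints the command rather than running it, and it assumes the device path and the 4096-byte block size shown in the question's mkfs.xfs output. Judging by that output, mkfs.xfs reports sunit and swidth in filesystem blocks, so the expected values are computed in those units:

```shell
# Build the mkfs.xfs command line and predict the values it should report.
strip_kib=256
data_disks=8
fs_block=4096                        # bsize=4096 in the question's output
dev=/dev/vgdata/lvscratch            # logical volume from the question

cmd="mkfs.xfs -d su=${strip_kib}k,sw=${data_disks} ${dev}"
echo "$cmd"                          # printed only; run it by hand once checked

# Expected "sunit=... swidth=... blks" in the data section, in filesystem blocks.
exp_sunit=$((strip_kib * 1024 / fs_block))
exp_swidth=$((exp_sunit * data_disks))
echo "expect: sunit=${exp_sunit} swidth=${exp_swidth} blks"
```

If the reported numbers differ from the prediction (for example sunit=512 and swidth=4096 blks, as in the su=2048k run above), the full stripe was probably passed where the per-disk strip belongs. The geometry of an existing filesystem can be read back later with xfs_info.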


View the full question and answer on Server Fault.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.