Administrative problems similar to those described in the article Larger sector size drives (4K and "Advanced Format") can also occur with other hardware. For example, some SSDs have claimed 8K sectors and manipulate 256K-512K pages when reprogramming their flash chips, and SAN LUNs (including those backed by ZFS volumes over iSCSI or FC) can operate most efficiently with blocks of a specific size. Misalignment may place an unnecessary toll on IO performance (logical blocks regularly straddle different hardware sectors, requiring more IO operations and bandwidth to access, as well as read-modify-write cycles, and delays, to update), and possibly on storage reliability (more wear on SSD devices due to the extra IO; in particular, misaligned ZFS pool labels might get "torn" across too many hardware sectors and pages).
Good alignment usually means that the absolute starting offset (from the start of the disk) of the s0 slice (of the Solaris disk label) dedicated to the rpool, encapsulated in its MBR partition, should be divisible by the hardware sector size, and that the size of this slice should be a multiple of both the "cylinder size" (as defined by Solaris format for this disk's type) and the hardware sector size.
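The two alignment conditions above can be sketched as a quick arithmetic check. The numbers below are only an illustration, taken from the worked example later in this article (4KB sectors, 16065-block cylinders, and the s0 offset and size chosen there):

```shell
# Alignment sanity check, a sketch with example values (512-byte legacy blocks).
SECTOR_BYTES=4096                    # declared hardware sector size
CYL_BLOCKS=16065                     # cylinder size reported by format
S0_START_BLOCKS=64512                # absolute start of slice s0 on the disk
S0_SIZE_BLOCKS=$((8 * 16065 * 512))  # slice size picked later in this article

# Condition 1: the absolute start of s0 is divisible by the hardware sector size
[ $(( S0_START_BLOCKS * 512 % SECTOR_BYTES )) -eq 0 ] && echo "s0 start is sector-aligned"
# Condition 2: the slice size is a whole number of cylinders
[ $(( S0_SIZE_BLOCKS % CYL_BLOCKS )) -eq 0 ] && echo "s0 size is a whole number of cylinders"
```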
NOTE: Speculation and guesswork in the paragraph below
Depending on the (generally unknown) implementation of the storage controller on SSD devices, it may be desirable (I am not really sure about "must be") to configure your rpool device, and possibly other partitions on this storage device (such as those for ZIL and L2ARC usage for the data pool in smaller setups like a SOHO NAS server), in such a manner that IOs and storage of important system components (such as the pool labels) are well-aligned with regard not only to the sector size, but also to the page size. For SSDs with a page size of 256KB it seems natural to want the four ZFS label copies to be 256K-aligned in terms of physical offsets. Possibly, for 512K pages, steps should also be taken to ensure that the four labels fall into different pages (playing with odd and even numbering in 256K alignment). Arguably, though, wear-leveling in modern SSD firmware, and colocation of different LBAs in the same hardware pages, may negate this particular effort.
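To illustrate the label-alignment idea, here is a sketch using the example numbers from later in this article. The four vdev labels sit at fixed offsets within the slice (L0 at 0, L1 at 256K, L2 and L3 in the last 512K), so if the slice starts page-aligned and is sized in whole 256KB pages, all four labels land exactly on page boundaries:

```shell
# Where do the four ZFS labels land relative to 256KB pages? Example values only.
PAGE=$((256 * 1024))                   # assumed SSD page size, bytes
SLICE_START=$((64512 * 512))           # absolute slice offset in bytes (example)
SLICE_SIZE=$((8 * 16065 * 512 * 512))  # slice size in bytes (example)

# Label offsets within the vdev: L0=0, L1=256K, L2=size-512K, L3=size-256K
for OFF in 0 $PAGE $((SLICE_SIZE - 2 * PAGE)) $((SLICE_SIZE - PAGE)); do
    ABS=$((SLICE_START + OFF))
    echo "label at page $((ABS / PAGE)), in-page offset $((ABS % PAGE))"
done
```

With these example values every in-page offset comes out as 0, i.e. no label is "torn" across two pages.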
First of all, let's go over the technical basics – what do these layouts look like as offsets of the on-disk allocation?
|LBA 0, physical offset 0 (rarely 1) – start of the MBR (the MBR table itself is in sector number 0, along with the initial bootloader code)|
|Start of MBR partition 1 (which in our example contains the Solaris SMI label and the rpool)|
|The slice #2 (conventionally the "backup" slice covering the whole Solaris partition)|
|The slice #0 for rpool|
|The end of slice #0 for rpool|
|Optional other slices in this label (maybe for ZIL or L2ARC devices, or components of other pools – usable by Solaris/illumos only)|
|2 "reserved" cylinders (optional, as decided by the OS?)|
|Optional other partitions for other OSes (dual-booted systems), or for ZIL (log) or L2ARC (cache) devices|
The SMI slice table is managed through the format command interface. The slice sizes are multiples of the cylinder size.
Beware of changing the rpool disks' connection type (i.e. from Legacy IDE to Native SATA) or extending a mirror: this might cause two identical drives to be assigned different "types" and, in my experience, different cylinder sizes (so the SMI table could not be copied from one disk to another, because the slice sizes are counted in full cylinders). Setting the new disk's "type" to that of the old disk fixed the problem for me.
Legacy partitioning tools would create partitions such as p1 starting at LBA #63 and would use "clusters" of 4K or more; with a one-sector shift such partitions would start at a hardware offset of 64 "legacy sectors", which is a multiple of such a disk's hardware sector size.
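The one-sector shift is easy to verify numerically. With 4KB hardware sectors (8 legacy 512-byte sectors each), LBA 63 straddles a sector boundary, while LBA 64 falls exactly on one:

```shell
# A 4KB hardware sector spans 8 legacy 512-byte sectors.
echo $(( 63 % 8 ))        # prints 7 -- LBA 63 is misaligned
echo $(( (63 + 1) % 8 ))  # prints 0 -- a one-sector shift aligns it
```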
For example, two disks in a smaller server might host a mirrored rpool or a mirrored dump pool or even "raw device(s)", with the rest of the disks used as a 4-device raid10 pool. If dual-booting is not a requirement, this is all best orchestrated by a single MBR partition spanning the whole disk, with a Solaris SMI slice table inside it. On the other hand, components of non-root pools (including data or cache/log leaf VDEVs) are not required to be SMI slices, and may be MBR or GPT partitions...
It is not recommended to share the same disks between the rpool and L2ARC or ZIL devices, at least partially due to the wear that the frequently written cache devices would induce on the media which also houses your operating system. If possible, store the rpool and the caches on different mirrored pairs, at least on production systems.
As detailed in the ZFS On-Disk Specification (page 7), the VDEV layout (content of the
rpool slice) starts and ends with copies of the device labels:
L0 through L3 are copies of the labels, which are atomically updated in several separate IOs (first one pair of labels, then the other), so that at least one copy probably remains intact in case of a failure during a label update. The second half (128K) of each label contains a "ring of slots" for "Uberblocks", and the hardware sector size (or rather the ashift value) determines how many UBs fit there and thus how many rollback transactions may be possible: from 128 slots for 512b/512e devices down to 32 slots for 4k devices.
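The slot counts above follow from the on-disk format: each uberblock occupies max(1KB, 1&lt;&lt;ashift) bytes of the 128KB ring. A quick sketch of the arithmetic:

```shell
# Uberblock ring capacity: 128KB ring, each slot is max(1KB, 1<<ashift) bytes.
for ASHIFT in 9 12; do
    UB=$(( (1 << ASHIFT) > 1024 ? (1 << ASHIFT) : 1024 ))
    echo "ashift=$ASHIFT: $(( 131072 / UB )) uberblock slots"
done
# prints:
#   ashift=9: 128 uberblock slots
#   ashift=12: 32 uberblock slots
```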
The hands-on math
Now, there is some math to do
The trick of this setup is that the slice used for rpool should start and end at a hardware-sector-aligned offset, or even a page-aligned offset for SSDs; however, the slice size should be a multiple of the "cylinder size" used by the OS for this particular disk.
First, you should create an MBR partition (with
fdisk, including an
fdisk invocation from
format) and a slice table in it (with
format), just to see whether your OS instance would require the
alternates slices for the drive in question, and what cylinder sizes it would assign:
So, from the example partitioning above we can find that the system would assign the two reserved cylinders at the tail, as well as one overhead
boot slice and two
alternates slices, with the cylinder size being
16065 blocks (of 512 bytes each). For the curious, this layout came from an SSD attached in Legacy IDE mode. Its twin, which arrived later and was attached straight to a SATA port, offered a very different layout (and disk type) in format. The example is also slightly edited: the actual disk was over 100GB in size, but the partition for these tests was manually created as a smaller one.
Typical cylinder sizes are 16065 blocks, and sometimes a value around 12 thousand, though on VirtualBox I often see 4096 blocks. The cylinder size can be verified with the fdisk command as well (provide p0 as the device); a VirtualBox example follows:
Continuing with the first example, let's assume that we use an SSD with 256KB pages (512 legacy sectors of 512 bytes) and a declared 4KB sector size (8 legacy sectors), and that the disk is known or assumed to use non-shifted absolute addressing (zero absolute offset is the real start of its hardware addressing for the purposes of alignment).
For some reason we do care about compatibility with legacy partitioning tools, and use track-sized offsets for the partition start and size (one track = 63 legacy sectors). With all that, we want the start of s0 aligned right on a multiple of 4096 bytes, and the partition sized as a whole number of SSD pages.
|L (>0) tracks = L * 63 sectors|
|1 cylinder = 255 tracks = 16065 sectors|
|3 cylinders = 3 * 255 * 63 = 48195 sectors|
|1 page = 512 sectors (256KB)|
|N pages = N * 512 sectors|
|M track-aligned pages = M * 63 * 512 sectors|
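The last row deserves a note: since a track (63 sectors) and a page (512 sectors) share no common factor, the smallest block count that is both track-aligned and page-aligned is simply their product:

```shell
# Smallest count of legacy sectors that is a whole number of both
# tracks (63 sectors) and 256KB pages (512 sectors); gcd(63, 512) = 1.
TRACK=63; PAGE=512
echo $(( TRACK * PAGE ))   # prints 32256, i.e. 512 tracks = 63 pages
```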
The table above hopefully describes all the values involved in this math. As a result of the desired layout, the starting offset and the size of the p1 partition will be a multiple of the "track" size, while the start of the p2 partition will also be SSD page-aligned and track-aligned, with a minimal gap between the two. This keeps other partitioning tools happy, and IOs to other partitions should be efficient.
So, the absolute shift from start of disk to start of
s0 is a multiple of 256KB (or
N*512 legacy sectors), and it should be no less than one track plus 3 reserved cylinders (or
63*(1+3*255)). The latter "lower bound" amounts to
63*766=48258 legacy sectors, or roughly 94.3 pages (of 256KB). This offset should also be a multiple of track size (63 sectors) because cylinder size incorporates tracks.
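The search implied by this lower bound can be sketched as follows: round 48258 sectors up to the nearest multiple of the combined track-and-page unit (32256 sectors):

```shell
# Smallest offset that is a multiple of both the track (63 sectors) and the
# page (512 sectors), and no less than the 48258-sector lower bound.
STEP=$((63 * 512))             # 32256 sectors: track- and page-aligned unit
MIN=$((63 * (1 + 3 * 255)))    # 48258 sectors: 1 track + 3 overhead cylinders
OFF=$(( (MIN + STEP - 1) / STEP * STEP ))
echo "$OFF sectors ($((OFF / 63)) tracks, $((OFF / 512)) pages)"
# prints: 64512 sectors (1024 tracks, 126 pages)
```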
The nearest fitting solution for the absolute offset of the slice is 63*1024 = 64512 legacy sectors (126 pages of 256KB).
In this case, the offset for the proper partition start is p0..p1 = (1024-3*255)*63 = 259*63 = 16317 sectors. This "wastes" roughly 8MB near the start of the disk... let's consider it a small unexpected reserve of empty pages for the SSD's redundancy rotation of chips.
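In other words, the partition must begin 3 overhead cylinders before the chosen s0 offset, so that the boot and alternates slices push s0 right onto the page-aligned position:

```shell
# Partition start = chosen s0 offset minus the 3 overhead cylinders.
S0=$((63 * 1024))              # 64512 sectors: the page-aligned start of s0
P1=$(( S0 - 3 * 255 * 63 ))    # subtract 3 cylinders of 255 tracks each
echo "$P1 sectors"             # prints: 16317 sectors (a whole 259 tracks)
echo "$(( P1 * 512 )) bytes skipped before p1"   # just under 8MB
```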
The slice size is a more complicated beast: it should be a multiple of both the cylinder size and the SSD page size. In this example, the unit of measurement is
(63*255)*512 legacy sectors, or almost 4GB. A few of these are sufficient for a pool, depending on your installation size and expectations to expand (i.e. ability to add software and boot environments for safe updates, and whether you would store any
swap volumes on the
rpool). In this example, I opted for an approximately 33GB
rpool, or 8 such nameless size units.
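The size-unit arithmetic can be double-checked as well. The cylinder (16065 sectors) is odd, so it shares no factor with the page (512 sectors), and the smallest common unit is their product:

```shell
# Slice-size unit: a multiple of both the cylinder (16065 sectors) and the
# page (512 sectors); since 16065 is odd, the unit is simply 16065 * 512.
UNIT=$((16065 * 512))                   # 8225280 sectors per unit
echo "$(( UNIT * 512 )) bytes/unit"     # prints: 4211343360 bytes/unit (~3.9GiB)
echo "$(( 8 * UNIT * 512 / 1000000000 )) GB for 8 units"   # prints: 33 GB for 8 units
```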
Now we will wipe and recreate the partition table with proper values. Following is an example screenshot of a system actually configured with the math described above.
I hope the commands and outputs above are pretty self-explanatory. The gap between p1 and p2 happens to be non-zero, but an acceptable whole two tracks (126 sectors).
Accept format's suggestion of the Solaris SMI labelling, and create the slice for the rpool.
Now we can proceed to Advanced - Creating an rpool manually.