The configuration below is relatively fragile to set up, so do not attempt it without practice on remotely-accessed computers that lack some means of access to the console (be it IPMI or a colleague who can act as your hands and eyes over the phone). It is also not a required setup, though it may be desired (and beneficial) in a number of cases.
This document may contain typos or factual errors. Test before you try. While great care has been taken to verify the sample commands, I have yet to make a full installation and split-root it by copy-pasting this page's contents to verify it completely. If somebody does this first – please leave a note here.
It may be desirable for a number of reasons to install the OI global zone not as a single uncompressed dataset (as was required until recently – before LZ4 compression became supported in GRUB and for the
rpool with the oi_151a8 dev-release), but as a hierarchy of datasets with a separate
/opt and maybe other datasets. While some such datasets contain parts of the OS installation, others like
/var/logs contain "usual" data which you may want shared (not cloned) between the different BE's (Boot Environments). This way whenever you reboot into one BE or another, such as during development or tests of new releases (and perhaps switching back to a "stable" BE for some reason), your computer's logged history would be appended to the same file regardless of the BE switcheroo.
Note that while these instructions are tailored for OpenIndiana, the history of this procedure in my practice tracks back to Solaris 10 and OpenSolaris SXCE (with LiveUpgrade instead of
beadm). Much of the solution is also applicable there, though some back-porting of the procedure may be needed.
What benefits can this bring? On one hand, greater compression (such as
gzip-9 for the main payload of the installed system binaries in
/usr). The default installation of OI (fresh from GUI LiveCD) is over 3Gb, which can compress over 2.5x to about 1Gb with
gzip-9 applied to
/var. Actually, given the possible SNAFUs with this setup, and the ability to compress just over 2x (down to 1.2Gb for a "monolithic" rootfs dataset on pools with
ashift=9, or about 2Gb on pools with ashift=12) using
lz4, which is now supported for the root datasets, this one benefit may be considered moot. Still, either of these brings some benefit to space-constrained systems, such as installations on cheap but small SSDs. Also, less on-disk data means fewer physical IOs (both operations and transferred bytes) during reads of the OS and its programs, at the (negligible) cost of decompression.
Another benefit, which is more applicable to the shared datasets discussed above, is the ability to assign ZFS
quota to limit some datasets from eating all of your
rpool, as well as
reservations to guarantee some space to a hierarchy based at some dataset, or
refreservations to guarantee space to a certain dataset (excluding its children, such as snapshots, clones and datasets hierarchically "contained" in this one). This helps improve resilience of the system against running out of space in the root pool.
In case of SSD-based
rpools, or especially of slow-media root pools such as ones on USB sticks or CF cards, it may also be beneficial to move some actively written datasets to another, HDD-based data pool, in order to avoid excessive wear of flash devices (and/or lags on USB devices). While this article does not go into such depths, I can suggest that
/var/cores can be relocated to another pool quite easily (maybe also
/var/crash and some others). And in case of slow-media and/or space-constrained
rpools you might want to relocate the swap and
dump volumes as well (for
swap it may suffice to add a volume on another pool and keep a small volume on
rpool if desired – it is not required, though).
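For instance, a minimal sketch of relocating swap and dump, assuming a data pool named dpool (the pool name and volume sizes are illustrative):

:; zfs create -V 4G dpool/swap
:; swap -a /dev/zvol/dsk/dpool/swap
:; swap -d /dev/zvol/dsk/rpool/swap          # optional: retire the rpool-based swap volume
:; zfs create -V 2G dpool/dump
:; dumpadm -d /dev/zvol/dsk/dpool/dump       # dumpadm prepares the volume; see dumpadm(1M)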
Also note that one particular space-hog over time can be
/var/pkg with the cache of package component files, whose snapshots as parts of older BEs can become unavoidable useless luggage. The possibility of separating this into a dedicated dataset, within each BE's individual rootfs hierarchy or shared between BEs, is a good subject for some separate research.
What to avoid and expect?
Good question. Many answers:
One problem for the split-root setup (if you want to separate out the
/usr filesystem) is that OpenIndiana brings
/sbin/sh as a symlink to
../usr/bin/i86/ksh93. Absence of the system shell (due to a not-yet-mounted /usr) causes
init to loop and fail early in OS boot.
When doing the split you must copy the
ksh93 binary and some libraries that it depends on from
/usr namespace into the root dataset (into /sbin and
/lib accordingly), and fix the
/sbin/sh symlink. The specific steps will be detailed below, and may have to be repeated after system updates (in case the shell or libraries are updated in some incompatible fashion).
My earlier research-posts suggested replacing the system shell with
bash; however, this has the drawback that the two shells are slightly different in syntax, and several SMF methods need to be adjusted. We have to live with it now –
ksh93 is the default system shell; it just happens to be inconveniently provided in a non-systematic fashion. A different delivery of
ksh93 and the libraries it needs is worthy of an RFE for packagers (TODO: find and provide the issue number in tracker).
Another (rather cosmetic) issue is that many other programs are absent in the minimized root without
/usr, ranging from
svc* SMF-management commands to
vi and so on. I find it convenient to also copy
bash and some of the above commands from /usr into
/sbin, though this is not strictly required for system operation – it just makes repairs easier.
Separating /var/tmp into a shared dataset did not work for me, at least some time in the past – some services start before
filesystem/minimal completes (which mounts such datasets) and either the
/var/tmp dataset can not mount into a non-empty mountpoint, or (if
-O is used for overlay mount) some programs can't find the temporary files which they expect.
Likewise, separation of
/root home directory did not work well: in case of system repairs it might not be mounted at all, and things get interesting.
It may suffice to mount a sub-directory under
/root from a dataset in the shared hierarchy, and store larger files there, or just make an
rpool/export/home/root and symlink to it from under /root.
Cloning BE's with
beadm currently does not replicate the original datasets' "local" ZFS attributes, such as compression and
(ref)reservation. If you use
pkg image-update to create a new BE and update the OS image inside it, you're in for a surprise: newly written data won't be compressed as you expected it to be – it will inherit compression settings from
rpool/ROOT (uncompressed or LZ4 are the likely candidates). While fixing this behaviour of
beadm is a worthy RFE as well (TODO: issue number), currently you should work around it by creating the new BE manually, re-applying the (compression) settings to the non-boot datasets (such as
/usr), mounting the new BE, and providing the mountpoint to
pkg commands. An example is detailed below.
Note that the bootable dataset (such as
rpool/ROOT/oi_151a8) must remain with the settings which are compatible with your GRUB's
bootfs support (uncompressed until recently, or with
lz4 since recently).
Finally, proper mounting of hierarchical roots requires modifications to respective system SMF methods. Patches and complete scripts are provided along with this article, though I hope that one day they will be integrated into
illumos-gate or OI distribution (TODO: issue number), and manual tweaks on individual systems will no longer be required.
How does it work (and what was fixed by patches)?
As far as I found out, the bootloader (GRUB) finds, or receives via a keyword, the bootable dataset of a particular boot environment. GRUB itself mounts it with limited read-only support to read the illumos kernel and mini-root image into memory, and passes control to the kernel along with some parameters, including the information about the desired boot device (a device-path taken from the ZPOOL labels on the disk which GRUB inspected as the
rpool component, and which should be used to start reconstruction of the pool – all cool unless it was renamed, such as from LegacyIDE to SATA... but that's a separate bug) and the rootfs dataset number. The kernel imports the specified pool from the specified device (and attaches mirrored parts, if any), and mounts the dataset as the root filesystem (probably
chroots somewhere in the process to switch from the miniroot image into the
rpool), but does not mount any other filesystem datasets.
Then SMF kicks in and drives system startup: it passes through networking and
metainit (for legacy SVM metadevice support, in case you have any filesystems located on those) and gets to
svc:/system/filesystem/root:default (implemented in
/lib/svc/method/fs-root shell script) which ensures availability of
/usr, and later gets to
svc:/system/filesystem/usr:default (/lib/svc/method/fs-usr) and
svc:/system/filesystem/minimal:default (/lib/svc/method/fs-minimal) and
svc:/system/filesystem/local:default (/lib/svc/method/fs-local) which mount other parts of the filesystems and do related initialization. Yes, the names are sometimes counter-intuitive.
In case of ZFS-based systems,
fs-root does not actually mount the root filesystem (it is already present), but rather ensures that
/usr is available, as it holds the bulk of programs used later on (even something as simple and frequent in shell scripting as
awk). The default script expects
/usr to be either a legacy mount specified explicitly in
/etc/vfstab (make sure to provide
-O mount option in this case), or a sub-dataset named
usr of the currently mounted root dataset. Finally, the script mounts
/boot (if specified in
/etc/vfstab) and the
libc.so hardware-specific shim, and reruns
devfsadm to detect hardware for drivers newly available from /usr.
The patched fs-root script adds optional console logging (enable by touching
/.debug_mnt), and enhances the case for ZFS-mounted root and
usr filesystems by making sure that the mountpoints of sub-datasets of the root filesystem are root-based and not something like
/a/usr (for all child datasets), and mounts
/usr with overlay mode (
zfs mount -O) – too often have mischiefs like these two left an updated system unbootable and remotely inaccessible. It also verifies that the mounted filesystem is "sane" (a
/usr/bin directory exists).
fs-usr script deals with setup of
dump, and the patch is minor (verify that
dumpadm exists, in case sanity of
/usr was previously overestimated). For non-ZFS root filesystems in global zone, the script takes care of re-mounting the
/usr filesystems read-write according to
/etc/vfstab, and does some other tasks.
fs-minimal mounts certain other filesystems from
/etc/vfstab or from the rootfs hierarchy. First it mounts
/tmp from the
/etc/vfstab file (if specified) or from rootfs child datasets (if sub-datasets exist and if
mountpoint matches). The script goes on to ensure
/var/run (as a
tmpfs) and mounts other not-yet-mounted non-
legacy child datasets of the current rootfs in alphabetic order.
The patched fs-minimal script adds optional console logging (enable by touching
/.debug_mnt), and allows mounting of the three mountpoints above from a shared dataset hierarchy. If the default mounting as a properly named and mountpointed child of the rootfs failed due to absence of a candidate dataset, other candidates are picked: the script now looks (system-wide, so other pools may be processed if already imported) for datasets with
canmount=on and appropriate
mountpoint. First, if there is just one match – it is mounted; otherwise, the first match from the current
rpool is used, or in absence of such – the first match from other pools which have the default
altroot (unset or set to
/). Another fix concerns the "other not-yet-mounted non-
legacy child datasets of the rootfs" – these are now mounted also in overlay mode, to avoid surprises due to non-empty mountpoints.
fs-local mounts the other filesystems from /etc/vfstab (via
mountall) and generally from ZFS via
zfs mount -a (this also includes the rest of the shared datasets, and note that errors are possible if mountpoints are not empty), and also sets up UFS quotas and
swap if there is more available now. No patches were needed here.
While the described patches (see fs-splitroot-fix.patch) are not strictly required (i.e. things can work if you are super-careful about empty mountpoint directories and proper
mountpoint attribute values, and the system does not unexpectedly or by your mistake reboot while you are in mid-procedure, or if you use
legacy mountpoints and fix up
/etc/vfstab in each new BE), they do greatly increase the chances of successful and correct boot-ups in the general case with dynamically-used boot environments, shared datasets and occasional untimely reboots.
The examples below assume that your currently installed and configured OS resides in
rpool/ROOT/openindiana and you want to relocate it into
rpool/ROOT/oi_151a8 with a hierarchy of compressed sub-datasets for system files (examples below use variables to allow easy upgrades of the procedure to different realities), and shared files like
crash dumps will reside in a dedicated shared hierarchy on the same rpool.
This procedure can be done as soon as you have installed a fresh system with the default wizard settings from the LiveCD/LiveUSB – right from the Live environment (if it is networked so that you can get the patched method scripts), or at any time in the future (including a clone of your live system – though note that some changes may be "lost" from the new BE in the timeframe between replicating and actually rebooting; to avoid this you might want to boot into another BE or into the Live media and do the procedure on the "cold" main BE).
This can also be done during a migration of an older system to a new
rpool for example, including a setup based on a clone of the Live media (see Advanced - Manual installation of OpenIndiana from LiveCD media), so that the hierarchical setup is done on your new
rpool right from scratch.
Here is an illustration of what we are trying to achieve:
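(A rough sketch of the resulting layout; the dataset names follow this article's examples, the SHARED container name is an example choice, and the sizes are only illustrative, based on the estimates earlier in the text.)

  rpool/ROOT/openindiana        ~3G    uncompressed   default monolithic install
  rpool/ROOT/openindiana-lz4    ~2G    lz4            copy of the above
  rpool/ROOT/oi_151a8           ~1.2G  gzip-9         split-root variant, with children:
      .../usr  .../usr/local  .../var  .../opt        cloned along with each BE
  rpool/SHARED/var                                    shared, not cloned with BEs:
      .../var/adm  .../var/mail  .../var/cores  .../var/crash  .../var/spool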
As you can see in the above example, the installed default OS (
rpool/ROOT/openindiana) and its LZ4-compressed copy (
rpool/ROOT/openindiana-lz4) are much larger than the split-root variant (
rpool/ROOT/oi_151a8). This may be an important difference on some space- or I/O-constrained storage options. The shared filesystems include containers for logs, mailboxes, OS crash and process coredump images, and GZ mailqueues (rebooting into another BE does not mean you don't want those messages delivered, right?) – these can be restricted with quotas or relocated to other pools.
This particular system also split off
/usr/local in order to allow easy creation of clones delegated into local zones – so as to provide modifiable sets of unpackaged programs with little storage overhead. This is not a generally needed scenario.
Now we're down to the dirty business ;)
Like in other low-level manuals, the user is expected to run as root (or prefix the commands with
pfexec as desired, if you run as a non-root user), and the shell prompt for commands you enter is shown as
:; for ease of copy-pasting.
Let's start by preparing some environment variables:
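A sketch of the kind of variables assumed in the samples below; the names and values are example choices for an in-place split on a running system (on a LiveCD with the pool imported under an altroot, the mountpoint-related values would differ), and the TGT/RSH settings only matter for remote copying:

:; RPOOL=rpool                  # target root pool
:; BEOLD=openindiana            # source BE under $RPOOL/ROOT
:; BENEW=oi_151a8               # name of the new split-root BE
:; BENEW_MNT=/a                 # where the new rootfs hierarchy is mounted for now
:; BENEW_MPT="$BENEW_MNT"       # temporary mountpoint offset for the shared datasets
:; TGT=""                       # e.g. "root@newhost:" for remote copies; empty locally
:; RSH=""                       # e.g. "ssh"; empty for local copies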
Note that these settings can be defined differently on the source and target hosts, if you clone the installation onto another machine, i.e. from a production system onto a new one (or a VM) currently booted with LiveCD which has the newly created
rpool alt-mounted under /a.
Optional step – creation of the new rpool
If you are migrating an installation to a new root pool, be it change of devices or cloning of an existing installation to a remote machine, you can take advantage of the new layout right away. If your devices have large native sectors or pages and would benefit from aligned access, then first you should settle on the partitioning and slicing layout which would ensure alignment of the
rpool slice. This is a separate subject, see Advanced - Aligning rpool partitions.
Then go on to Advanced - Creating an rpool manually and return here when done.
So, here the fun begins. One way or another, I assume that you have a (target)
rpool created and initialized with some general options and datasets you deemed necessary. This includes the case of splitting the installation within one machine and one
rpool, where you just continue to use the other datasets (such as
dump and the default admin-home tree under rpool/export/home).
Create the base rootfs, note that its compression should match GRUB's support:
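A sketch, assuming the variables from above; lz4 here presumes a GRUB new enough to read an lz4-compressed bootfs – otherwise leave compression off for this one dataset:

:; zfs create -o canmount=noauto -o compression=lz4 \
       -o mountpoint="$BENEW_MNT" "$RPOOL/ROOT/$BENEW"
:; zfs mount "$RPOOL/ROOT/$BENEW"
:; df -k "$BENEW_MNT"            # verify it is indeed mounted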
If any unexpected errors were returned or the filesystem was not mounted – deal with it (find the causes, fix, redo the above steps).
Now that you have the new root filesystem, prepare it for children, using your selection of sub-datasets. These will be individual to each OS installation, cloned and updated along with their BE. Generally this includes all locations with files delivered by "system" packages, which are likely to be updated in the future.
To follow the example above:
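A sketch, creating gzip-9 children for /usr, /usr/local, /var and /opt (adjust the list to taste; the immutability flag needs the Solaris /bin/chmod, and the exact attribute spelling should be checked against chmod(1) on your release):

:; for FS in usr usr/local var opt ; do \
       mkdir -p "$BENEW_MNT/$FS" && \
       /bin/chmod S+vimmutable "$BENEW_MNT/$FS" && \
       zfs create -o compression=gzip-9 -o canmount=noauto \
           "$RPOOL/ROOT/$BENEW/$FS" && \
       zfs mount -O "$RPOOL/ROOT/$BENEW/$FS" ; \
   done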
In the example above, mountpoint directories are protected from being written into by being made immutable. Note that this requires the Solaris (not GNU)
chmod, and that this does not work in Solaris 10 (if you backport the procedure).
Also note that at this point the sub-datasets inherit the
/a prefix in their mountpoints, and will fail to mount "as is" with the currently default scripts (
fs-minimal), unless you later unmount this tree and change the rootfs to use its final mountpoint (as is done in a later step below).
Next we prepare the shared filesystems. To follow the example above:
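A sketch; the SHARED container name is an example choice – the containers themselves stay unmountable and only propagate compression and mountpoints to their children:

:; zfs create -o canmount=off -o mountpoint=/ \
       -o compression=gzip-9 "$RPOOL/SHARED"
:; zfs create -o canmount=off -o mountpoint="$BENEW_MPT/var" \
       "$RPOOL/SHARED/var"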
This prepares the "container" datasets with predefined compression and mountpoint attributes; you can choose to define other attributes (such as
copies) at any level as well. These particular datasets are deliberately not mountable at all, so as not to conflict with OS-provided equivalents; they are only used to contain other (mountable) datasets and influence their mountpoints by inheritance, as well as to set common quotas and/or reservations. Also note that currently the shared
var components are not mounted into the
rpool altroot, but are offset by
"$BENEW_MPT" prefix. This will be fixed later, after data migration.
Now we can populate this location with applied datasets. Continuing with the above example of shared parts of the namespace under
/var, we can do this:
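A sketch; the list of shared /var components and the quota values are illustrative and should be adjusted to your site:

:; for FS in adm mail cores crash ; do \
       mkdir -p "$BENEW_MNT/var/$FS" && \
       /bin/chmod S+vimmutable "$BENEW_MNT/var/$FS" && \
       zfs create "$RPOOL/SHARED/var/$FS" ; \
   done
:; zfs set quota=2G "$RPOOL/SHARED/var/cores"
:; zfs set quota=2G "$RPOOL/SHARED/var/crash"
:; zfs set com.sun:auto-snapshot=false "$RPOOL/SHARED/var/cores"
:; zfs set com.sun:auto-snapshot=false "$RPOOL/SHARED/var/crash"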
NOTE: Don't split off
/var/tmp like this, at least not unless you are ready to test this as much as you can. It was earlier known to fail.
The example above creates the immutable mountpoint directories in the rootfs hierarchy's version of
/var, then creates and mounts the datasets into the new hierarchy's tree. Afterwards some typically acceptable quotas (YMMV) are set to protect the root file system from overfilling with garbage. Also,
zfs/auto-snapshot service is forbidden from making autosnaps of common space-hogs like
/var/crash, so that deletion of files from there to free up
rpool can proceed unhindered.
Now that the hierarchies have been created and mounted, we can fill them with the copy of an installation.
This chapter generally assumes that the source and target data may be located on different systems connected by a network, and appropriate clients and servers (SSH or RSH) are set up and working so that you can initiate the connection from one host to another. The case of local-system copying is a degenerate case of multi-system, with
TGT components and the
RSH flag all empty.
First of all, you need to provide the original filesystem image to copy. While a mounted alternate BE would suffice, the running filesystem image "as is" usually contains
libc.so and possibly other mounts, which makes it a poor choice for the role of clone's origin. You have a number of options, however, such as diving into snapshots, creating and mounting a full BE clone, or
lofs-mounting the current root to snatch the actual filesystem data (this case being especially useful back in the days of migration of Solaris 10 roots from UFS to ZFS).
The procedure may vary, depending on your original root filesystem layout – whether it is monolithic or contains a separate
/var, for example.
All of the examples use
rsync – it does the job well, except maybe for the lack of support for copying ZFS/NFSv4 ACLs until (allegedly) rsync-3.0.10. Flags used include:
-x – single-filesystem traversal (only copy objects from the source filesystem, don't dive into sub-mounts – verify and ensure that mountpoints like
/tmp ultimately exist on the targets)
-avPHK – typical recursive replication with respect for soft- and hard-links, and verbose reports
-z – if you copy over a slow network link, this would help by applying compression to the transferred data (not included in the examples below)
This is for systems with
beadm applicable to the selected source dataset (i.e. the source BE resides in the currently active root pool):
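A local-copy sketch; remote variants would pipe through $RSH, and the temporary BE and mountpoint names here are example choices:

:; beadm create -e "$BEOLD" "${BEOLD}-src"    # clone the source BE so the copy is quiescent
:; mkdir -p /mnt/src && beadm mount "${BEOLD}-src" /mnt/src
:; rsync -avPHK -x /mnt/src/ "$BENEW_MNT/"
:; rsync -avPHK -x /mnt/src/var/ "$BENEW_MNT/var/"   # repeat for any separately mounted source datasets
:; beadm umount "${BEOLD}-src" && beadm destroy "${BEOLD}-src"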
Now that you are done replicating the source filesystem image, don't rush to boot it. There are some more customizations to make.
Just in case you mess up in the steps below, have something to roll back to:
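For instance (the snapshot name is arbitrary):

:; zfs snapshot -r "$RPOOL/ROOT/$BENEW@populated"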
First of all, if you have split out the
/usr filesystem, you should make sure that
/sbin/sh is a valid working copy of the
ksh93 shell (or whichever is default for your system, in case of applying these instructions to another distribution). Some other programs, such as
ls, may also be copied from /usr into
/sbin (paths relative to your new rootfs hierarchy) at your convenience, for repairs without a mounted
/usr, but are not strictly required for OS operation.
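A sketch of the kind of copying involved; the library list below is an assumption based on typical ksh93 linkage (libsocket, libnsl, libm and libc already live in /lib), so verify it with ldd on your build:

:; ldd "$BENEW_MNT/usr/bin/i86/ksh93"
:; cp -p "$BENEW_MNT/usr/bin/i86/ksh93" "$BENEW_MNT/sbin/ksh93"
:; for L in libshell.so.1 libast.so.1 libcmd.so.1 libdll.so.1 libsum.so.1 ; do \
       cp -p "$BENEW_MNT/usr/lib/$L" "$BENEW_MNT/lib/$L" ; \
   done
:; rm -f "$BENEW_MNT/sbin/sh" && ln -s ksh93 "$BENEW_MNT/sbin/sh"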
Make backups of originals and get the files attached to this article. Examples below use
wget for internet access, but a non-networked system might require other means (like a USB stick transfer from another, networked, computer).
For other releases and distributions it may be worthwhile to get the patches as fs-splitroot-fix.patch and apply them. I hope that ultimately this logic will make it upstream and patching will no longer be necessary.
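For instance (a sketch; the patch location is a placeholder and the -p strip level depends on how the patch was rolled, so inspect it first):

:; for F in fs-root fs-usr fs-minimal ; do \
       cp -p "$BENEW_MNT/lib/svc/method/$F" "$BENEW_MNT/lib/svc/method/$F.orig" ; \
   done
:; gpatch -d "$BENEW_MNT" -p1 < /tmp/fs-splitroot-fix.patch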
Verify that $BENEW_MNT/etc/vfstab does not reference filesystems which you expect to mount automatically – such as the shared filesystems or the non-legacy children of the rootfs du-jour.
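A quick check (a sketch):

:; egrep -v '^#|^$' "$BENEW_MNT/etc/vfstab"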
The ssh SMF service normally depends on quite an advanced system startup state – with all user filesystems mounted and
autofs working. For us admins
ssh is a remote management tool which should be available as early as possible, especially for cases when the system refuses to mount some filesystems and thus fails to start some required dependency services.
For this administrative access to work in the face of failed
zfs mount -a (a frequent troublemaker), we'd replace the dependency on filesystem/local with one on
filesystem/usr which ensures that the SSH software is already accessible at least for admins:
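A sketch of the kind of change, done with svccfg once the new BE is booted (or against its repository via the SVCCFG_REPOSITORY variable); the dependency property-group name (fs-local here) is an assumption and should be confirmed with listprop first:

:; svccfg -s svc:/network/ssh:default listprop | grep dependency
:; svccfg -s svc:/network/ssh:default setprop \
       fs-local/entities = fmri: svc:/system/filesystem/usr
:; svcadm refresh svc:/network/ssh:default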
This is not strictly related to split-roots, but since we set up the
/var/cores dataset here – this is still a good place to advise about its nice system-wide setup. The configuration below enables the global zone to capture all process coredumps, including those which happen in the local zones, and place them into the common location. This way admins can quickly review if anything went wrong recently (until this location gets overwhelmed with data). Create the
/etc/coreadm.conf file and it will be sucked in when the
coreadm service next starts up in the new BE:
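A sketch of such a file; the variable names follow the format which coreadm(1M) itself maintains, so cross-check them against an existing /etc/coreadm.conf, and adjust the pattern to taste:

:; cat > "$BENEW_MNT/etc/coreadm.conf" << EOF
COREADM_GLOB_PATTERN=/var/cores/core.%z.%f.%u.%p
COREADM_GLOB_CONTENT=default
COREADM_INIT_PATTERN=core
COREADM_INIT_CONTENT=default
COREADM_GLOB_ENABLED=yes
COREADM_PROC_ENABLED=no
COREADM_GLOB_SETID_ENABLED=yes
COREADM_PROC_SETID_ENABLED=no
COREADM_GLOB_LOG_ENABLED=yes
EOF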
This is a simple one – just in case, run
bootadm on the new rootfs hierarchy:
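For instance:

:; bootadm update-archive -R "$BENEW_MNT"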
This is a pretty important step in making sure that datasets are mountable as expected on a subsequent boot:
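A sketch of the kind of adjustments meant here – dropping the temporary mountpoint offsets so that everything lands in place on the next boot (the exact commands depend on whether the pool is imported with an altroot; unmount the new hierarchy first if ZFS complains about busy datasets, and expect these datasets to stay unmounted until the reboot):

:; zfs set mountpoint=/var "$RPOOL/SHARED/var"   # drop the $BENEW_MPT offset
:; zfs set mountpoint=/ "$RPOOL/ROOT/$BENEW"     # children then inherit /usr, /var, /opt, ...
:; zfs list -r -o name,canmount,mountpoint "$RPOOL/ROOT/$BENEW" "$RPOOL/SHARED"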
We've done this before, can do it again:
This sets up the default root filesystem for booting (if not specified explicitly in GRUB menu file):
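For instance:

:; zpool set bootfs="$RPOOL/ROOT/$BENEW" "$RPOOL"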
As discussed earlier, this hierarchy also requires a bit of special procedure to upgrade the installation. While it is customary to have the
pkg command create all needed BE datasets and proceed with the upgrade in the newly cloned BE, we'd need to reenable compression and maybe some other attributes first.
Environment variables are similar to the ones used in the manual above, but there are fewer of them since we are playing within one rpool:
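For instance (the BE names are examples):

:; RPOOL=rpool ; BEOLD=oi_151a8 ; BENEW=oi_151a9 ; BENEW_MNT=/tmp/mnt-$BENEW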
So, we clone the current BE (from which we want to upgrade):
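For instance:

:; beadm create -e "$BEOLD" "$BENEW"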
This should create snapshots and clones of the rootfs dataset and its children – but alas, the process (currently) loses most of the ZFS locally defined attributes, such as compression.
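So re-apply the desired values to the non-bootable children (a sketch; the dataset list follows the earlier examples):

:; for FS in usr usr/local var opt ; do \
       zfs set compression=gzip-9 "$RPOOL/ROOT/$BENEW/$FS" ; \
   done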
This took care of proper compression, and maybe other customizations, for the data which will be written during the upgrade.
Now you can update the new BE and retain the savings thanks to your chosen compression rate, and it should go along these lines:
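A sketch along those lines (option spellings vary somewhat between pkg releases, so consult pkg(1); since the image is manipulated via -R, pkg should not try to spawn yet another BE):

:; mkdir -p "$BENEW_MNT" && beadm mount "$BENEW" "$BENEW_MNT"
:; pkg -R "$BENEW_MNT" image-update
:; bootadm update-archive -R "$BENEW_MNT"
:; beadm umount "$BENEW"
:; beadm activate "$BENEW"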
Don't forget to verify (or just redo) the copying of
/sbin/sh and its related libraries, especially if they have changed, revise the patched filesystem method scripts and other customizations discussed above (as well as others you do on your systems).