Comment: Explanation of dataset "mountpoint" and "canmount" requirements for the split-root solution

...

  • One problem for the split-root setup (if you want to separate out the /usr filesystem) is that OpenIndiana brings /sbin/sh as a symlink to ../usr/bin/i86/ksh93. Absence of the system shell (due to not-yet-mounted /usr) causes init to loop and fail early in OS boot.
    When doing the split you must copy the ksh93 binary and some libraries that it depends on from /usr namespace into the root dataset (/sbin and /lib accordingly), and fix the /sbin/sh symlink. The specific steps are detailed below, and may have to be repeated after system updates (in case the shell or libraries are updated in some incompatible fashion).

    Note

    My earlier research-posts suggested replacement of /sbin/sh with bash; however, this has the drawback that the two shells are slightly different in syntax, and several SMF methods need to be adjusted. We have to live with it now – ksh93 is the default system shell, it just happens to be inconveniently provided in a non-systematic fashion. Different delivery of ksh93 and the libraries it needs is worthy of an RFE for packagers (tracked as issue #4351).

  • Another (rather cosmetic) issue is that many other programs are absent in the minimized root without /usr, ranging from df, ls, less and cat to svc* SMF-management commands, vi and so on. I find it convenient to also copy bash and some of the above commands from /usr/bin into /sbin, though this is not strictly required for system operation – it just makes repairs easier (wink) 

  • A much more serious consequence of the absence of programs from /usr is that some SMF method scripts which initialize the system up to the "single-user milestone", including both the default and nwam implementations of svc:/network/physical, rely on some programs from /usr. The rationale is that network-booted miniroot images carry the needed files, and disk-based roots are expected to be "monolithic". It is possible to fix some of those methods (except NWAM in the default setup, at least), but a more reliable and less invasive solution is to mount the local ZFS components of the root filesystem hierarchy (and thus guarantee availability of a proper /usr) before other methods are executed. This is detailed below as the svc:/system/filesystem/root-zfs:default service with the fs-root-zfs script as its method.
    NOTE for readers of earlier versions of the document: this script builds on my earlier customizations of the previously existing filesystem methods; now these legacy scripts don't need many modifications (I only added the needed checks for whether a filesystem has already been mounted).

  • Separation of /var/tmp into a shared dataset did not work for me, at least some time in the past (before the new fs-root-zfs service): some existing services start before filesystem/minimal completes (which mounts such datasets), and either the /var/tmp dataset cannot mount into a non-empty mountpoint, or (if -O is used for overlay mount) some programs can't find the temporary files which they expect.
    It is possible that with the introduction of fs-root-zfs this would work correctly, but this is not thoroughly tested yet.

  • Likewise, separation of the /root home directory did not work well: in case of system repairs it might not be mounted at all and things get interesting (wink)
    It may suffice to mount a sub-directory under /root from a dataset in the shared hierarchy, and store larger files there, or just make an rpool/export/home/root and symlink to it from under /root (with the latter being individual to each BE).

  • Cloning BEs with beadm currently does not replicate the original datasets' "local" ZFS attributes, such as compression or quota or (ref)reservation. If you use pkg image-update to create a new BE and update the OS image inside it, you're in for a surprise: newly written data won't be compressed as you expected it to be, because it will inherit compression settings from rpool/ROOT (uncompressed or LZ4 are likely candidates). While fixing this behaviour in beadm is a worthy RFE as well (issue numbers #4355 for pkg and #3569 for beadm and zfs), currently you should work around this by creating the new BE manually, re-applying the (compression) settings to the non-boot datasets (such as /usr), mounting the new BE, and providing the mountpoint to pkg commands. An example is detailed below.
    Note that the bootable dataset (such as rpool/ROOT/oi_151a8) must keep settings which are compatible with your GRUB's bootfs support (uncompressed until recently, or lz4 more recently).

  • Finally, proper mounting of hierarchical roots requires modifications to some system SMF methods. Patches and complete scripts are provided along with this article, though I hope that one day they will be integrated into illumos-gate or OI distribution (issue number #4352), and manual tweaks on individual systems will no longer be required.
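
The beadm workaround from the list above (re-applying "local" ZFS attributes to the non-boot datasets of a manually created BE) might be sketched roughly as follows. This is only an illustration under assumptions, not the procedure from this article: the helper name gen_zfs_set_cmds and the new BE name oi_151a9 are hypothetical, and the helper merely prints zfs set commands for review instead of running them. It expects input in the tab-separated form printed by zfs get -H -s local -o name,property,value.

```sh
# Hypothetical helper (not part of beadm or of the scripts in this article).
# Reads tab-separated "dataset<TAB>property<TAB>value" lines on stdin, e.g. from:
#   zfs get -H -r -s local -o name,property,value compression,quota rpool/ROOT/oi_151a8
# and prints matching "zfs set" commands for the same child datasets of a new BE.
# The bootable root dataset itself is skipped so it keeps GRUB-compatible settings.
gen_zfs_set_cmds() {
    old_be="$1"
    new_be="$2"
    tab=$(printf '\t')
    while IFS="$tab" read -r name prop value; do
        case "$name" in
            "$old_be"/*)
                # Child dataset: swap the old BE prefix for the new one.
                printf 'zfs set %s=%s %s\n' "$prop" "$value" "$new_be${name#"$old_be"}"
                ;;
            *)
                # The BE root itself, or an unrelated dataset: leave it alone.
                ;;
        esac
    done
}
```

Piping the zfs get output shown in the comment into gen_zfs_set_cmds rpool/ROOT/oi_151a8 rpool/ROOT/oi_151a9 would print one zfs set line per locally set attribute of each child dataset; these can be reviewed and run before mounting the new BE and passing its mountpoint to pkg.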

...

The patched fs-root script (earlier) or the replacement fs-root-zfs script (later) introduces optional console logging (enabled by touching /.debug_mnt in the root of a BE). It enhances the case of ZFS-mounted root and usr filesystems by making sure that the mountpoints of all child datasets of the root filesystem are root-based and not something like /a/usr, and it mounts /usr in overlay mode (zfs mount -O, which takes care of issue number #997 at least for the rootfs components). Too often have mischiefs like these two left an updated system unbootable and remotely inaccessible. The script also verifies that the mounted filesystem is "sane" (a /usr/bin directory exists) and, with that in place, restarts (if online) or clears (if in maintenance state) the networking SMF services svc:/network/physical:default or svc:/network/physical:nwam, and svc:/network/iptun:default. The SMF method scripts for the latter rely on /usr, and these services are dependencies of filesystem/root (see issue number #4361). Doing the service restart after making sure /usr is available seems like the "cleanest" and most effective solution.
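
Two of the checks described above can be illustrated with a small sketch. It is not taken from the fs-root-zfs script: the function names rebase_mountpoint and usr_is_sane are hypothetical, and stripping a known alternate-root prefix (such as /a) is only one assumed way to arrive at root-based mountpoints.

```sh
# Hypothetical helpers illustrating the checks; not code from fs-root-zfs.

# Print a root-based mountpoint: strip an alternate-root prefix such as /a
# if the mountpoint value starts with it, otherwise print the value unchanged.
rebase_mountpoint() {
    altroot="${1%/}"
    mp="$2"
    case "$mp" in
        "$altroot")   echo "/" ;;
        "$altroot"/*) echo "${mp#"$altroot"}" ;;
        *)            echo "$mp" ;;
    esac
}

# Succeed only if the filesystem tree under the given root looks "sane",
# which here means that a usr/bin directory exists in it.
usr_is_sane() {
    [ -d "${1%/}/usr/bin" ]
}
```

With these, rebase_mountpoint /a /a/usr prints /usr, and a mountpoint that is already root-based passes through unchanged. A real method script would then apply the result with zfs set mountpoint=... before mounting.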

The fs-usr script deals with setup of swap and dump, and the patch is minor (verify that dumpadm exists, in case sanity of /usr was previously overestimated). For non-ZFS root filesystems in global zone, the script takes care of re-mounting the / and /usr filesystems read-write according to /etc/vfstab, and does some other tasks.
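
The kind of guard added by the fs-usr patch can be sketched as below. This is an assumption about its general shape rather than the patch itself: run_if_present is a hypothetical helper, and the dumpadm path in the comment is assumed.

```sh
# Hypothetical guard, not the actual fs-usr patch: run a program only if it
# exists and is executable, otherwise report on stderr and carry on.
run_if_present() {
    prog="$1"
    shift
    if [ -x "$prog" ]; then
        "$prog" "$@"
    else
        echo "skipping: $prog not found" >&2
        return 0
    fi
}

# Example call (path assumed): run_if_present /usr/sbin/dumpadm -u
```

Returning success when the program is missing keeps the method from failing just because the sanity of /usr was overestimated.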

...

While the described patches (see fs-root-zfs.patch for the new solution, or reference fs-splitroot-fix.patch for the earlier solution) are not strictly required (i.e. things can work if you are super-careful about empty mountpoint directories and proper mountpoint attribute values, and the system does not reboot unexpectedly or by your mistake while you are in mid-procedure, or if you use legacy mountpoints and fix up /etc/vfstab in each new BE), they do greatly increase the chances of successful and correct boot-ups in the general case with dynamically-used boot environments, shared datasets and occasional untimely reboots. Also, some networking initialization scripts (notably NWAM) do expect /usr and maybe even /var to be mounted before they run, and the existing filesystem methods (which would mount /usr) happen to depend on them. However, physical:default does run successfully (most of the time, missing just the cut command, which can be replaced by a ksh93 builtin implementation).

Specifying which bootfs children or shared datasets to mount

There are several ways to specify which datasets should be mounted as part of the dedicated or shared split-root hierarchy. In the context of the descriptions below, the "bootfs children" are filesystem datasets contained within the root filesystem instance requested for the current boot via GRUB (explicitly, or defaulting to the value of the ZFS pool's bootfs attribute).

  • "Legacy" filesystem datasets with mountpoint=legacy which are explicitly specified in the /etc/vfstab file inside this bootfs. This allows passing mount-time options (such as the overlay mount, before it was enforced by the fixed fs-* scripts):

    Code Block
    rpool/ROOT/oi_151a8/usr      -       /usr            zfs     -       no      - 
    rpool/SHARED/var/adm         -       /var/adm        zfs     -       yes     - 

    A drawback of this method for bootfs children is that the file must be updated after each cloning or renaming of the boot environment to match the actual ZFS dataset full name.

  • For bootfs children with specified mountpoint paths (and, for the new fs-root-zfs method, a canmount value other than "off"), mounting happens automatically: for /usr as a step in filesystem/root service, for others as a step in filesystem/minimal service.
    Typically the bootfs children specify canmount=noauto, because after BE cloning the rpool would provide multiple datasets with the same mountpoints, causing errors (conflicts) of automatic mounts during pool imports.
    NOTE: Specifying canmount=off for such datasets with un-fixed old service method implementations in place would log errors due to inability to zfs mount such datasets; however, for datasets other than /usr, the return codes are not checked, so this should not cause boot failures.
    • The filesystem methods can use /etc/vfstab to locate over a dozen paths for mounting (backed by any of the supported filesystem types), many of which are not used in the default installations. Those which might be used in practice with ZFS include /usr, /var, /var/adm and /tmp; these blocks in the method scripts also include logic to mount such child datasets of the current bootfs if they exist and a corresponding path was not explicitly specified in /etc/vfstab.
      Extensions added by me into the fixed scripts (earlier solution), or provided as the new fs-root-zfs method, also allow mounting such paths (except /usr and /var) from a number of other locations as "shared" datasets, if they were not found as children of the current bootfs.

  • For possibly "shared" datasets, other than the explicitly specified short list (above), the legacy filesystem methods only offer the call to "zfs mount -a" from filesystem/local (way after the "single-user" milestone). This implies specified (non-"legacy") mountpoint paths and canmount=on; other datasets are not mounted automatically.
    Extensions provided as the new fs-root-zfs method allow mounting datasets with such attribute values from $rpool/SHARED (where the $rpool name is determined from the currently mounted root filesystem dataset). This ensures availability of active shared datasets as part of the split-root filesystem hierarchy early in boot. In particular, following the "auto-mounting" requirements allows using datasets with a specified mountpoint path and canmount=off as "containers", so that the shared datasets inherit the parent container's path automatically (i.e. a non-mounting /var node).
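
The attribute rules from the list above can be condensed into a small decision sketch. It is an illustration only and not code from the fs-root-zfs method: the function names pool_of_dataset and should_automount and the "child"/"shared" labels are hypothetical, and the rules encoded are just those stated above (a real mountpoint path is required; bootfs children need a canmount value other than "off", shared datasets need canmount=on).

```sh
# Hypothetical helpers summarizing the rules described above.

# Print the pool name of a dataset, i.e. everything before the first "/".
pool_of_dataset() {
    echo "${1%%/*}"
}

# Decide whether a dataset qualifies for automatic mounting.
#   $1 = "child" (bootfs child) or "shared" (dataset under $rpool/SHARED)
#   $2 = value of the mountpoint attribute
#   $3 = value of the canmount attribute
should_automount() {
    kind="$1"
    mountpoint="$2"
    canmount="$3"
    case "$mountpoint" in
        /*) ;;              # a specified path
        *)  return 1 ;;     # "legacy", "none" or anything else
    esac
    case "$kind" in
        child)  [ "$canmount" != "off" ] ;;
        shared) [ "$canmount" = "on" ] ;;
        *)      return 2 ;;
    esac
}
```

For example, should_automount child /usr noauto succeeds, while should_automount shared /var off fails, which matches the role of such a dataset as a non-mounting container. In practice the attribute values would come from something like zfs get -H -o value mountpoint,canmount for each dataset, and the shared base would be "$(pool_of_dataset "$rootds")/SHARED".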

Below you can find a screenshot with examples of the non-legacy datasets, both children of the root and shared ones. There is no example of a "legacy" dataset passed through /etc/vfstab because I can't contrive a rational case where that would be useful today (smile) 

Examples?

The examples below assume that your currently installed and configured OS resides in rpool/ROOT/openindiana and you want to relocate it into rpool/ROOT/oi_151a8 with a hierarchy of compressed sub-datasets for system files (examples below use variables to allow easy upgrades of the procedure to different realities), and shared files like logs and crash dumps will reside in a hierarchy under rpool/SHARED.

...

For the oi_151a8 release and several releases before it, the system-provided scripts did not change, so the full scripts can be the easier choice to download: fs-root-zfs, fs-root and fs-minimal. As described above, the fs-root-zfs script includes all the logic needed to detect and mount the local ZFS-based root filesystem hierarchy (and skips any non-ZFS filesystems and mountpoints under them), and the existing method scripts are just slightly fixed to expect that the paths they try to manage may have already been mounted. Also, unlike the earlier existing scripts, the fs-root-zfs script explicitly mounts the shared datasets ($rpool/SHARED) early in the system initialization to provide the complete root filesystem hierarchy to other methods, such as network initialization scripts.

For other releases and distributions it may be worthwhile to get the patches as fs-root-zfs.patch and apply them.

...

Note

It was recently discovered that NWAM network auto-configuration does not work with a split-root config based on earlier modifications of the fs-root, fs-usr and fs-minimal scripts (hopefully fixed with the recent overhaul to fs-root-zfs as the single solution for this use-case).

Tracing the system scripts has shown that a substantial part of them depends on availability of /usr or even more (in case of NWAM – rather on filesystem/minimal with a proper /var tree), yet services like network/physical are dependencies needed for startup of filesystem/root (which mounts and guarantees to provide the /usr). Most of the methods "broken" in this manner can be amended to use ksh93 builtins and shell constructs instead of external programs and rely only on /sbin (after relocation of ksh93 as /sbin/sh); other solutions are also possible and are now being discussed in the mailing list and the issue tracker. The legacy network method "for servers" (svc:/network/physical:default) happens to work successfully with both static configurations and DHCP, that's why the error was not found for years (wink)
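
As an illustration of replacing an external program with shell constructs, the sketch below extracts colon-separated fields without calling cut. The function names and the sample input are hypothetical and not taken from any SMF method; only standard parameter expansion is used, which ksh93 supports.

```sh
# Hypothetical examples; for a single line of input they behave like the
# cut invocations named in the comments, but need nothing from /usr.

# Like: echo "$1" | cut -d: -f1
first_field() {
    printf '%s\n' "${1%%:*}"
}

# Like: echo "$1" | cut -d: -f2
second_field() {
    rest="${1#*:}"              # drop everything up to the first ":"
    printf '%s\n' "${rest%%:*}" # keep everything before the next ":"
}
```

For the sample string net0:dhcp:up these print net0 and dhcp respectively, and a string without any colon is printed whole, as cut would do.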