ZFS is a highly innovative file system first implemented in Sun Solaris and later ported to many other operating systems, including FreeBSD, NetBSD, Linux, and Mac OS X. OpenZFS is the main ZFS development project.
The innovation in ZFS comes from combining a file system and a volume manager and making it a fully 128-bit filesystem, which gives a theoretical maximum capacity of 256 quadrillion zettabytes of storage.
ZFS brings a new way of thinking about filesystems and storage. Physical hard drives are grouped into a ZFS pool (called a zpool). Think of it as a bunch of disks grouped together to offer their full capacity for use. To the system administrator and users, a pool appears as contiguous disk space, hiding the real physical layout. It can consist of a single disk or of many disks. The pool can be grown by adding new disks, and the new capacity becomes available immediately to both the pool and the filesystems it contains. The storage administrator no longer needs to plan filesystem capacity: it is allocated dynamically from the combined capacity of all disks in the pool, following the growth of the data stored in the filesystem.
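As a minimal sketch of how a pool is created and grown, assuming FreeBSD-style device names (the pool name `tank` and the disk devices are hypothetical):

```shell
# Create a pool from a single disk.
zpool create tank /dev/da0

# Later, grow the pool by adding another disk; the extra
# capacity is usable immediately, with no resize step.
zpool add tank /dev/da1

# Verify the new size.
zpool list tank
```

Note that `zpool add` permanently extends the pool with a new vdev; the command cannot be undone, so it is worth double-checking the device names first.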
Below is a simple graphical representation of a ZFS pool.
In this image the ZFS pool consists of two virtual devices, each made of two physical disks. The terminology will be explained in the next chapters.
Before you start building a new pool, it is recommended to look at best practices (to be written, but it is always a good idea to consult the ZFS Evil Tuning Guide, and don't forget to visit readme1st). ZFS filesystems can be created on the fly within a pool, and at first sight they look like directories. A filesystem can contain other filesystems, their size is not predetermined at creation time, and they are cheap to create. In fact, the hierarchy of a pool and the filesystems it contains is presented by the system just like a typical directory hierarchy. For example, filesystems for users' home directories might look like the following:
In the above, each directory is a filesystem in its own right.
This will be described more thoroughly in the next chapters.
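As a sketch, creating such a hierarchy might look like this (the pool name `tank` and the user names are hypothetical):

```shell
# Create a filesystem for home directories, then one per user.
# Each is a full ZFS filesystem, created in a fraction of a second.
zfs create tank/home
zfs create tank/home/alice
zfs create tank/home/bob

# The filesystems appear in the directory tree like ordinary
# directories, e.g. /tank/home/alice.
zfs list -r tank/home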
ZFS was created with data security and integrity as its main concern. It uses transactional semantics: each change to data on the filesystem (including system metadata) is grouped into what is called a transaction group, and the group is committed to the pool in an atomic operation. This means a change either happens or it doesn't, leaving the filesystem in a consistent state even in the event of sudden power loss. ZFS also uses copy-on-write (COW) semantics, which means that whenever a block of data is modified, the new version is written to free space, leaving the old block unaffected. Only after the data has been written to disk are the pointers updated to reflect the change, eliminating the risk of a partially overwritten data block. When free space on disk runs low, old, unused blocks are reclaimed for use.
The mechanism above allows for a cheap and fast snapshot and clone mechanism. Since old blocks are left unaffected, creating a snapshot of a filesystem means keeping the old list of referenced data blocks, which allows for instant snapshot creation. Snapshots are read-only and can be mounted like any other ZFS filesystem. They take only as much real disk space as the delta of data blocks between the live filesystem and the snapshot itself. For example, say you have a filesystem with 50 GB of data; if 1 GB is modified after creating a snapshot, the snapshot will take 1 GB of disk space. At least one "go back in time" solution, called Time Slider, was based on this in the OpenSolaris and later OpenIndiana distributions. A clone is a snapshot promoted to a full filesystem in its own right, with read-write capability, but it still takes only as much space as a normal snapshot. OpenIndiana Boot Environments are based on this capability, and the beadm tool was ported for use with FreeBSD (read this post on setting up FreeBSD to use ZFS as a root filesystem and use boot environments).
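A minimal sketch of the snapshot and clone workflow described above (the dataset and snapshot names are hypothetical):

```shell
# Take an instant, read-only snapshot of a filesystem.
zfs snapshot tank/home/alice@before-upgrade

# Snapshot contents are browsable under the hidden .zfs directory:
#   /tank/home/alice/.zfs/snapshot/before-upgrade

# Create a writable clone from the snapshot; it shares blocks with
# the snapshot, so it initially consumes almost no extra space.
zfs clone tank/home/alice@before-upgrade tank/home/alice-test

# Or roll the live filesystem back to the snapshot state.
zfs rollback tank/home/alice@before-upgrade
```

The clone remains dependent on its origin snapshot; `zfs promote` can reverse that dependency if the clone is to outlive the original.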
To better ensure data health, ZFS checksums both data and metadata, offering a few algorithms that can be set administratively. Unlike hardware RAID solutions, checksumming is done at the filesystem level, which helps reduce the risk of bit rot. A few levels of redundancy are offered: RAIDZ (a rough equivalent of RAID5), RAIDZ2 (an equivalent of RAID6), mirrors, and triple mirrors. NOTE: the desired level of redundancy needs to be specified during pool creation. This applies specifically to RAIDZ and RAIDZ2; mirrors can be created later by attaching new disks to the pool, creating mirror virtual devices (vdevs). By default, creating a pool or vdev from many disks will create a stripe (the equivalent of RAID0), which offers no redundancy at all.
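As a sketch of how the redundancy level is chosen at creation time (device names are hypothetical, and each `zpool create` below is an alternative, since a pool name must be unique):

```shell
# RAIDZ vdev of three disks (single parity, roughly RAID5).
zpool create tank raidz /dev/da0 /dev/da1 /dev/da2

# Two-way mirror vdev.
zpool create tank mirror /dev/da0 /dev/da1

# Attach a disk to an existing single-disk vdev, turning
# it into a mirror after the fact.
zpool attach tank /dev/da0 /dev/da1

# CAUTION: listing disks with no redundancy keyword creates
# a stripe (RAID0) with no redundancy at all.
zpool create tank /dev/da0 /dev/da1
```

The checksum algorithm can likewise be set per filesystem, e.g. `zfs set checksum=sha256 tank`.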
There are numerous other features that you can find useful and they will be outlined in later chapters.