2005-12-18

reiser4: 0. introduction

reiser4: 0. introduction

This is an introduction to the series of entries I am going to write about reiser4 file system. The plan, for the time being, goes like this:
  • 0. introduction
    • 0. trivia
    • 1. reiser4 strong points
    • 2. architecture overview
  • 1. internal tree
    • 0. B-trees overview
    • 1. lookup and modification
    • 2. concurrency control: lock ordering, hi-lo ordering
    • 3. eottl
    • 4. node format
  • 2. tree clients
    • 0. regular file
    • 1. tail to extent conversion
  • 3. transaction engine
    • 0. goals: atomicity, no isolation, durability
    • 1. log structure
    • 2. shadow updates
  • 4. directory code
    • 0. overview of hashed directories
    • 1. key structure and its motivation
    • 2. lookup readdir
    • 3. NFS?
  • 5. delayed allocation and the flush algorithm
    • 0. delayed allocation motivation
    • 1. flush algorithm
    • 2. overview of interaction with VM
    • 3. emergency flush
  • 6. round up

0. trivia

Reiser4 is a relatively new file system for Linux developed by namesys. Its development started early in 2001, and currently reiser4 is included into -mm patch series. Reiser4 is a next version of well-known reiserfs v3 file system (also known as simply reiserfs) that was first and for a long time the only journalled file system for Linux. While reiser4 is based on the same basic architectural ideas as reiserfs v3 it was written completely from scratch and its on-disk format is not compatible with reiserfs. Note on naming: namesys insists that file system is referred to as reiser4 and not reiserfs v4, or reiser4fs, or reiserfs4, or r4fs. Development of reiser4 was sponsored by DARPA, take a look at the namesys front page for a legal statement, and a list of other sponsors.

Technically speaking reiser4 can be briefly discussed as a file system using

  • global shared balanced lazily compacted tree to store all data and meta-data
  • bitmaps for block allocation
  • redo-only WAL (write ahead logging) with shadows updates (called "wandered logs" in reiser4)
  • delayed allocation in parent-first tree ordering

I shall (hopefully) describe these topics in more details in the future entries.

One last trivia point: I was amongst reiser4 developers from the very beginning of this project. At the mid-2004 I left namesys, at that point reiser4 was in more or less debugged state, and no significant design changes occurred since.

1. reiser4 strong points

Together with the unique design, reiser4 inherited reiserfs' strengths and weaknesses:

  • efficient support for very small files: multiple small files can be stored in the same disk block. This saves disk space, and much more importantly IO traffic between secondary and primary storage;
  • support for very large directories: due to balanced tree (to be described later), file system can handle directories with hundreds of million of files. The utility of this (and small files support) is often questioned, because "nobody uses file system in that way". Reiser4 design is based on the assumption, that cause-and-effect goes in the opposite direction here: applications do not use large directories and small files precisely because existing file system didn't provide efficient means for this. Instead, application developers had to resort to various hacks from configuration files (that have obvious hierarchical structure asking for being mapped onto file system), to artificially splitting large directory into smaller ones (look at /usr/share/terminfo/ directory as an example);
  • no hard limit on the number of objects on the file system. Traditional UNIX file systems (s5fs,ffs,ufs,ext2,ext3) fix total number of objects that can exist on a file system during file system creation. Correct estimation of this number is a tricky business. Reiser4, on the other hand, creates on-disk inodes (called "stat data" for historical reasons) dynamically;
  • previous item can be seen as a weakness, because it means that reiser4 on-disk format is more complex than that of its brethren. Other advanced reiser4 features also come not without an increase of format complexity. As a result, reiser4 can be corrupted in much more complicated ways than other file systems, requiring quite sophisticated fsck.

2. architecture overview

As was already hinted above reiser4 is quite different from other file systems architecturally. The most important difference, perhaps, is that reiser4 introduces another layer ("storage layer") between file system and block device. This layer provides an abstraction of persistent "container" into which pieces of data (called "units") can be placed and later retrieved. Units are identified by "keys": when unit is placed into container a key is given, and unit can later be retrieved by providing its key. Unit key is just a fixed-length sequence of bits.

In reiser4 this container abstraction is implemented in a form of special flavor of B-tree. There is one instance of such tree (referred to as "internal tree") per file system.

Entities from which file system is composed (regular files, directories, symlinks, on-disk inodes, etc.) are stored in the internal tree as sequences of units.

In addition to presenting container abstraction, storage layer is responsible for block allocation. That is, as units are inserted and deleted, tree grows and shrinks, taking blocks from block allocator and returning them back. Block allocation policy implemented by storage layer is very simple:

  • units with close keys are kept as close on disk as possible, and
  • units should be placed in the disk blocks so that key ordering corresponds to the block ordering on disk.

This allows higher layers of the system to influence location of units on disk by selecting unit keys appropriately.

While this doesn't sound especially exciting, key-based block allocation allows reiser4 to implement one extremely important feature: global block allocation policies. For example, by assigning to units of all objects in a given directory keys with the same prefix, higher layer code can assure that all objects located in this directory will be located close to each other on the disk. Moreover, by selecting next few bits of key based on --say-- file name it is possible to arrange all objects in the directory in the same order as readdir returns their names, so that commands like

$ cp -a * /somewhere

will not cause a single seek during read. More on this in the section on "delayed allocation and the flush algorithm".

So the highest layer in reiser4 implements file system semantics using storage layer as a container. Storage layer uses block allocator to manage disk space and transaction engine to keep file system consistent in case of a crash. Transaction layer talks to block device layer to send data to the storage.

To Be Continued by: internal tree.

6 comments:

  1. What about snapshots like in WAFL or zfs?
    Any chance of them being implemented?

    ReplyDelete
  2. Where are you working these days? and what on?

    I've been following Reiser4 for a while - very interested in the architecture of it.

    ReplyDelete
  3. Stewart,

    I am working for CFS, on lustre file system (of course, file system :-)).

    ReplyDelete
  4. What about snapshots like in WAFL or zfs?
    Any chance of them being implemented?


    As I said, I am no longer reiser4 developer, so I don't know for sure what are namesys internal plans concerning this or that feature. There was some talk about implementing snapshots while I was at namesys, but it didn't go any far. Technically speaking, this shouldn't be hard. I should mention this in the section on wandering logs.

    ReplyDelete
  5. Hi , do you have any idea about how to register new softirq ?
    thanks in advance,
    Ram.

    ReplyDelete
  6. Nikita,
    I need to reference in a citation.
    Can you send me your name and surname if you want me to do that? Otherwise I will just use nikita.

    Thank you
    Rodolpho

    ReplyDelete