a very occasional diary.




File system replacement algorithms.

1. Introduction.

This article describes results of a bit of research I made on efficiency of file system cache replacement algorithms. As everybody knows file system tries to keep some portions of its backing secondary storage (disk) cached in the primary storage (memory). In particular portion of file system data or meta-data that is being accessed right now has to be present in the memory, because kernel cannot directly manipulate data on the disk. Replacement problem occurs when this cache grows too large (which it tends to do sooner or later, as secondary storage is usually much larger than memory). When cache cannot grow anymore, and new data have to be accessed, something has to be thrown out of the cache. The selection of cache item to be removed is made by replacement algorithm. Choice of this algorithm affects file system performance, because if/when item removed from cache is re-accessed, file system has to wait for disk IO to read it again. Replacement algorithms try to minimize this overhead, by using various statistical information and trying to predict future accesses. The very possibility of such predictions is based on assumed regularities in file system accesses, most notable being locality of reference and access patterns, such as sequential file scanning. Research in replacement algorithms is as old as file systems, but few last decades in was mainly done in the context of data base engines (that also cache part of database in the memory). The reason for this is that in most modern operating system kernels file system caching was unified with VM caching. Perceived advantages of this unification are:
  • simplification,
  • support of mmap(2) style accesses to file system, and
  • sharing of memory between file system and VM, without imposing artificial limits on size of each cache.
The latter goal was never fully achieved: all existing implementations treat file system and VM pages differently to some extent, because system performance is miserable otherwise. This hints that access patterns are so different for file system and VM pages, that unified caching is not necessary the best solution. Another problem with unified caching is that the same replacement algorithm has to be used, and this basically means that replacement algorithm may rely only on information available for both file system and VM pages to make its decisions. As a set of attributes that file system can maintain for its pages is a strict superset of attributes maintained for VM pages, unified replacement algorithm degenerates into a VM replacement algorithm and all additional hints that file system can provide are thrown away. Experiments described below are an attempt to determine whether separate replacement algorithm can improve file system performance. To measure efficiency of replacement algorithms file system traces were collected, and then processed by a user-level simulator. Trace is a sequence of records each describing a particular access to some page of a file (see details below). User level simulator processes trace record by record, emulating both memory and secondary storage, keeping track of currently cached pages, and performing replacement decisions when necessary (also, see below for details).

2. Tracing method.

An important point in collecting file system traces is to make them independent from caching and replacement algorithms implemented in the kernel being traced. This, for example, rules out placing tracepoints at the level of struct address_space_operations (or below), because code at that level is invoked by VM, and as a result sequence of event depends on algorithms used by VM. On the other hand, placing tracepoints at the syscall/VFS level is not very convenient either, because, while usable for tracing access to data pages (read/write/truncate), it is not adequate for tracing meta-data. The only remaining possibility is to put tracepoints into file system driver, but that would require changing all file systems. As a compromise, tracepoints were placed in mm/filemap.c, mm/readahead.c and mm/truncate.c "generic" code that is used by many file systems:
intercepts reads
intercepts page faults on file system pages
intercepts writes
intercepts truncates
intercepts readahead
All tracepoints look exactly the same:
 fslog(mapping, index, page, opcode);
Where mapping is an object (inode) on which operation is preformed, index is a logical offset of accessed page within object, page is a page itself, if present, NULL otherwise, and opcode is operation code, one of
enum fslog_rec_type
enum fslog_rec_type {
        FSLOG_READ,    /* read */
        FSLOG_RA,      /* read-ahead */
        FSLOG_WRITE,   /* write */
        FSLOG_PFAULT,  /* page-fault */
        FSLOG_PUNCH    /* page truncate */
fslog() packs arguments into standard event record:
struct fslog_record
struct fslog_record {
        __u32 fr_no;       /* monotonically increasing event counter.
                              Used to detect lost events. */
        __u32 fr_time;     /* event time in microseconds. */

        __u32 fr_dev;      /* major:minor of device, where file is located. */
        __u32 fr_ino;      /* file inode number */

        __u32 fr_gen;      /* file inode generation. */
        __u32 fr_index;    /* logical index of accessed page. */

        __u16 fr_pid;      /* pid of process. */
        __u8  fr_type;     /* access type, taken from enum fslog_rec_type. */
        __u8  fr_bits;     /* miscellaneous page bits. */
        __u32 fr_pad;      /* alignment padding. */

        char  fr_comm[16]; /* "process name". */
        char  fr_name[16]; /* "file name". */
That might looks like awfully excessive amount of information, but it's necessary to reliably identify unique pages and files, as will be shown below. Once record is composed, it is exported to the user space through relayfs. Tracing patch for 2.6.18-rc2 kernel is available here. Continued.

Follow by Email