1. Introduction.
This article describes the results of some research I did on the efficiency of file system cache replacement algorithms. As everybody knows, a file system tries to keep some portion of its backing secondary storage (disk) cached in primary storage (memory). In particular, the portion of file system data or meta-data that is being accessed right now has to be present in memory, because the kernel cannot directly manipulate data on the disk. The replacement problem occurs when this cache grows too large (which it tends to do sooner or later, as secondary storage is usually much larger than memory). When the cache cannot grow any more and new data have to be accessed, something has to be thrown out of the cache. The selection of the cache item to be removed is made by the replacement algorithm (a minimal sketch of the classical LRU policy follows the list below). The choice of this algorithm affects file system performance, because if and when an item removed from the cache is re-accessed, the file system has to wait for disk IO to read it back in. Replacement algorithms try to minimize this overhead by using various statistical information to predict future accesses. The very possibility of such prediction rests on assumed regularities in file system accesses, the most notable being locality of reference and access patterns such as sequential file scanning.

Research in replacement algorithms is as old as file systems, but in the last few decades it was mainly done in the context of database engines (which also cache part of the database in memory). The reason for this is that in most modern operating system kernels file system caching was unified with VM caching. Perceived advantages of this unification are:
- simplification,
- support of mmap(2)-style accesses to the file system, and
- sharing of memory between the file system and VM, without imposing artificial limits on the size of each cache.
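To make the replacement problem concrete, here is a minimal user-space sketch of LRU, the classical policy against which replacement algorithms are usually compared: every access moves an item to the "recently used" end of a list, and the victim is taken from the opposite end. This is purely illustrative C; the names (lru_list, lru_touch, lru_evict) are mine, and a real cache would add a hash table for lookup and locking:

    #include <stdio.h>

    /* One cache entry; a real page cache would key this by
     * (inode, page index). */
    struct cache_item {
            long               key;
            struct cache_item *prev;   /* toward the MRU end */
            struct cache_item *next;   /* toward the LRU end */
    };

    struct lru_list {
            struct cache_item *mru;    /* most recently used */
            struct cache_item *lru;    /* least recently used: next victim */
    };

    /* On every access, move (or insert) the item at the MRU end. */
    static void lru_touch(struct lru_list *l, struct cache_item *it)
    {
            if (l->mru == it)
                    return;
            if (it->prev != NULL)
                    it->prev->next = it->next;
            if (it->next != NULL)
                    it->next->prev = it->prev;
            if (l->lru == it)
                    l->lru = it->prev;
            it->prev = NULL;
            it->next = l->mru;
            if (l->mru != NULL)
                    l->mru->prev = it;
            l->mru = it;
            if (l->lru == NULL)
                    l->lru = it;
    }

    /* When the cache is full, the item at the LRU end is thrown out. */
    static struct cache_item *lru_evict(struct lru_list *l)
    {
            struct cache_item *victim = l->lru;

            if (victim != NULL) {
                    l->lru = victim->prev;
                    if (l->lru != NULL)
                            l->lru->next = NULL;
                    else
                            l->mru = NULL;
                    victim->prev = victim->next = NULL;
            }
            return victim;
    }

    int main(void)
    {
            struct lru_list l = { NULL, NULL };
            struct cache_item a = { 1 }, b = { 2 }, c = { 3 };

            lru_touch(&l, &a);          /* access order: a, b, c, a */
            lru_touch(&l, &b);
            lru_touch(&l, &c);
            lru_touch(&l, &a);
            /* b is now least recently used, so it is evicted first */
            printf("victim: %ld\n", lru_evict(&l)->key);   /* prints 2 */
            return 0;
    }

Real kernels tend to use approximations of LRU (reference bits and CLOCK-style scanning over multiple lists) to avoid manipulating a global list on every access.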
2. Tracing method.
An important point in collecting file system traces is to make them independent of the caching and replacement algorithms implemented in the kernel being traced. This, for example, rules out placing tracepoints at the level of struct address_space_operations (or below), because code at that level is invoked by the VM, and as a result the sequence of events depends on the algorithms the VM uses. On the other hand, placing tracepoints at the syscall/VFS level is not very convenient either: while usable for tracing access to data pages (read/write/truncate), it is not adequate for tracing meta-data. The only remaining possibility is to put tracepoints into the file system driver itself, but that would require changing every file system. As a compromise, tracepoints were placed in the "generic" code in mm/filemap.c, mm/readahead.c and mm/truncate.c that is used by many file systems:
- do_generic_mapping_read(): intercepts reads
- filemap_nopage(): intercepts page faults on file system pages
- __grab_cache_page(): intercepts writes
- truncate_inode_pages_range(): intercepts truncates
- __do_page_cache_readahead(): intercepts readahead
Each tracepoint is a call of the form:

    fslog(mapping, index, page, opcode);
where opcode identifies the access type:

    enum fslog_rec_type {
            FSLOG_READ,   /* read */
            FSLOG_RA,     /* read-ahead */
            FSLOG_WRITE,  /* write */
            FSLOG_PFAULT, /* page-fault */
            FSLOG_PUNCH   /* page truncate */
    };
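The text does not spell out which opcode each site passes; the pairing below is inferred from the list of interception points above, not taken from the article:

    /* inferred pairing of tracepoint sites and opcodes */
    fslog(mapping, index, page, FSLOG_READ);   /* in do_generic_mapping_read() */
    fslog(mapping, index, page, FSLOG_PFAULT); /* in filemap_nopage() */
    fslog(mapping, index, page, FSLOG_WRITE);  /* in __grab_cache_page() */
    fslog(mapping, index, page, FSLOG_PUNCH);  /* in truncate_inode_pages_range() */
    fslog(mapping, index, page, FSLOG_RA);     /* in __do_page_cache_readahead() */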
Each event is logged as a fixed-size record:

    struct fslog_record {
            __u32 fr_no;        /* monotonically increasing event counter,
                                   used to detect lost events */
            __u32 fr_time;      /* event time in microseconds */
            __u32 fr_dev;       /* major:minor of the device where the
                                   file is located */
            __u32 fr_ino;       /* file inode number */
            __u32 fr_gen;       /* file inode generation */
            __u32 fr_index;     /* logical index of the accessed page */
            __u16 fr_pid;       /* pid of the process */
            __u8  fr_type;      /* access type, from enum fslog_rec_type */
            __u8  fr_bits;      /* miscellaneous page bits */
            __u32 fr_pad;       /* alignment padding */
            char  fr_comm[16];  /* process name */
            char  fr_name[16];  /* file name */
    };
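As an illustration of how fr_no can be used to detect lost events, here is a small user-space sketch that reads a dump of such records and reports gaps in the counter. The default file name "fslog.dump" and the assumption that records reach user space verbatim (e.g. through a relay channel) are mine; the article does not describe the delivery mechanism:

    #include <stdio.h>
    #include <stdint.h>

    /* Layout must match the kernel's struct fslog_record
     * (64 bytes, no implicit padding on common ABIs). */
    struct fslog_record {
            uint32_t fr_no;
            uint32_t fr_time;
            uint32_t fr_dev;
            uint32_t fr_ino;
            uint32_t fr_gen;
            uint32_t fr_index;
            uint16_t fr_pid;
            uint8_t  fr_type;
            uint8_t  fr_bits;
            uint32_t fr_pad;
            char     fr_comm[16];
            char     fr_name[16];
    };

    int main(int argc, char **argv)
    {
            FILE *in = fopen(argc > 1 ? argv[1] : "fslog.dump", "rb");
            struct fslog_record rec;
            uint32_t next = 0;
            int first = 1;

            if (in == NULL) {
                    perror("fopen");
                    return 1;
            }
            while (fread(&rec, sizeof rec, 1, in) == 1) {
                    /* fr_no increases by one per event; a gap means the
                     * kernel-side buffer overflowed and events were lost. */
                    if (!first && rec.fr_no != next)
                            fprintf(stderr, "lost %u event(s) before #%u\n",
                                    rec.fr_no - next, rec.fr_no);
                    next = rec.fr_no + 1;
                    first = 0;
                    printf("%10u us ino %u page %u type %u comm %.16s\n",
                           rec.fr_time, rec.fr_ino, rec.fr_index,
                           (unsigned)rec.fr_type, rec.fr_comm);
            }
            fclose(in);
            return 0;
    }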