2006-10-30

File system replacement algorithms (cont.)

Previous item.

3. User level tools.

The replacement algorithm simulator is in replacement.c. It consists of a conventional main loop that reads preprocessed trace records and feeds them to the selected replacement algorithm. Replacement algorithms (struct repalg) are executed in the following standard environment (a sketch of these data structures follows the list):

  • Primary storage (memory), which is an array (mm.m_frames[]) of a number (given through a command line option) of equal-sized frames (struct frame). Some frames are occupied by pages, others are free. A replacement algorithm may maintain private information for each frame.
  • Secondary storage, which is an array (mm.m_vpages[]) of a given number of pages (struct vpage). This can be thought of as the set of all distinct pages ever accessed by the trace being processed. Each page, in addition to its index in this array, is uniquely identified by a file and a logical offset within that file. At any given moment a page is either resident (i.e., loaded into some frame in memory) or non-resident. A replacement algorithm may maintain private information for each page.
  • An array (mm.m_objects[]) of files (struct object). Each file maintains a list of all of its pages (not frames!). This is needed for efficient emulation of truncate; see below.
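
For orientation, here is a minimal C sketch of how this environment could be laid out. Only the top-level names (mm.m_frames[], mm.m_vpages[], mm.m_objects[], struct frame, struct vpage, struct object, struct repalg, m_punch) come from the description above; every individual field is an illustrative assumption, not a copy of the actual definitions in replacement.c.

/*
 * Illustrative sketch only: the field layout below is assumed, not taken
 * from replacement.c.
 */
#include <stdint.h>

struct frame {
        struct vpage *f_page;    /* resident page, or NULL if the frame is free */
        void         *f_private; /* per-frame data of the replacement algorithm */
};

struct vpage {
        uint32_t      v_object;  /* index of the owning file in mm.m_objects[] */
        uint64_t      v_index;   /* logical (page) offset within that file */
        struct frame *v_frame;   /* frame it resides in, or NULL if non-resident */
        struct vpage *v_next;    /* next page of the same file (used by truncate) */
        void         *v_private; /* per-page data of the replacement algorithm */
};

struct object {
        struct vpage *o_pages;   /* list of all pages (not frames!) of this file */
};

struct mm {
        struct frame  *m_frames;   /* primary storage: array of frames */
        uint32_t       m_nrframes;
        struct vpage  *m_vpages;   /* secondary storage: all distinct pages */
        uint32_t       m_nrvpages;
        struct object *m_objects;  /* all distinct files */
        uint32_t       m_nrobjects;
};

/* entry points through which an algorithm plugs into the main loop */
struct repalg {
        const char *m_name;
        void (*m_access)(struct mm *mm, struct vpage *page); /* page referenced */
        void (*m_punch)(struct mm *mm, struct vpage *page);  /* page truncated */
};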

These data structures simplify the implementation of replacement algorithms, but do not directly match the information available in the trace. This gap is filled by the preprocessing script fslog.awk, which reads in the whole trace and generates unique numeric identifiers for pages and files (these identifiers are used as array indices by the replacement emulator). The script also does some additional book-keeping, collects a few statistical counters, and does rudimentary sanity checking of the trace. The key point is to reliably tell whether pages belong to the same or to different files. The obvious solution of using the (dev, ino) pair as a unique file identifier doesn't work, because certain popular file systems tend to reuse inode numbers immediately after file deletion, leading to false positives. This is why each trace record contains additional ->fr_gen and ->fr_name fields: the tuple (dev, ino, generation, name) is a reliable file identity.
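
To illustrate the identity scheme, a reliable file identity built from that tuple could look roughly like this; ->fr_gen and ->fr_name are field names from the trace record format mentioned above, but the struct layout and the comparison helper are hypothetical.

/* Hypothetical sketch of a reliable file identity. Only fr_gen and fr_name
 * correspond to names in the actual trace record format. */
#include <stdint.h>
#include <string.h>

struct file_identity {
        uint32_t dev;        /* device number */
        uint64_t ino;        /* inode number; reused by some file systems */
        uint32_t gen;        /* inode generation (->fr_gen) */
        char     name[256];  /* file name (->fr_name) */
};

/* (dev, ino) alone can collide after delete-and-create; the generation and
 * the name disambiguate reused inode numbers. */
static int same_file(const struct file_identity *a, const struct file_identity *b)
{
        return a->dev == b->dev && a->ino == b->ino &&
               a->gen == b->gen && strcmp(a->name, b->name) == 0;
}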

Another complication arises with truncate handling. In all other cases, trace records produced by the kernel map one-to-one onto page accesses handled by the replacement algorithm. But in the case of truncate, the kernel walks a list of resident pages only. Producing trace records for each walked page isn't what we want, because this list depends on the replacement policy implemented by the kernel. Instead, the tracing patch generates only one record for the whole truncate operation, with the ->fr_index field indicating the first page to be truncated. For such a record, replacement.c walks the list of all pages associated with the corresponding file and calls the page-based truncate entry point (repalg.m_punch()) of the replacement algorithm. It is to implement this walk efficiently that files exist as first-class entities in replacement.c.
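
Using the hypothetical structures from the sketch above, the handling of such a truncate record could look roughly like this (the real code in replacement.c is most likely organized differently):

/* Hypothetical: punch out all pages of the file starting from the page
 * indicated by ->fr_index, by walking the file's page list. */
static void handle_truncate(struct mm *mm, const struct repalg *alg,
                            uint32_t objidx, uint64_t first_index)
{
        struct object *obj = &mm->m_objects[objidx];

        for (struct vpage *p = obj->o_pages; p != NULL; p = p->v_next) {
                if (p->v_index >= first_index)
                        alg->m_punch(mm, p); /* page-based truncate entry point */
        }
}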

The only remaining bit is converting binary trace records into a form suitable for fslog.awk. This is done by fslog.c, based on the exported-global.c sample program from the relayfs package.
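
In essence the conversion is „read a fixed-size binary record, print its fields as text“. A minimal sketch of that idea follows; the record layout and field set here are assumptions for illustration, and the real fslog.c obtains its data through relayfs rather than from stdin.

/* Assumed record layout; the real format is defined by the tracing patch. */
#include <stdio.h>
#include <stdint.h>

struct fs_record {
        uint32_t fr_type;     /* kind of access (fault, read, truncate, ...) */
        uint32_t fr_dev;
        uint64_t fr_ino;
        uint32_t fr_gen;      /* inode generation */
        uint64_t fr_index;    /* page index (first truncated page for truncate) */
        char     fr_name[64]; /* file name */
};

int main(void)
{
        struct fs_record r;

        while (fread(&r, sizeof r, 1, stdin) == 1)
                printf("%u %u %llu %u %llu %.*s\n",
                       r.fr_type, r.fr_dev,
                       (unsigned long long)r.fr_ino, r.fr_gen,
                       (unsigned long long)r.fr_index,
                       (int)sizeof r.fr_name, r.fr_name);
        return 0;
}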

4. Results.

4.0. Experiment description.

The analyzed trace was obtained by running

$ make defconfig && make -j4 bzImage modules

on a single-processor machine (other machine characteristics such as memory size and available secondary storage do not matter, as was explained above) in a 2.6.18-rc2 kernel source directory. In addition, the trace contains records generated during system boot-up. To avoid cluttering the trace with records generated by fslog.c's own writes, its output was transferred to another machine by netcat(1). The total amount of output was 2.7GB (not provided due to size).

fslog.awk processed 25927993 records, accessing a total of 2356222 pages in 86839 distinct files. The resulting trace, suitable as input to replacement.c, is here (79MB); file identities are here.

The following replacement algorithms were tested:

  • RANDOM
  • LRU (least recently used)
  • FIFO (first in, first out)
  • FIFO2 (FIFO, second chance)
  • SFIFO (segmented FIFO, by Rollins Turner and Henry Levy)
  • 2Q (by Theodore Johnson and Dennis Shasha)
  • CAR (by Sorav Bansal and Dharmendra S. Modha)
  • ARC (by Nimrod Megiddo and Dharmendra S. Modha)
  • LINUX (emulation of mm/vmscan.c)
  • WORST (theoretically worst possible replacement)
  • OPT (theoretically optimal clairvoyant algorithm by Belady)

RANDOM, FIFO, FIFO2, LRU and segmented FIFO are „classical“ replacement algorithms invented and extensively studied in the 1960s. 2Q, CAR and ARC are „new“ algorithms developed recently, as interest in replacement policies reappeared. The OPT and WORST algorithms cannot be implemented in practice, because they are both „clairvoyant“: they look ahead through the trace. The LINUX algorithm tries to emulate mm/vmscan.c with the following limitations and simplifications:

  • no mapped or anonymous pages;
  • no kswapd, only direct reclaim;
  • all allocations are with GFP_FS;
  • no non-fs caches;
  • no low_page reserves;
  • single zone.

These simplifications are reasonable for comparing file-system-dominated, non-degenerate work-loads.

2Q and SFIFO have tunable parameters (see the papers referenced above for a description of these parameters). 2Q was tested with kin of 25 and kout ranging from 10 to 50 in steps of 10. A kout of 10 proved to produce the best results, and this value was used in the comparison with other algorithms. SFIFO was tested with a „tail LRU percentage“ ranging from 0 (pure FIFO) to 20, with very small observed effect on the results. All other algorithms are self-tuning.

All algorithms were run against the same trace and tested with the same set of memory sizes (varying from 5 to 4M frames, which corresponds to a range of 20KB to 16GB of memory with 4K frames).

Results, together with the shell commands used to produce them, are available here. A couple of notes:

  • OPT algorithm is extremely slow (as expected);
  • LINUX algorithm fails for very small memories (less than 64 frames).

4.1. Graphs: overall.

Below are a few graphs created by a gnuplot script (input files for gnuplot are generated by process.hits, and PNG images are generated from postscript by generate-png). Everywhere the X-axis (horizontal) is memory size (logarithmic scale), and the Y-axis (vertical) is hit ratio in percent, unless specified otherwise.

First, all algorithms on a single graph:

One immediate observation is that for memories larger than 28MB the difference between algorithms is very small. Another notable thing is that in the 56KB–6MB range the LINUX algorithm is worse than everything except WORST. Remarkably, it's even worse than RANDOM, which surely is quite a feat of engineering.

The next graph shows how algorithms improve upon WORST. Here the Y-axis is the difference between the hit ratio of an algorithm and that of WORST for the corresponding memory size.

As it is by definition impossible not to beat WORST, improvement (or lack thereof) over RANDOM is also a useful measure:

The next graph shows how far algorithms fall behind OPT. In this graph the Y-axis is given by 100.0 - hit_ratio(OPT) + hit_ratio(ALG), that is, OPT corresponds to 100.0.

The revealed behavior is quite interesting. For very small memory sizes, all algorithms perform more or less equally badly, but as memory size increases, OPT quickly outperforms everything else. This corresponds to the initial dip in the 32KB–512KB range. After that point, non-OPT algorithms really start „working“ and gradually approach OPT, reaching (in concert!) a local optimum somewhere in the 32MB–64MB range. But as memory size increases even more, all of them start lagging behind OPT more and more again. In the 512MB–16GB range all algorithms converge, because this is where the total working set of the work-load in question lies, and once the working set fits into memory it is hard to behave too badly.

The second local minimum is rather strange. As it happens for all non-OPT algorithms in a regular manner, it's logical to assume that it is the result of some property of the work-load in question. One might hypothesize that in this range of memory sizes the OPT algorithm is able to exploit some long-term patterns in file system accesses during the kernel build.

4.2. Graphs: tunable parameters.

Next, let's look at how tunable parameters affect 2Q and SFIFO performance:

Labels are in 2Q:kin:kout format. Obviously, the smaller the kout the better, and 2Q indeed beats LRU most of the time, in accordance with the claims of its authors.

The same for SFIFO.

Segmented FIFO keeps frames in two lists: head and tail. The head list (containing „hot“ pages) is maintained in FIFO discipline, and the tail list in LRU discipline. The size of the tail list is set to a fixed percentage of total memory (when this percentage is 0, SFIFO degenerates into FIFO; when it is 100, SFIFO becomes LRU). The logic behind this is that by making the tail LRU list sufficiently small, it becomes practically feasible to implement LRU on it by un-mapping pages on entry to the tail list. Our results show that for all reasonable sizes of the tail list the difference in hit ratio is negligible.
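
To make the head/tail split concrete, here is a hedged sketch of SFIFO bookkeeping on a page reference; the names, data layout and toy driver are illustrative, not taken from replacement.c.

/* Hedged sketch of segmented FIFO: a FIFO head list plus a small LRU tail
 * list, with promotion back to the head on a tail hit.
 * Assumes TAIL_PCT < 100. */
#include <stdio.h>
#include <string.h>

#define NFRAMES   8
#define TAIL_PCT 25                /* tail list size as a percentage of memory */

static int head[NFRAMES], nhead;   /* FIFO: head[0] is the oldest page */
static int tail[NFRAMES], ntail;   /* LRU:  tail[0] is the least recently used */
static int head_cap, tail_cap;

static int find(const int *list, int n, int page)
{
        for (int i = 0; i < n; i++)
                if (list[i] == page)
                        return i;
        return -1;
}

static void del(int *list, int *n, int pos)
{
        memmove(&list[pos], &list[pos + 1], (size_t)(*n - pos - 1) * sizeof(int));
        (*n)--;
}

/* Reference a page; returns 1 on hit, 0 on miss. */
static int sfifo_access(int page)
{
        int pos;

        if (find(head, nhead, page) >= 0)
                return 1;                     /* hit in the FIFO head: no movement */
        pos = find(tail, ntail, page);
        if (pos >= 0)
                del(tail, &ntail, pos);       /* hit in the LRU tail: promote */
        if (nhead == head_cap) {              /* make room at the FIFO end */
                int demoted = head[0];        /* oldest head page leaves the head */
                del(head, &nhead, 0);
                if (tail_cap > 0) {
                        if (ntail == tail_cap)
                                del(tail, &ntail, 0); /* evict the LRU tail page */
                        tail[ntail++] = demoted;      /* most-recently-demoted end */
                } /* with tail_cap == 0 (pure FIFO) the demoted page is evicted */
        }
        head[nhead++] = page;
        return pos >= 0;                      /* a tail hit still counts as a hit */
}

int main(void)
{
        int trace[] = { 1, 2, 3, 4, 5, 1, 2, 6, 1, 3 };
        int hits = 0;

        tail_cap = NFRAMES * TAIL_PCT / 100;
        head_cap = NFRAMES - tail_cap;
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
                hits += sfifo_access(trace[i]);
        printf("hits: %d out of %zu references\n",
               hits, sizeof trace / sizeof trace[0]);
        return 0;
}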

4.3. Graphs: classical algorithms.

Let's look at some data of historical interest: how „classical“ replacement algorithms (LRU, FIFO, and FIFO2) behave relative to OPT, WORST, and RANDOM (LINUX is shown here too, because it basically falls into the same class of algorithms).

Classical algorithms versus WORST:

And versus OPT:

It's interesting to look at how these algorithms behave in the memory ranges typical of the time when they were actively researched:

2MB—32MB range:

At some points the difference in hit ratios can reach 5%, but as we move to larger memories (16MB–4GB)...

the difference shrinks to the ~1% range and remains there (256MB–4GB):

4.4. Graphs: LINUX.

Finally, let's compare how LINUX fares relative to the other algorithms. We shall concentrate on data for large memories (the 64MB–4GB range), because the behavior of the LINUX algorithm in low-memory setups is clearly visible in the overall graph: it loses to everyone.

The natural question arises: why is the LINUX algorithm so... not brilliant (although keep in mind that all of this happens within a few percent of OPT, so not too much is at stake)? As was noted at the beginning, unified FS/VM caching forces the replacement algorithm to use only the bare minimum of available per-page information. In particular, it so happens that for clean file system pages LINUX is equivalent to FIFO (a known bad algorithm), and for dirty pages to FIFO2.

5. Conclusion.

Assorted miscellaneous notes, actually.

One thing obvious from the traces, but not mentioned above, is that the larger portion of all trace records is for page faults. The absolute majority of them are compulsory misses that happen because every new process maps all standard libraries into its address space. Judging by the traces, it seems that optimizations in this area (e.g., pre-mapping all already-resident pages of a given library) would be clearly advantageous.

Handling of file system pages by the LINUX algorithm can definitely be improved.

6. Future directions.

More algorithms are to be implemented in replacement.c: LFU, LRU/k, LIRS, etc. Also, more statistics (working set size, page fault rate, etc.) can be collected.

2 comments:

  1. I wonder how well a Most Recently Used (MRU) algorithm would work? This might seem like a stupid idea until you consider that the most recently used pages are the ones that will still be in the disk cache, and so will be the least expensive to reload. Your simulation wouldn't show any gains because it doesn't take into account the relative costs of loading data which is in the disk's cache vs. not in the disk's cache.

    Nice work.

  2. A friend of mine co-owns a heavily loaded shell/web/mail hosting machine. It might be quite interesting to gather some data when it's under heavy load. That happens quite often, and the poor machine is swapping like mad. But tracing a production machine is a difficult business...

    I just started to look at ZFS which uses ARC, and I'm surprised by the drop in ARC results at 2GB.

    BTW, will there ever be more reiser4 articles?
