a very occasional diary @ Nikita Danilov: opahead patch

Here is a 2.6.15-rc1 version of opahead patch that was sleeping in my patch-scripts series file for some time.

opahead stands for operation-ahead by analogy with read-ahead. An inspiration for this patch was a USENIX paper that discusses a smarter than usual read-ahead algorithm. Kroeger and Long implemented a sub-system that kept track of read accesses to files and, according to certain stochastic model uses collected information to try to predict future accesses.

opahead uses much simpler model known as Shannon Oracle. (Unfortunately, I failed to find any online reference. I read a description of this algorithm many years ago in a book I no longer have.) It basically maintains enough data to answer following question: for a given sequence of N reads, what reads have been seen to follow this sequence in the past and how many times? Read that followed the sequence most, is assumed to be most likely to happen, and appropriate read-ahead is issued.

If N equals 1 we get familiar Markov Chain model. This is why the patch is sometimes called Markov read-ahead. But Markov chains have one fundamental disadvantage that makes them unsuitable for the read-ahead implementation: to make efficient read-ahead it's necessary to saturate IO layer and for this one has to look more than one event ahead in the future. That is, we have to predict next read, then assume that it did happen, and made new prediction based on that assumption. Obviously, the reliability of consecutive predictions deteriorates quickly, so it's very important to make initial prediction as reliable as possible. To this end, predictions are made on the basis of N previous reads, rather than on the basis of single one as in Markov model.

Another observation is that this algorithm doesn't depend on the nature of the events that are tracked and predicted. It can be generalized to the module that tracks and predicts abstract events that can be compared for equality (and it is that generalized form that is implemented in the patch).

That module can be used to do read-ahead at the different layers: VFS, file system, VM (page cache read-ahead), block layer (physical read-ahead); or it can be used for things completely unrelated to the read-ahead like, god forbid... predicting scheduling activity..

This patch was used to drive read-ahead at the file system layer (by plugging calls to opahead_{happens,predict}() into ext2_file_read()), but even though speed up of kernel compilation was measurable, this turned out to be dead-end, because to read a page from file one has to bring inode it memory first, but currently there is no an API for asynchronous inode loading.

Below is a top-level comment from opahead.h header with the general description of API (at least my effort of typing it in wouldn't be wasted).

opahead API

/*
 * This file is an interface to the op-ahead: a generic prediction engine.
 *
 * op-ahead implementation is in lib/opahead.c
 *
 * General description.
 *
 * op-ahead makes predictions in a certain domain consisting of events. Events
 * happen sequentially. op-head clients notify engine about events that
 * happen, and engine keeps track of event patterns. Engine then can be asked
 * a question: "Given a sequence of N last events, what is the next event to
 * happen most likely?"
 *
 * To answer this question, op-ahead keeps track of all sequences of N events
 * ever observed in the domain. N is called domain depth, and said sequences
 * are called paths. For each path op-ahead maintains a list of events that
 * were observed to follow this path (that is, to happen immediately after
 * events in the path happened in order). These following events (called
 * guesses) are weighted: the weight is a guess is a number of times it was
 * observed to follow this path.
 *
 * op-ahead operation is quite simple: when event E happens, op-ahead finds a
 * path corresponding to last happened events, and increases the weight of
 * guess corresponding to E (creating new guess if necessary).
 *
 * When asked to predict what event will follow given N events, op-ahead finds
 * path corresponding to these events and returns the guess with the largest
 * weight. If there are multiple guesses with the equal weight---random one is
 * selected.
 *
 * When domain depth is 1, paths degenerate into events and op-ahead into
 * modeling Markov chains.
 *
 * op-ahead algorithm is known as "Shannon Oracle". It is reported that in the
 * domain of two events "tail" and "head", and with the depth of 5, Shannon
 * Oracle robustly beats human player in the game of head and tails (that is,
 * after some period of learning it's able to predict next move with the
 * probability higher than 0.5). You are hereby requested to spend at least 10
 * percent of proceeds obtained from playing head and tails with the help of
 * this kernel code to buy hardware and to provide it to Linux kernel
 * developers.
 *
 * Usage.
 *
 * To use op-ahead client creates an instance of struct opa_domain with
 * methods appropriate for its events.
 *
 * Predictions are made in the "context" which is an array of domain->depth
 * pointers to events. This context is created by client (initialized by NULL
 * pointers) and is passed to opa_{predict,happen}(). It is used to record
 * latest events. When context is no longer needed, client calls
 * opa_context_put().
 *
 * When new event happens, client calls opa_happens() to record it. To predict
 * future events opa_predict() is used.
 *
 * Related work.
 *
 * In http://www.usenix.org/events/usenix01/kroeger.html a similar (albeit
 * more mathematically sophisticated) model is used to predict future file
 * accesses and to drive inter-file read-ahead.
 *
 * TODO:
 *
 * Helper functions for reference-counted events.
 *
 */

To be Continued?

--Tags:-[Kernel]-[Filesystems]-----------

1 comment:

Anonymous24 November, 2005 00:24
What about trying some database benchmarks to test? As the IO done by a RDBMS can be all over the place and perhaps benefit more in a measurable sense from better IO prediction.

(assuming the test dataset is greater than available memory). Check out the mysql-bench suite, it has some interesting things that could be worth looking at.

2005-11-23

opahead patch

1 comment: